A distributed database is a distributed system designed to provide read/write access to data, using the relational or another data model.
We will examine 3 topics in distributed databases:
Replication: Keep a copy of the same data on several different nodes.
Partitioning: Split the database into smaller subsets and distribute the partitions to different nodes.
Transactions: Units of work that group several reads and writes to be performed together in the database.
Q: Usually, distributed databases employ both replication and partitioning at the same time. Why?
With replication, we keep identical copies of the data on different nodes.
D: Why do we need to replicate data?
We will discuss replication design choices for:
Replication architecture: Leader-based, multi-leader, leaderless replication
Implementation of replication logs: Statement-based, write-ahead log (WAL) based, logical (row-based)
Distribution of updates: Synchronous, asynchronous
How is a replicated data set accessed and updated?
In a replicated system, we have two node roles: leaders (also called masters), which accept writes, and followers (also called slaves or replicas), which receive updates from the leaders and serve reads.
Depending on how replication is configured, we can see the following architectures:
The leader ships all write requests to the followers, e.g., all `INSERT` or `UPDATE` queries. However, this is problematic:
UPDATE foo
SET updated_at=NOW()
WHERE id = 10
D: What is the issue with the code above in a replicated context?
Statement-based replication is non-deterministic and has been mostly abandoned.
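To see why non-determinism breaks statement-based replication, here is a toy sketch (the `apply_statement` helper is made up for illustration, not a database API): each replica re-evaluates `NOW()` at the moment it applies the statement, so the replicas end up storing different values.

```python
import time

def apply_statement(replica, stmt_fn):
    # Statement-based replication: every replica re-executes the statement,
    # so non-deterministic functions such as NOW() or RANDOM() are evaluated
    # independently on each node.
    replica["updated_at"] = stmt_fn()

leader, follower = {}, {}
now_stmt = time.time                 # stands in for SQL's NOW()

apply_statement(leader, now_stmt)    # the leader executes the UPDATE
time.sleep(0.5)                      # replication lag
apply_statement(follower, now_stmt)  # the follower re-executes it later

print(leader["updated_at"] == follower["updated_at"])  # False: the replicas diverge
```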
Typically, databases write their data to data structures, such as \(B^+\)-trees. However, before actually modifying the data structure, they write the intended change to an append-only write-ahead log (WAL).
With WAL-based replication, the leader writes all changes to its WAL and also ships the WAL entries to the followers. The followers apply the WAL entries to reach a state consistent with the leader.
The main problem with WAL-based replication is that it is bound to the implementation of the underlying data structure; if this changes in the leader, the followers stop working.
Postgres and Oracle use WAL-based replication.
The database generates a stream of logical updates for each update to the WAL. Logical updates can be: for an inserted row, the new values of all its columns; for a deleted row, enough information to uniquely identify it; for an updated row, its identifier together with the new column values.
MongoDB and MySQL use this replication mechanism.
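As an illustration only (not the exact on-disk format of any particular system), a logical log entry for the earlier `UPDATE` might look like the following; note that the non-deterministic `NOW()` has already been evaluated on the leader, so every follower applies exactly the same values.

```python
# A hypothetical logical (row-based) log entry for:
#   UPDATE foo SET updated_at = NOW() WHERE id = 10
log_entry = {
    "lsn": 30591478,                                      # position in the replication log
    "table": "foo",
    "op": "update",
    "key": {"id": 10},                                    # identifies the affected row
    "new_values": {"updated_at": "2021-03-01 12:00:00"},  # values as evaluated on the leader
}
```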
When a new follower is set up from a snapshot of the leader, the snapshot is associated with an id that corresponds to the state of the leader's replication log at the time the snapshot was created; the follower uses this id to request all subsequent changes.
Leader
> SHOW MASTER STATUS;
+--------------------+----------+
| File               | Position |
+--------------------+----------+
| mariadb-bin.004252 | 30591477 |
+--------------------+----------+
1 row in set (0.00 sec)
Follower
> CHANGE MASTER TO
    MASTER_HOST='10.0.0.7',
    MASTER_USER='replicator',
    MASTER_PORT=3306,
    MASTER_CONNECT_RETRY=20,
    MASTER_LOG_FILE='mariadb-bin.452',
    MASTER_LOG_POS=30591477;
> SHOW SLAVE STATUS\G
        Master_Host: 10.0.0.7
        Master_User: replicator
        Master_Port: 3306
    Master_Log_File: mariadb-bin.452
Read_Master_Log_Pos: 34791477
     Relay_Log_File: relay-bin.000032
      Relay_Log_Pos: 1332
   Slave_IO_Running: Yes
  Slave_SQL_Running: Yes
When a write occurs on a node, it is distributed to the other nodes in either of the two following modes:
Synchronously: Writes need to be confirmed by a configurable number of followers before the leader reports success.
Asynchronously: The leader reports success immediately after a write was committed to its own disk; followers apply the changes at their own pace.
D: What are the advantages/disadvantages of each replication type?
The leader waits until the followers have received the update before reporting success.
It’s impractical for all followers to be synchronous.
The leader reports success and asynchronously updates the followers.
Asynchronous replication may cause clients to observe anomalous scenarios such as:
Read-after-write: Clients may not see their own writes, i.e., when they connect to a replica which does not have the update.
(non-)Monotonic reads: When reading from multiple replicas concurrently, a stale replica might not return records that the client already read from an up-to-date one.
Causality violations: Updates might be visible in different orders at different replicas. This allows for violations of the happened-before property.
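The read-after-write anomaly, sketched under simplifying assumptions (an in-memory leader and follower, and a replication queue that is applied later; all names are illustrative):

```python
leader, follower = {}, {}
replication_queue = []                      # changes not yet applied on the follower

def write(key, value):
    leader[key] = value                     # the leader commits locally...
    replication_queue.append((key, value))  # ...and ships the change asynchronously
    return "ok"                             # success reported before the follower catches up

def read(replica, key):
    return replica.get(key)

write("profile:42", "new bio")
print(read(follower, "profile:42"))         # None: the client does not see its own write

for key, value in replication_queue:        # eventually the follower applies the changes
    follower[key] = value
print(read(follower, "profile:42"))         # 'new bio': the replicas have converged
```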
Multiple leader nodes accept writes; replication to the followers works in a similar way to the single-leader case.
The biggest problem with multi-leader replication is write conflicts.
To demonstrate write conflicts, let's think in terms of `git`:
# Clock
# User 1
git clone git://.... # t+1
git add foo.c # t+2
##hack hack hack
git commit -a -m 'Hacked v1' # t+3
git push # t+5
# User 2
git clone git://.... # t+2
git add foo.c # t+5
##hack hack hack
git commit -m 'Hacked new file' # t+6
git push # fails # t+7
git pull # CONFLICT # t+7
If we replace each user with a leader node, we have exactly the same problem.
One leader per session: If session writes do not interfere (e.g., data are only stored per user), this will avoid issues altogether.
Converge to a consistent state: Apply conflict resolution policies, e.g., last-write-wins (ordering writes by timestamp, at the risk of losing data), or report the conflict to the application and let the user resolve it (as `git` or Google Docs do).

In contrast to leader-based replication schemes, in leaderless replication any replica can accept read/write queries from the clients.
Writes are successful if acknowledged by at least W replicas, and reads are successful if answered by at least R replicas. With N replicas in total, choosing W + R > N guarantees that the read and write quorums overlap, so a read sees the most recent successful write.
Example scenario with N=3, W=2, R=2 (sketched below):
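A toy sketch of quorum reads and writes under these assumptions (in-memory replicas storing versioned values; not how any specific database implements it):

```python
N, W, R = 3, 2, 2                      # W + R > N: read and write quorums overlap

replicas = [{} for _ in range(N)]      # each replica maps key -> (version, value)

def quorum_write(key, value, version, available):
    # Write to all reachable replicas; succeed only if at least W acknowledge.
    acks = 0
    for i in available:
        replicas[i][key] = (version, value)
        acks += 1
    return acks >= W

def quorum_read(key, available):
    # Read from R replicas and keep the value with the highest version.
    answers = [replicas[i].get(key, (0, None)) for i in available[:R]]
    if len(answers) < R:
        raise RuntimeError("not enough replicas for a read quorum")
    return max(answers, key=lambda t: t[0])[1]

# Replica 2 is down during the write, but the write still reaches W=2 replicas.
assert quorum_write("x", "v1", version=1, available=[0, 1])
# A later read that contacts replicas 1 and 2 still sees the latest value,
# because any R=2 replicas must include at least one that took the write.
print(quorum_read("x", available=[1, 2]))   # 'v1'
```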
Replication of a total order of operations is fundamentally a consensus problem.
Q: Can we design a system that is available (i.e., it can serve client requests) even when some nodes in the system are not available or unreachable?
Synchronous replication: Up-to-date replicas (consistent), but the system is not available if any follower fails.
Asynchronous replication: Replicas may not have up-to-date data (inconsistent), but the system remains available in the presence of failures.
CAP theorem: In a replicated system, it is impossible to get all three of:
Consistency
Availability
Partition tolerance
The trade-off is not “Availability vs Consistency” but “Availability vs Strong Consistency”
A consistency model is a contract between the programmer and the replicated system: it specifies the consistency between replicas and what can be observed as possible results of operations.
It answers questions such as:
A variety of consistency models has been proposed and implemented.
In this course, we will discuss the following consistency models:
Weak consistency models trade-off strong consistency for better performance.
All updates are eventually delivered to all replicas.
All replicas reach a consistent state if no more user updates arrive
Examples:
D: Which synchronization architectures can be used to implement eventual consistency?
Causally related operations are delivered to other replicas in the correct order.
Causal consistency does not require synchronous coordination across nodes and still provides a sensible view of operations.
D: For what kind of applications would you use a causally consistent system?
Client-centric consistency models provide guarantees about the ordering of operations only for a single client process:
Monotonic reads: If a process reads the value of x, any successive read operation on x by that process will always return that same value or a more recent value.
Monotonic writes: If a process writes a value to x, the replica on which a successive operation is performed reflects the effect of that previous write operation by the same process.
Read your writes: The effect of a write operation by a process on x will always be seen by a successive read operation on x by the same process.
Writes follow reads: If a process writes a value to x following a previous read operation on x by the same process, the write is guaranteed to take place on the same or a more recent value of x than the one that was read.
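A sketch of how a client library might enforce monotonic reads and read-your-writes by tracking the highest replication-log position it has observed; the `Replica` class and its lag are assumptions made for illustration.

```python
class Replica:
    # A replica that has applied the replication log up to applied_pos.
    def __init__(self, applied_pos, data):
        self.applied_pos = applied_pos
        self.data = data

class SessionClient:
    # Remembers the highest log position this session has seen and refuses to
    # read from replicas that lag behind it (monotonic reads and read-your-writes,
    # at the cost of availability).
    def __init__(self):
        self.last_seen_pos = 0

    def read(self, key, replicas):
        for r in replicas:
            if r.applied_pos >= self.last_seen_pos:
                self.last_seen_pos = r.applied_pos
                return r.data.get(key)
        raise RuntimeError("no sufficiently up-to-date replica available")

fresh = Replica(applied_pos=10, data={"x": "new"})
stale = Replica(applied_pos=7, data={"x": "old"})

client = SessionClient()
print(client.read("x", [fresh]))         # 'new'; the session position is now 10
print(client.read("x", [stale, fresh]))  # the stale replica is skipped, still 'new'
```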
The operations appear to take place in some total order that is consistent with the order of operations on each replica.
Linearizability: Sequential consistency + the total order of operations conforms to the real-time ordering:
Q: Is the execution above linearizable? (Adapted from the examples from Kleppman[1])
A: Yes, there is a total order of operations that conforms to their real-time order and satisfies the specification of the data variable x.
Q: Is the execution above linearizable?
A: No; in a linearizable system, it is not possible for the last read operation at R1 to return the value 0.
Partitioning, also known as sharding, breaks the dataset into fractions and distributes them to different hosts.
The main reason is scalability:
Partitioning is always combined with replication. The reason is that with partitioning alone, a node failure would result in irreversible data loss.
The following 3 strategies are commonly used: partitioning by key range, partitioning by the hash of a key, and custom partitioning that exploits properties of the data, e.g. pre-hashed keys (such as `git` commits) or location-specific data (e.g. all records from Europe).

On partitioned datasets, clients need to be aware of the partitioning scheme in order to direct queries and writes to the appropriate nodes.
To hide partitioning details from the client, most partitioned systems feature a query router component sitting between the client and the partitions.
The query router knows the employed partitioning scheme and directs requests to the appropriate partitions.
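A minimal sketch of a query router that uses hash partitioning to pick the node responsible for a key (the node addresses and the partitioning function are illustrative assumptions):

```python
import hashlib

nodes = ["10.0.0.1:5432", "10.0.0.2:5432", "10.0.0.3:5432"]   # hypothetical partition hosts

def partition_for(key: str, num_partitions: int) -> int:
    # Hash partitioning: a stable hash of the key, modulo the number of partitions.
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

def route(key: str) -> str:
    # The router hides the partitioning scheme from the client.
    return nodes[partition_for(key, len(nodes))]

print(route("user:1042"))   # always the same node for the same key
```

Note that a plain hash-mod-N scheme makes rebalancing expensive when the number of nodes changes; real systems typically use many more logical partitions than nodes, or consistent hashing, for this reason.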
Optional reading (not part of the exam material)
What could potentially go wrong?
What is the state of the data after this scenario?
As programmers, we are mostly concerned about the code’s happy path. Systems use transactions to guard against catastrophic scenarios.
Transactions[2] are blocks of operations (reads and writes) that either succeed or fail as a whole.
Transactions are defined in the context of an application that wants to modify some data and a system that guards data consistency.
The concept of the transaction started with the first SQL database, System R; while the mechanics have changed, the semantics have remained stable for the last 40 years.
Transactions provide the following guarantees[3]: Atomicity, Consistency, Isolation, and Durability (ACID).
Note that the notion of consistency is different from the “consistency” of replication. In the context of ACID, consistency refers to the good state of the database where the data integrity constraints hold.
Databases try to hide concurrency issues from applications by isolating transactions from each other.
Isolation ensures that transactions are processed without interference: The goal is to prevent reads and writes of data written by incomplete or aborting transactions.
Isolation guarantees restrict how and when writes become visible. There are various degrees of isolation, i.e., of how isolated a transaction is:
High isolation: Reduces concurrency (more locks are used) and availability, but favors database consistency (a valid state).
Low isolation: Increases concurrency and availability, but allows some inconsistent data.
The highest level of isolation provides serializable transactions.
The highest level of isolation entails executing transactions as if they were serially executed. To implement serializability, databases either actually execute transactions in serial order, use pessimistic concurrency control (e.g., two-phase locking), or use optimistic concurrency control techniques such as multi-version concurrency control (MVCC).
With MVCC, the database works like Git:
Each transaction works on a consistent, private snapshot of the data (`git checkout -b A`)
Writes are recorded against that snapshot (`git commit -a`)
When the transaction commits, its changes are merged back into the current database state; conflicting concurrent changes cause the transaction to abort (`git checkout master; git merge A`)
Versions that are no longer visible to any transaction are garbage collected (`git gc`)
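A toy multi-version store in the spirit of the analogy above (heavily simplified; real MVCC implementations track per-row versions and visibility rules rather than whole-database snapshots):

```python
class MVCCStore:
    # Each committed state is an immutable version; a transaction reads the
    # snapshot that was current when it started and publishes a new version
    # on commit (first committer wins on conflicting keys).
    def __init__(self):
        self.versions = [{}]                          # version 0: empty database

    def begin(self):
        snapshot_id = len(self.versions) - 1          # like `git checkout -b A`
        return snapshot_id, dict(self.versions[snapshot_id])

    def commit(self, snapshot_id, writes):
        current = self.versions[-1]
        # Abort if another transaction has committed to the same keys meanwhile.
        for key in writes:
            if current.get(key) != self.versions[snapshot_id].get(key):
                raise RuntimeError("write conflict, transaction aborted")
        self.versions.append({**current, **writes})   # like merging the branch back

store = MVCCStore()
snap, view = store.begin()
store.commit(snap, {"x": 1})        # the first transaction commits
try:
    store.commit(snap, {"x": 2})    # a concurrent transaction on the same snapshot conflicts
except RuntimeError as e:
    print(e)
```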
While serializable isolation works, it may be slow. Consequently, databases have historically made compromises in how they implement isolation.
| Isolation level  | DR | NRR | PR |
|------------------|----|-----|----|
| Read uncommitted | X  | X   | X  |
| Read committed   |    | X   | X  |
| Repeatable read  |    |     | X  |
| Serializable     |    |     |    |

(X = the anomaly can occur; DR = dirty reads, NRR = non-repeatable reads, PR = phantom reads)
Consistency without isolation:
Two concurrent transactions increment the value of `x`. The transactions are not serialized, so the final value of `x` is incorrect.
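The lost-update problem from this example, sketched with two interleaved transactions (plain Python assignments standing in for the SQL read and write steps):

```python
x = 100            # initial value in the database

# Both transactions read x before either of them writes it back.
t1_read = x        # T1: read x -> 100
t2_read = x        # T2: read x -> 100

x = t1_read + 1    # T1: write x = 101
x = t2_read + 1    # T2: write x = 101, silently overwriting T1's update

print(x)           # 101 instead of 102: one increment was lost
```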
Isolation without consistency:
Transactions run in isolation, but T3 reads and operates on stale/incorrect data.
In a distributed database, a transaction spanning multiple nodes must either succeed on all nodes or fail (to maintain atomicity).
Transactions may also span multiple systems; for example, we may try to remove a record from a database and add it to a queue service in an atomic way.
In contrast to the general consensus problem in distributed systems, where agreement by a majority of nodes suffices, in atomic commit all nodes must agree on whether a transaction has been successfully committed.
The most common mechanism used to deal with distributed atomic commits is the two-phase commit (2PC) protocol.
In 2PC, we find the following roles: a coordinator (also called the transaction manager) and the participants (the nodes taking part in the transaction).
The 2PC protocol makes the following assumptions:
D: What problems do you see with 2PC?
2PC is a blocking atomic commitment protocol: if the coordinator fails after the participants have voted to commit, the participants cannot decide unilaterally and must wait for the coordinator to recover.
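A sketch of the two phases, assuming participants that expose hypothetical `prepare`, `commit` and `abort` operations and that never crash (handling crashes is exactly where the real protocol gets complicated):

```python
class Participant:
    # A node taking part in the distributed transaction.
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self, txn):
        # Make the transaction durable locally, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1 (voting): every participant must vote yes for the transaction to commit.
    votes = [p.prepare(txn) for p in participants]
    # Phase 2 (completion): the coordinator broadcasts the decision to everyone.
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        if decision == "commit":
            p.commit(txn)
        else:
            p.abort(txn)
    return decision

print(two_phase_commit([Participant("db"), Participant("queue")], "txn-1"))                     # commit
print(two_phase_commit([Participant("db"), Participant("queue", can_commit=False)], "txn-2"))   # abort
```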
This work is (c) 2017, 2018, 2019, 2020, 2021 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.