A distributed database is a distributed system designed to provide read/write access to data, using the relational or another data model.
We will examine 3 topics in distributed databases:
Replication: Keep a copy of the same data on several different nodes.
Partitioning: Split the database into smaller subsets and distribute the partitions to different nodes.
Transactions: Units of work that group several reads and writes to be performed together in the database.
Q: Usually, distributed databases employ both replication and partitioning at the same time. Why?
With replication, we keep identical copies of the data on different nodes.
D: Why do we need to replicate data?
We will discuss replication design choices for:
Replication architecture: Leader-based, multi-leader, leaderless replication
Implementation of replication logs: Statement-based, write-ahead log (WAL) based, logical (row-based)
Distribution of updates: Synchronous, asynchronous
How is a replicated data set accessed and updated?
In a replicated system, we have two node roles: leaders (also called masters), which accept writes, and followers (also called slaves or replicas), which receive updates from the leaders and serve reads.
Depending on how replication is configured, we can see the following architectures:
The leader ships all write requests to the followers, e.g., all `INSERT` or `UPDATE` queries. However, this is problematic:
UPDATE foo
SET updated_at=NOW()
WHERE id = 10
D: What is the issue with the code above in a replicated context?
Statement-based replication is non-deterministic and has been mostly abandoned.
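To see why non-determinism breaks statement-based replication, here is a toy sketch (the `apply_statement` helper is made up for illustration, not a database API): each replica re-evaluates `NOW()` at the moment it applies the statement, so the replicas end up storing different values.

```python
import time

def apply_statement(replica, stmt_fn):
    # Statement-based replication: every replica re-executes the statement,
    # so non-deterministic functions such as NOW() or RANDOM() are evaluated
    # independently on each node.
    replica["updated_at"] = stmt_fn()

leader, follower = {}, {}
now_stmt = time.time                 # stands in for SQL's NOW()

apply_statement(leader, now_stmt)    # the leader executes the UPDATE
time.sleep(0.5)                      # replication lag
apply_statement(follower, now_stmt)  # the follower re-executes it later

print(leader["updated_at"] == follower["updated_at"])  # False: the replicas diverge
```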
Typically, databases write their data to data structures, such as \(B^+\)-trees. However, before actually modifying the data structure, they write the intended change to an append-only write-ahead log (WAL).
With WAL-based replication, the leader writes all changes to its WAL and also ships the WAL entries to the followers. The followers apply the WAL entries to reach a state consistent with the leader.
The main problem with WAL-based replication is that it is bound to the implementation of the underlying data structure; if this changes in the leader, the followers stop working.
Postgres and Oracle use WAL-based replication.
The database generates a stream of logical updates for each update to the WAL. Logical updates can be: for an inserted row, the new values of all its columns; for a deleted row, enough information to uniquely identify it; for an updated row, its identifier together with the new column values.
MongoDB and MySQL use this replication mechanism.
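As an illustration only (not the exact on-disk format of any particular system), a logical log entry for the earlier `UPDATE` might look like the following; note that the non-deterministic `NOW()` has already been evaluated on the leader, so every follower applies exactly the same values.

```python
# A hypothetical logical (row-based) log entry for:
#   UPDATE foo SET updated_at = NOW() WHERE id = 10
log_entry = {
    "lsn": 30591478,                                      # position in the replication log
    "table": "foo",
    "op": "update",
    "key": {"id": 10},                                    # identifies the affected row
    "new_values": {"updated_at": "2021-03-01 12:00:00"},  # values as evaluated on the leader
}
```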
When a new follower is set up from a snapshot of the leader, the snapshot is associated with an id that corresponds to the state of the leader's replication log at the time the snapshot was created; the follower uses this id to request all subsequent changes.
Leader
> SHOW MASTER STATUS;
+--------------------+----------+
| File               | Position |
+--------------------+----------+
| mariadb-bin.004252 | 30591477 |
+--------------------+----------+
1 row in set (0.00 sec)
Follower
> CHANGE MASTER TO
    MASTER_HOST='10.0.0.7',
    MASTER_USER='replicator',
    MASTER_PORT=3306,
    MASTER_CONNECT_RETRY=20,
    MASTER_LOG_FILE='mariadb-bin.452',
    MASTER_LOG_POS=30591477;
> SHOW SLAVE STATUS\G
        Master_Host: 10.0.0.7
        Master_User: replicator
        Master_Port: 3306
    Master_Log_File: mariadb-bin.452
Read_Master_Log_Pos: 34791477
     Relay_Log_File: relay-bin.000032
      Relay_Log_Pos: 1332
   Slave_IO_Running: Yes
  Slave_SQL_Running: Yes
When a write occurs on a node, it is distributed to the other nodes in either of the two following modes:
Synchronously: Writes need to be confirmed by a configurable number of followers before the leader reports success.
Asynchronously: The leader reports success immediately after a write was committed to its own disk; followers apply the changes at their own pace.
D: What are the advantages/disadvantages of each replication type?
The leader waits until the followers have received the update before reporting success.
It’s impractical for all followers to be synchronous.
The leader reports success and asynchronously updates the followers.
Asynchronous replication may cause clients to observe anomalous scenarios such as:
Read-after-write: Clients may not see their own writes, i.e., when they connect to a replica which does not have the update.
(non-)Monotonic reads: When reading from multiple replicas concurrently, a stale replica might not return records that the client already read from an up-to-date one.
Causality violations: Updates might be visible in different orders at different replicas. This allows for violations of the happened-before property.
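The read-after-write anomaly, sketched under simplifying assumptions (an in-memory leader and follower, and a replication queue that is applied later; all names are illustrative):

```python
leader, follower = {}, {}
replication_queue = []                      # changes not yet applied on the follower

def write(key, value):
    leader[key] = value                     # the leader commits locally...
    replication_queue.append((key, value))  # ...and ships the change asynchronously
    return "ok"                             # success reported before the follower catches up

def read(replica, key):
    return replica.get(key)

write("profile:42", "new bio")
print(read(follower, "profile:42"))         # None: the client does not see its own write

for key, value in replication_queue:        # eventually the follower applies the changes
    follower[key] = value
print(read(follower, "profile:42"))         # 'new bio': the replicas have converged
```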
Multiple leader nodes accept writes; replication to the followers works in a similar way to the single-leader case.
The biggest problem with multi-leader replication is write conflicts.
To demonstrate write conflicts, let's think in terms of `git`:
# Clock
# User 1
git clone git://.... # t+1
git add foo.c # t+2
##hack hack hack
git commit -a -m 'Hacked v1' # t+3
git push # t+5
# User 2
git clone git://.... # t+2
git add foo.c # t+5
##hack hack hack
git commit -m 'Hacked new file' # t+6
git push # fails # t+7
git pull # CONFLICT # t+7
If we replace each user with a leader node, we have exactly the same problem.
One leader per session: If session writes do not interfere (e.g., data are only stored per user), this will avoid issues altogether.
Converge to a consistent state: Apply conflict resolution policies, e.g., last-write-wins (ordering writes by timestamp, at the risk of losing data), or report the conflict to the application and let the user resolve it (as `git` or Google Docs do).

In contrast to leader-based replication schemes, in leaderless replication any replica can accept read/write queries from the clients.
Writes are successful if acknowledged by at least W replicas, and reads are successful if answered by at least R replicas. With N replicas in total, choosing W + R > N guarantees that the read and write quorums overlap, so a read sees the most recent successful write.
Example scenario with N=3, W=2, R=2 (sketched below):
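A toy sketch of quorum reads and writes under these assumptions (in-memory replicas storing versioned values; not how any specific database implements it):

```python
N, W, R = 3, 2, 2                      # W + R > N: read and write quorums overlap

replicas = [{} for _ in range(N)]      # each replica maps key -> (version, value)

def quorum_write(key, value, version, available):
    # Write to all reachable replicas; succeed only if at least W acknowledge.
    acks = 0
    for i in available:
        replicas[i][key] = (version, value)
        acks += 1
    return acks >= W

def quorum_read(key, available):
    # Read from R replicas and keep the value with the highest version.
    answers = [replicas[i].get(key, (0, None)) for i in available[:R]]
    if len(answers) < R:
        raise RuntimeError("not enough replicas for a read quorum")
    return max(answers, key=lambda t: t[0])[1]

# Replica 2 is down during the write, but the write still reaches W=2 replicas.
assert quorum_write("x", "v1", version=1, available=[0, 1])
# A later read that contacts replicas 1 and 2 still sees the latest value,
# because any R=2 replicas must include at least one that took the write.
print(quorum_read("x", available=[1, 2]))   # 'v1'
```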
Replication of a total order of operations is fundamentally a consensus problem.
Q: Can we design a system that is available (i.e., it can serve client requests) even when some nodes in the system are not available or unreachable?
Synchronous replication: Up-to-date replicas (consistent), but the system is not available if any follower fails.
Asynchronous replication: Replicas may not have up-to-date data (inconsistent), but the system remains available in the presence of failures.
CAP theorem: In a replicated system, it is impossible to get all three of:
Consistency
Availability
Partition tolerance
The trade-off is not “Availability vs Consistency” but “Availability vs Strong Consistency”
A consistency model is a contract between the programmer and the replicated system: it specifies the consistency between replicas and what can be observed as possible results of operations.
It answers questions such as:
A variety of consistency models has been proposed and implemented.
In this course, we will discuss the following consistency models:
Weak consistency models trade-off strong consistency for better performance.
All updates are eventually delivered to all replicas.
All replicas reach a consistent state if no more user updates arrive
Examples:
D: Which synchronization architectures can be used to implement eventual consistency?
Causally related operations are delivered to other replicas in the correct order.
Causal consistency does not require synchronous coordination across nodes and still provides a sensible view of operations.
D: For what kind of applications would you use a causally consistent system?
Client-centric consistency models provide guarantees about the ordering of operations only for a single client process:
Monotonic reads: If a process reads the value of x, any successive read operation on x by that process will always return that same value or a more recent value.
Monotonic writes: If a process writes a value to x, the replica on which a successive operation is performed reflects the effect of that previous write operation by the same process.
Read your writes: The effect of a write operation by a process on x will always be seen by a successive read operation on x by the same process.
Writes follow reads: If a process writes a value to x following a previous read operation on x by the same process, the write is guaranteed to take place on the same or a more recent value of x than the one that was read.
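A sketch of how a client library might enforce monotonic reads and read-your-writes by tracking the highest replication-log position it has observed; the `Replica` class and its lag are assumptions made for illustration.

```python
class Replica:
    # A replica that has applied the replication log up to applied_pos.
    def __init__(self, applied_pos, data):
        self.applied_pos = applied_pos
        self.data = data

class SessionClient:
    # Remembers the highest log position this session has seen and refuses to
    # read from replicas that lag behind it (monotonic reads and read-your-writes,
    # at the cost of availability).
    def __init__(self):
        self.last_seen_pos = 0

    def read(self, key, replicas):
        for r in replicas:
            if r.applied_pos >= self.last_seen_pos:
                self.last_seen_pos = r.applied_pos
                return r.data.get(key)
        raise RuntimeError("no sufficiently up-to-date replica available")

fresh = Replica(applied_pos=10, data={"x": "new"})
stale = Replica(applied_pos=7, data={"x": "old"})

client = SessionClient()
print(client.read("x", [fresh]))         # 'new'; the session position is now 10
print(client.read("x", [stale, fresh]))  # the stale replica is skipped, still 'new'
```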
The operations appear to take place in some total order that is consistent with the order of operations on each replica.
Linearizability: Sequential consistency + the total order of operations conforms to the real-time ordering:
Q: Is the execution above linearizable? (Adapted from the examples from Kleppman[1])
A: Yes, there is a total order of operations that conforms to their real-time order and satisfies the specification of the data variable x.
Q: Is the execution above linearizable?
A: No; in a linearizable system, it is not possible for the last read operation at R1 to return the value 0.
Partitioning, also known as sharding, breaks the dataset into fractions and distributes them to different hosts.
The main reason is scalability:
Partitioning is always combined with replication. The reason is that with partitioning alone, a node failure would result in irreversible data loss.
The following 3 strategies are commonly used: partitioning by key range, partitioning by the hash of a key, and custom partitioning that exploits properties of the data, e.g. pre-hashed keys (such as `git` commits) or location-specific data (e.g. all records from Europe).

On partitioned datasets, clients need to be aware of the partitioning scheme in order to direct queries and writes to the appropriate nodes.
To hide partitioning details from the client, most partitioned systems feature a query router component sitting between the client and the partitions.
The query router knows the employed partitioning scheme and directs requests to the appropriate partitions.
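A minimal sketch of a query router that uses hash partitioning to pick the node responsible for a key (the node addresses and the partitioning function are illustrative assumptions):

```python
import hashlib

nodes = ["10.0.0.1:5432", "10.0.0.2:5432", "10.0.0.3:5432"]   # hypothetical partition hosts

def partition_for(key: str, num_partitions: int) -> int:
    # Hash partitioning: a stable hash of the key, modulo the number of partitions.
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

def route(key: str) -> str:
    # The router hides the partitioning scheme from the client.
    return nodes[partition_for(key, len(nodes))]

print(route("user:1042"))   # always the same node for the same key
```

Note that a plain hash-mod-N scheme makes rebalancing expensive when the number of nodes changes; real systems typically use many more logical partitions than nodes, or consistent hashing, for this reason.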
Optional reading (not part of the exam material)
What could potentially go wrong?
What is the state of the data after this scenario?
As programmers, we are mostly concerned about the code’s happy path. Systems use transactions to guard against catastrophic scenarios.
Transactions[2] are blocks of operations (reads and writes) that either succeed or fail as a whole.
Transactions are defined in the context of an application that wants to modify some data and a system that guards data consistency.
The concept of the transaction started with the first SQL database, System R; while the mechanics have changed, the semantics have remained stable for the last 40 years.
Transactions provide the following guarantees[3]: Atomicity, Consistency, Isolation, and Durability (ACID).
Note that the notion of consistency is different from the “consistency” of replication. In the context of ACID, consistency refers to the good state of the database where the data integrity constraints hold.
Databases try to hide concurrency issues from applications by isolating transactions from each other.
Isolation ensures that transactions are processed without interference: The goal is to prevent reads and writes of data written by incomplete or aborting transactions.
Isolation guarantees restrict how and when writes become visible. There are various degrees of isolation, i.e., of how isolated a transaction is:
High isolation: Reduces concurrency (more locks are used) and availability, but favors database consistency (a valid state).
Low isolation: Increases concurrency and availability, but allows some inconsistent data.
The highest level of isolation provides serializable transactions.
The highest level of isolation entails executing transactions as if they were serially executed. To implement serializability, databases either actually execute transactions in serial order, use pessimistic concurrency control (e.g., two-phase locking), or use optimistic concurrency control techniques such as multi-version concurrency control (MVCC).
With MVCC, the database works like Git:
Each transaction works on a consistent, private snapshot of the data (`git checkout -b A`)
Writes are recorded against that snapshot (`git commit -a`)
When the transaction commits, its changes are merged back into the current database state; conflicting concurrent changes cause the transaction to abort (`git checkout master; git merge A`)
Versions that are no longer visible to any transaction are garbage collected (`git gc`)
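A toy multi-version store in the spirit of the analogy above (heavily simplified; real MVCC implementations track per-row versions and visibility rules rather than whole-database snapshots):

```python
class MVCCStore:
    # Each committed state is an immutable version; a transaction reads the
    # snapshot that was current when it started and publishes a new version
    # on commit (first committer wins on conflicting keys).
    def __init__(self):
        self.versions = [{}]                          # version 0: empty database

    def begin(self):
        snapshot_id = len(self.versions) - 1          # like `git checkout -b A`
        return snapshot_id, dict(self.versions[snapshot_id])

    def commit(self, snapshot_id, writes):
        current = self.versions[-1]
        # Abort if another transaction has committed to the same keys meanwhile.
        for key in writes:
            if current.get(key) != self.versions[snapshot_id].get(key):
                raise RuntimeError("write conflict, transaction aborted")
        self.versions.append({**current, **writes})   # like merging the branch back

store = MVCCStore()
snap, view = store.begin()
store.commit(snap, {"x": 1})        # the first transaction commits
try:
    store.commit(snap, {"x": 2})    # a concurrent transaction on the same snapshot conflicts
except RuntimeError as e:
    print(e)
```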
While serializable isolation works, it may be slow. Consequently, databases have historically made compromises in how they implement isolation.
| Isolation level  | DR | NRR | PR |
|------------------|----|-----|----|
| Read uncommitted | X  | X   | X  |
| Read committed   |    | X   | X  |
| Repeatable read  |    |     | X  |
| Serializable     |    |     |    |

(X = the anomaly can occur; DR = dirty reads, NRR = non-repeatable reads, PR = phantom reads)
Consistency without isolation:
Two concurrent transactions increment the value of `x`. The transactions are not serialized, so the final value of `x` is incorrect.
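The lost-update problem from this example, sketched with two interleaved transactions (plain Python assignments standing in for the SQL read and write steps):

```python
x = 100            # initial value in the database

# Both transactions read x before either of them writes it back.
t1_read = x        # T1: read x -> 100
t2_read = x        # T2: read x -> 100

x = t1_read + 1    # T1: write x = 101
x = t2_read + 1    # T2: write x = 101, silently overwriting T1's update

print(x)           # 101 instead of 102: one increment was lost
```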
Isolation without consistency:
Transactions run in isolation, but T3 reads and operates on stale/incorrect data.
In a distributed database, a transaction spanning multiple nodes must either succeed on all nodes or fail (to maintain atomicity).
Transactions may also span multiple systems; for example, we may try to remove a record from a database and add it to a queue service in an atomic way.
In contrast to the general consensus problem in distributed systems, where agreement by a majority of nodes suffices, in atomic commit all nodes must agree on whether a transaction has been successfully committed.
The most common mechanism used to deal with distributed atomic commits is the two-phase commit (2PC) protocol.
In 2PC, we find the following roles: a coordinator (also called the transaction manager) and the participants (the nodes taking part in the transaction).
The 2PC protocol makes the following assumptions:
D: What problems do you see with 2PC?
2PC is a blocking atomic commitment protocol: if the coordinator fails after the participants have voted to commit, the participants cannot decide unilaterally and must wait for the coordinator to recover.
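A sketch of the two phases, assuming participants that expose hypothetical `prepare`, `commit` and `abort` operations and that never crash (handling crashes is exactly where the real protocol gets complicated):

```python
class Participant:
    # A node taking part in the distributed transaction.
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self, txn):
        # Make the transaction durable locally, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self, txn):
        self.state = "committed"

    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1 (voting): every participant must vote yes for the transaction to commit.
    votes = [p.prepare(txn) for p in participants]
    # Phase 2 (completion): the coordinator broadcasts the decision to everyone.
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        if decision == "commit":
            p.commit(txn)
        else:
            p.abort(txn)
    return decision

print(two_phase_commit([Participant("db"), Participant("queue")], "txn-1"))                     # commit
print(two_phase_commit([Participant("db"), Participant("queue", can_commit=False)], "txn-2"))   # abort
```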
This work is (c) 2017, 2018, 2019, 2020, 2021 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.