What is a distributed system?
According to Wikipedia: A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.
We will focus on the following four issues with distributed systems: partial failures, unreliable networks, unreliable time, and the difficulty of reaching agreement (consensus).
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” (Leslie Lamport)
Distributed systems must tolerate partial failures: Any one point in the system can fail, yet the system must continue to correctly function as a whole.
Partial failures occur frequently in large distributed systems and they may cause real-world outages.
It is hard to detect whether something has failed or not, because the time it takes for a message to travel across a network varies and has no guaranteed upper bound.
Q: Imagine a client server application. The client sends a message to the server, but receives no response. What might have gone wrong?
A: Has the service failed? Is the request or response message lost in the network?
The answer depends on the network assumptions.
Two types of network systems: synchronous systems, where message delivery (and processing) delays have a known upper bound, and asynchronous systems, where no such bound exists.
Purely synchronous systems only exist in theory.
Most distributed systems use some form of asynchronous networking to communicate.
While waiting for a response to a request in an asynchronous system, it is not possible to distinguish whether:
* the request was lost,
* the remote node has crashed,
* the response was lost, or
* the request or response is simply delayed.
The usual remedy is to set timeouts and retry the request until it succeeds.
Timeouts are a fundamental design choice in asynchronous networks: Ethernet, TCP and most application protocols work with timeouts.
Retransmitting messages may help ensure the reliability of the communication links, but it introduces unpredictable delays. In this sense, practical systems are partially synchronous: delays are usually bounded, but the bound is not known in advance and may occasionally be violated.
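As an illustration, here is a minimal sketch of the timeout-and-retry pattern in Java; the URL, timeout, retry limit and back-off values are arbitrary choices made for the example:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Timeout-and-retry sketch: on a timeout we cannot tell whether the request
// was lost, the response was lost, or the server is merely slow -- we retry.
public class RetryingClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.com/ping"))
                .timeout(Duration.ofSeconds(2))   // per-request timeout (example value)
                .GET().build();

        int maxRetries = 5;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("Attempt " + attempt + ": got " + response.statusCode());
                return;                           // success: stop retrying
            } catch (java.io.IOException e) {     // includes HttpTimeoutException
                System.out.println("Attempt " + attempt + " failed: " + e);
                Thread.sleep(1000L * attempt);    // back off before retrying
            }
        }
        System.out.println("Giving up after " + maxRetries + " attempts");
    }
}
```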
In a study of network failures by Microsoft Research [1], they found that:
The data is for a professionally managed data center by a single company.
On the public cloud, failures may affect thousands of systems in parallel.
In a distributed system, time is the only global constant nodes can rely on to make distributed decisions on ordering problems.
When do we need to order?
Time in computers is kept in two ways:
* Wall-clock time, which follows real-world time and can be adjusted (e.g. by NTP); in Java, System.currentTimeMillis().
* Monotonic time, a counter that only ever moves forward; in Java, System.nanoTime().

Q: Which clock type would you use to benchmark a routine? Why?
Monotonic clocks are maintained by the OS and rely on HW counters exposed by CPUs. They are (usually!) good for determining order within a node, but each node only has its own notion of time.
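To make the answer concrete, a small sketch of timing a routine with the monotonic clock (the summing loop is only a placeholder workload):

```java
// Benchmarking sketch: use the monotonic clock for measuring elapsed time.
public class Benchmark {
    public static void main(String[] args) {
        long start = System.nanoTime();            // monotonic: safe for durations
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) sum += i;
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("sum=%d computed in %.2f ms%n", sum, elapsedNanos / 1e6);
        // System.currentTimeMillis() reads the wall clock: an NTP adjustment
        // between the two readings could make the measured duration meaningless.
    }
}
```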
NTP can synchronize time across nodes with an accuracy of a few milliseconds. A modern CPU can execute \(10^6\) instructions (\(\times\) number of cores) in a millisecond!
Moreover, leap seconds are introduced every now and then; minutes may last for 61 or 59 seconds on occasion
\(\mu s\) accuracy is possible with GPS clocks, but expensive
Lamport introduced the notion of logical time in 1978 in his celebrated paper “Time, Clocks, and the Ordering of Events in a Distributed System”
Idea: Instead of using the precise clock time, capture the ordering relationship between pairs of events.
Based on causality: If some event possibly causes another event, then the first event happened-before the other
What is order? A way of arranging items in a set so that the following properties are maintained.
Strict partial order: a binary relation that is irreflexive, antisymmetric and transitive; some pairs of elements may be incomparable.
Strict total order: a strict partial order in which, additionally, any two distinct elements are comparable.
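In symbols, for a relation \(<\) on a set \(S\) (a standard formulation: the first three properties define a strict partial order; adding the last one gives a strict total order):

\[
\begin{aligned}
\text{irreflexive:} \quad & \forall a \in S: \neg(a < a) \\
\text{antisymmetric:} \quad & \forall a, b \in S: a < b \Rightarrow \neg(b < a) \\
\text{transitive:} \quad & \forall a, b, c \in S: (a < b \wedge b < c) \Rightarrow a < c \\
\text{total:} \quad & \forall a, b \in S: a \neq b \Rightarrow (a < b \vee b < a)
\end{aligned}
\]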
Events in a distributed system are of three kinds: local computation events, message send events, and message receive events.
Lamport introduced the happened-before relation (\(\rightarrow\)) to capture dependencies between events:
* If \(a\) and \(b\) are events on the same node and \(a\) occurs before \(b\), then \(a \rightarrow b\).
* If \(a\) is the sending of a message and \(b\) is the receipt of that same message, then \(a \rightarrow b\).
* If \(a \rightarrow b\) and \(b \rightarrow c\), then \(a \rightarrow c\) (transitivity).
It is a strict partial order: it is irreflexive, antisymmetric and transitive.
Two events not related by happened-before are concurrent.
Lamport introduced the eponymous logical timestamps in 1978 [2]: each node keeps a local counter, increments it before every local event, attaches it to every message it sends, and, upon receiving a message, sets its counter to \(\max(\text{local}, \text{received}) + 1\).
For two events \(a\) and \(b\), if \(a \rightarrow b\), then \(LT(a) < LT(b)\).
Q: The reverse is not true: \(LT(a) < LT(b)\) does not imply \(a \rightarrow b\). Why?
Example scenario with 4 nodes that exchange events.
Initial state of timestamps: \([A(0), B(0), C(0), D(0)]\)
E1. \(A\) sends to \(C\): \([A(1), B(0), C(0), D(0)]\)
E2. \(C\) receives from \(A\): \([A(1), B(0), C(2), D(0)]\)
E3. \(C\) sends to \(A\): \([A(1), B(0), C(3), D(0)]\)
E4. \(A\) receives from \(C\): \([A(4), B(0), C(3), D(0)]\)
E5. \(B\) sends to \(D\): \([A(4), B(1), C(3), D(0)]\)
E6. \(D\) receives from \(B\): \([A(4), B(1), C(3), D(2)]\)
At this point, \(LT\)(E6) < \(LT\)(E4), but it does not mean that E6 \(\rightarrow\) E4!
Events 4 and 6 are independent.
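A compact Lamport clock sketch (class and method names are illustrative); replaying the scenario above reproduces \(LT(E4) = 4\) and \(LT(E6) = 2\):

```java
// Minimal Lamport clock sketch.
public class LamportClock {
    private long time = 0;

    // A local or send event: increment the local counter.
    public synchronized long tick() { return ++time; }

    // A receive event: jump past the sender's timestamp, then increment.
    public synchronized long receive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock(), b = new LamportClock(),
                     c = new LamportClock(), d = new LamportClock();
        long e1 = a.tick();        // E1: A sends to C       -> 1
        long e2 = c.receive(e1);   // E2: C receives from A  -> 2
        long e3 = c.tick();        // E3: C sends to A       -> 3
        long e4 = a.receive(e3);   // E4: A receives from C  -> 4
        long e5 = b.tick();        // E5: B sends to D       -> 1
        long e6 = d.receive(e5);   // E6: D receives from B  -> 2
        // LT(E6) < LT(E4), yet E6 and E4 are concurrent.
        System.out.println("LT(E4)=" + e4 + ", LT(E6)=" + e6);
    }
}
```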
Vector clocks [3] can maintain causal order.
On a system with \(N\) nodes, each node \(i\) maintains a vector \(V_i\) of size \(N\).
They are updated as follows:
* Before every local or send event, node \(i\) increments its own entry \(V_i[i]\).
* Every message carries the sender's entire vector.
* Upon receiving a message, the node sets each entry of its vector to the element-wise maximum of its own and the received vector, and then increments its own entry.
For two events \(a\) and \(b\) and their vector clocks \(VC(a)\) and \(VC(b)\):
* \(a \rightarrow b\) iff \(VC(a) < VC(b)\), i.e. every entry of \(VC(a)\) is \(\leq\) the corresponding entry of \(VC(b)\) and at least one entry is strictly smaller.
* If neither \(VC(a) < VC(b)\) nor \(VC(b) < VC(a)\), then \(a\) and \(b\) are concurrent.
Vector clocks are expensive to maintain: they require \(O(n)\) timestamps to be exchanged with each communication.
However, it has been shown[4] that we cannot do better than vector clocks.
Initial state of vector clocks: \([A(0, 0, 0, 0), B(0, 0, 0, 0), C(0, 0, 0, 0), D(0, 0, 0, 0)]\)
E1. \(A\) sends to \(C\): \([A(1, 0, 0, 0), B(0, 0, 0, 0), C(0, 0, 0, 0), D(0, 0, 0, 0)]\)
E2. \(C\) receives from \(A\): \([A(1, 0, 0, 0), B(0, 0, 0, 0), C(1, 0, 1, 0), D(0, 0, 0, 0)]\)
E3. \(C\) sends to \(A\): \([A(1, 0, 0, 0), B(0, 0, 0, 0), C(1, 0, 2, 0), D(0, 0, 0, 0)]\)
E4. \(A\) receives from \(C\): \([A(2, 0, 2, 0), B(0, 0, 0, 0), C(1, 0, 2, 0), D(0, 0, 0, 0)]\)
E5. \(B\) sends to \(D\): \([A(2, 0, 2, 0), B(0, 1, 0, 0), C(1, 0, 2, 0), D(0, 0, 0, 0)]\)
E6. \(D\) receives from \(B\): \([A(2, 0, 2, 0), B(0, 1, 0, 0), C(1, 0, 2, 0), D(0, 1, 0, 1)]\)
Q: Are E6 and E4 causally related?
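A minimal vector clock sketch (names are illustrative); replaying the scenario above shows that neither vector dominates the other, so E4 and E6 are concurrent:

```java
import java.util.Arrays;

// Minimal vector clock sketch for N nodes.
public class VectorClock {
    private final int[] v;
    private final int self;               // index of the owning node

    VectorClock(int n, int self) { this.v = new int[n]; this.self = self; }

    int[] send() { v[self]++; return v.clone(); }          // local/send event

    void receive(int[] other) {                            // receive event
        for (int i = 0; i < v.length; i++) v[i] = Math.max(v[i], other[i]);
        v[self]++;
    }

    int[] snapshot() { return v.clone(); }

    // a happened-before b iff a <= b element-wise and a != b.
    static boolean happenedBefore(int[] a, int[] b) {
        boolean strictlyLess = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    public static void main(String[] args) {
        VectorClock A = new VectorClock(4, 0), B = new VectorClock(4, 1),
                    C = new VectorClock(4, 2), D = new VectorClock(4, 3);
        int[] m1 = A.send();                                 // E1
        C.receive(m1);                                       // E2
        int[] m2 = C.send();                                 // E3
        A.receive(m2); int[] e4 = A.snapshot();              // E4 -> [2,0,2,0]
        int[] m3 = B.send();                                 // E5
        D.receive(m3); int[] e6 = D.snapshot();              // E6 -> [0,1,0,1]
        System.out.println("E4=" + Arrays.toString(e4) + " E6=" + Arrays.toString(e6));
        System.out.println("E4 -> E6? " + happenedBefore(e4, e6));  // false
        System.out.println("E6 -> E4? " + happenedBefore(e6, e4));  // false: concurrent
    }
}
```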
Nodes in distributed systems cannot know anything for sure
Individual nodes cannot rely on their own information
“Split-brain” scenarios: Parts of the system know a different version of the truth than the other part(s)
Reaching consensus is a fundamental problem in distributed systems.
Consensus: Providing agreement in the presence of faulty nodes.
Q: (The two generals problem) Two generals must agree on a time to attack, but can only communicate through messengers who may be captured. How can the generals decide when to attack?
A: It is impossible to make a reliable decision.
Consequences: we can only make distributed decisions using either reliable communication or more than 2 parties.
Formulated by Lamport et al. [5], the Byzantine generals problem shaped distributed systems research for the next 40 years.
With only three generals, no solution can work in the presence of a single traitor.
Fault tolerant consensus:
A consensus protocol defines a set of rules for message exchange and processing for distributed components to reach agreement.
Basic properties of crash fault tolerant consensus protocols include:
* Agreement: all non-faulty nodes decide on the same value.
* Validity: the value decided upon was proposed by some node.
* Termination: all non-faulty nodes eventually decide.
Consensus algorithms can be used in state-machine replication (SMR) systems to keep the log module of replicated state machines in sync.
The consensus protocol ensures that all replicas agree on the same order of commands/entries.
Paxos and Raft are crash fault tolerant consensus protocols for state machine replication.
Paxos defines three roles:
* Proposer: Chooses a value (or receives one from a client) and sends it to a set of acceptors to collect votes.
* Acceptor: Votes to accept or reject the values proposed by the proposer. For fault tolerance, the algorithm requires only a majority of acceptor votes.
* Learner: Adopts the value once a large enough number of acceptors have accepted it.
Every proposal consists of a value, proposed by the client, and a unique monotonically increasing proposal number, aka ballot number.
Acceptance of proposals by a majority of processes/servers provides fault tolerance.
Phase 1: Voting. The proposer sends a prepare request carrying its ballot number to the acceptors; each acceptor promises not to accept proposals with a lower ballot number and reports any value it has already accepted.
Phase 2: Replication. The proposer asks the acceptors to accept its value (or the highest-ballot value reported in phase 1); a value is chosen once a majority of acceptors accept it.
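A minimal sketch of the acceptor side of these two phases; class, field and message names are my own simplification of single-decree Paxos, not a complete protocol:

```java
// Sketch of a single-decree Paxos acceptor.
public class PaxosAcceptor {

    // Reply to a Phase 1 prepare: did we promise, and what (if anything)
    // have we already accepted? The proposer must re-propose that value.
    public record Promise(boolean ok, long acceptedBallot, String acceptedValue) {}

    private long promised = -1;        // highest ballot number we promised to honor
    private long acceptedBallot = -1;  // ballot of the value we accepted, if any
    private String acceptedValue = null;

    // Phase 1 (voting): promise to ignore lower-numbered proposals.
    public synchronized Promise prepare(long ballot) {
        if (ballot <= promised) return new Promise(false, -1, null); // reject stale ballot
        promised = ballot;
        return new Promise(true, acceptedBallot, acceptedValue);
    }

    // Phase 2 (replication): accept the value unless a higher ballot was promised.
    public synchronized boolean accept(long ballot, String value) {
        if (ballot < promised) return false;
        promised = ballot;
        acceptedBallot = ballot;
        acceptedValue = value;
        return true;
    }

    public static void main(String[] args) {
        PaxosAcceptor a = new PaxosAcceptor();
        System.out.println(a.prepare(1).ok());             // true: promise for ballot 1
        System.out.println(a.accept(1, "x"));               // true: "x" accepted with ballot 1
        System.out.println(a.prepare(0).ok());             // false: ballot 0 is stale
        System.out.println(a.prepare(2).acceptedValue());   // "x": a higher ballot must carry it forward
    }
}
```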
Raft separates leader election from log replication.
Raft ensures that logs are always consistent and that only servers with up-to-date logs can become leader.
Terms are used to identify obsolete information
Raft defines the following server states: follower, candidate, and leader.
Each server maintains its own current term value (there is no global view): terms are exchanged on every message, a server that sees a higher term adopts it (stepping down if it is a leader or candidate), and messages carrying a stale term are rejected.
Raft selects at most one leader at each term. For some terms, the election may fail and result in no leader for that term.
The leader accepts log entries from clients and replicates them across the servers, using the AppendEntries message to append entries to the followers' logs.
See more details in this visualization.
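A simplified sketch of how a follower might handle an AppendEntries message; the fields mirror the Raft paper's message layout, but the class is an illustrative model rather than a faithful implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a Raft follower handling an AppendEntries RPC.
public class RaftFollower {
    record Entry(long term, String command) {}

    private long currentTerm = 0;
    private final List<Entry> log = new ArrayList<>();

    // Returns true if the entry was appended, false if the leader's information
    // is obsolete or the logs do not match at prevLogIndex.
    public synchronized boolean appendEntries(long leaderTerm, int prevLogIndex,
                                              long prevLogTerm, Entry entry) {
        if (leaderTerm < currentTerm) return false;          // obsolete leader: reject
        currentTerm = leaderTerm;                             // adopt the newer term

        // Consistency check: our log must contain the entry the leader expects
        // at prevLogIndex, with the same term.
        if (prevLogIndex >= 0 &&
                (prevLogIndex >= log.size() || log.get(prevLogIndex).term() != prevLogTerm))
            return false;

        // Truncate any conflicting suffix, then append the new entry.
        while (log.size() > prevLogIndex + 1) log.remove(log.size() - 1);
        log.add(entry);
        return true;
    }

    public static void main(String[] args) {
        RaftFollower f = new RaftFollower();
        System.out.println(f.appendEntries(1, -1, 0, new Entry(1, "set x=1"))); // true
        System.out.println(f.appendEntries(0, 0, 1, new Entry(0, "stale")));    // false: term 0 is obsolete
        System.out.println(f.appendEntries(1, 0, 1, new Entry(1, "set y=2")));  // true: logs match at index 0
    }
}
```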
Raft and Paxos tolerate crash faults but not Byzantine faults: They assume that the exchanged messages are valid and true (i.e. non-Byzantine).
In open distributed systems (e.g. BitCoin) this assumption does not necessarily hold. Most open distributed systems use public-key cryptography: nodes register before joining and sign their messages to avoid man-in-the-middle (MITM) attacks.
Still, decisions require majority votes, so the \(n/2 + 1\) rule applies.
Example Byzantine fault tolerant protocols: Practical Byzantine Fault Tolerance (PBFT), BitCoin
The Fischer, Lynch and Paterson (FLP) theorem [9]: in an asynchronous network, no deterministic consensus protocol can be guaranteed to terminate if even a single node may fail.
A foundational result, proving the impossibility of deterministic distributed consensus. The system model the authors assume is fairly restrictive: processes are fully deterministic, and the protocol may use neither clocks (timeouts) nor randomization.
In practice, however, we can mitigate the consequences, as real systems are indeed allowed to use clocks and/or random numbers.
This work is (c) 2017, 2018, 2019, 2020, 2021 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.