What is a distributed system?
According to Wikipedia: A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.
We will focus on the following four issues with distributed systems: partial failures, unreliable networks, unreliable time, and the difficulty of reaching agreement (consensus).
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable” (Leslie Lamport)
Distributed systems must tolerate partial failures: Any one point in the system can fail, yet the system must continue to correctly function as a whole.
Partial failures occur frequently in large distributed systems and they may cause real-world outages.
It is hard to detect whether something has failed or not, because the time it takes for a message to travel across a network varies and has no guaranteed upper bound.
Q: Imagine a client server application. The client sends a message to the server, but receives no response. What might have gone wrong?
A: Has the service failed? Is the request or response message lost in the network?
The answer depends on the network assumptions.
Two types of network systems: synchronous systems, where message delivery (and processing) delays have a known upper bound, and asynchronous systems, where no such bound exists.
Purely synchronous systems only exist in theory.
Most distributed systems use some form of asynchronous networking to communicate.
While waiting for a response to a request in an asynchronous system, it is not possible to distinguish whether:
* the request was lost,
* the remote node has crashed,
* the response was lost, or
* the request or response is simply delayed.
The usual remedy is to set timeouts and retry the request until it succeeds.
Timeouts are a fundamental design choice in asynchronous networks: Ethernet, TCP and most application protocols work with timeouts.
Retransmitting messages may help ensure the reliability of the communication links, but it introduces unpredictable delays. In this sense, practical systems are partially synchronous: delays are usually bounded, but the bound is not known in advance and may occasionally be violated.
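As an illustration, here is a minimal sketch of the timeout-and-retry pattern in Java; the URL, timeout, retry limit and back-off values are arbitrary choices made for the example:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Timeout-and-retry sketch: on a timeout we cannot tell whether the request
// was lost, the response was lost, or the server is merely slow -- we retry.
public class RetryingClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.com/ping"))
                .timeout(Duration.ofSeconds(2))   // per-request timeout (example value)
                .GET().build();

        int maxRetries = 5;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("Attempt " + attempt + ": got " + response.statusCode());
                return;                           // success: stop retrying
            } catch (java.io.IOException e) {     // includes HttpTimeoutException
                System.out.println("Attempt " + attempt + " failed: " + e);
                Thread.sleep(1000L * attempt);    // back off before retrying
            }
        }
        System.out.println("Giving up after " + maxRetries + " attempts");
    }
}
```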
In a study of network failures by Microsoft Research [1], they found that:
The data is for a professionally managed data center by a single company.
On the public cloud, failures may affect thousands of systems in parallel.
In a distributed system, time is the only global constant nodes can rely on to make distributed decisions on ordering problems.
When do we need to order?
Time in computers is kept in two ways:
* Wall-clock time, which follows real-world time and can be adjusted (e.g. by NTP); in Java, System.currentTimeMillis().
* Monotonic time, a counter that only ever moves forward; in Java, System.nanoTime().

Q: Which clock type would you use to benchmark a routine? Why?
Monotonic clocks are maintained by the OS and rely on HW counters exposed by CPUs. They are (usually!) good for determining order within a node, but each node only has its own notion of time.
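To make the answer concrete, a small sketch of timing a routine with the monotonic clock (the summing loop is only a placeholder workload):

```java
// Benchmarking sketch: use the monotonic clock for measuring elapsed time.
public class Benchmark {
    public static void main(String[] args) {
        long start = System.nanoTime();            // monotonic: safe for durations
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) sum += i;
        long elapsedNanos = System.nanoTime() - start;
        System.out.printf("sum=%d computed in %.2f ms%n", sum, elapsedNanos / 1e6);
        // System.currentTimeMillis() reads the wall clock: an NTP adjustment
        // between the two readings could make the measured duration meaningless.
    }
}
```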
NTP can synchronize time across nodes with an accuracy of a few milliseconds. A modern CPU can execute \(10^6\) instructions (\(\times\) number of cores) in a millisecond!
Moreover, leap seconds are introduced every now and then; minutes may last for 61 or 59 seconds on occasion
\(\mu s\) accuracy is possible with GPS clocks, but expensive
Lamport introduced the notion of logical time in 1978 in his celebrated paper “Time, Clocks, and the Ordering of Events in a Distributed System”
Idea: Instead of using the precise clock time, capture the ordering relationship between pairs of events.
Based on causality: If some event possibly causes another event, then the first event happened-before the other
What is order? A way of arranging items in a set so that the following properties are maintained.
Strict partial order: a binary relation that is irreflexive, antisymmetric and transitive; some pairs of elements may be incomparable.
Strict total order: a strict partial order in which, additionally, any two distinct elements are comparable.
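In symbols, for a relation \(<\) on a set \(S\) (a standard formulation: the first three properties define a strict partial order; adding the last one gives a strict total order):

\[
\begin{aligned}
\text{irreflexive:} \quad & \forall a \in S: \neg(a < a) \\
\text{antisymmetric:} \quad & \forall a, b \in S: a < b \Rightarrow \neg(b < a) \\
\text{transitive:} \quad & \forall a, b, c \in S: (a < b \wedge b < c) \Rightarrow a < c \\
\text{total:} \quad & \forall a, b \in S: a \neq b \Rightarrow (a < b \vee b < a)
\end{aligned}
\]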
Events in a distributed system are of three kinds: local computation events, message send events, and message receive events.
Lamport introduced the happened-before relation (\(\rightarrow\)) to capture dependencies between events:
* If \(a\) and \(b\) are events on the same node and \(a\) occurs before \(b\), then \(a \rightarrow b\).
* If \(a\) is the sending of a message and \(b\) is the receipt of that same message, then \(a \rightarrow b\).
* If \(a \rightarrow b\) and \(b \rightarrow c\), then \(a \rightarrow c\) (transitivity).
It is a strict partial order: it is irreflexive, antisymmetric and transitive.
Two events not related by happened-before are concurrent.
Lamport introduced the eponymous logical timestamps in 1978 [2]: each node keeps a local counter, increments it before every local event, attaches it to every message it sends, and, upon receiving a message, sets its counter to \(\max(\text{local}, \text{received}) + 1\).
For two events \(a\) and \(b\), if \(a \rightarrow b\), then \(LT(a) < LT(b)\).
Q: The reverse is not true: \(LT(a) < LT(b)\) does not imply \(a \rightarrow b\). Why?
Example scenario with 4 nodes that exchange events.
Initial state of timestamps: \([A(0), B(0), C(0), D(0)]\)
E1. \(A\) sends to \(C\): \([A(1), B(0), C(0), D(0)]\)
E2. \(C\) receives from \(A\): \([A(1), B(0), C(2), D(0)]\)
E3. \(C\) sends to \(A\): \([A(1), B(0), C(3), D(0)]\)
E4. \(A\) receives from \(C\): \([A(4), B(0), C(3), D(0)]\)
E5. \(B\) sends to \(D\): \([A(4), B(1), C(3), D(0)]\)
E6. \(D\) receives from \(B\): \([A(4), B(1), C(3), D(2)]\)
At this point, \(LT\)(E6) < \(LT\)(E4), but it does not mean that E6 \(\rightarrow\) E4!
Events 4 and 6 are independent.
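A compact Lamport clock sketch (class and method names are illustrative); replaying the scenario above reproduces \(LT(E4) = 4\) and \(LT(E6) = 2\):

```java
// Minimal Lamport clock sketch.
public class LamportClock {
    private long time = 0;

    // A local or send event: increment the local counter.
    public synchronized long tick() { return ++time; }

    // A receive event: jump past the sender's timestamp, then increment.
    public synchronized long receive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock(), b = new LamportClock(),
                     c = new LamportClock(), d = new LamportClock();
        long e1 = a.tick();        // E1: A sends to C       -> 1
        long e2 = c.receive(e1);   // E2: C receives from A  -> 2
        long e3 = c.tick();        // E3: C sends to A       -> 3
        long e4 = a.receive(e3);   // E4: A receives from C  -> 4
        long e5 = b.tick();        // E5: B sends to D       -> 1
        long e6 = d.receive(e5);   // E6: D receives from B  -> 2
        // LT(E6) < LT(E4), yet E6 and E4 are concurrent.
        System.out.println("LT(E4)=" + e4 + ", LT(E6)=" + e6);
    }
}
```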
Vector clocks [3] can maintain causal order.
On a system with \(N\) nodes, each node \(i\) maintains a vector \(V_i\) of size \(N\).
They are updated as follows:
* Before every local or send event, node \(i\) increments its own entry \(V_i[i]\).
* Every message carries the sender's entire vector.
* Upon receiving a message, the node sets each entry of its vector to the element-wise maximum of its own and the received vector, and then increments its own entry.
For two events \(a\) and \(b\) and their vector clocks \(VC(a)\) and \(VC(b)\):
* \(a \rightarrow b\) iff \(VC(a) < VC(b)\), i.e. every entry of \(VC(a)\) is \(\leq\) the corresponding entry of \(VC(b)\) and at least one entry is strictly smaller.
* If neither \(VC(a) < VC(b)\) nor \(VC(b) < VC(a)\), then \(a\) and \(b\) are concurrent.
Vector clocks are expensive to maintain: they require \(O(n)\) timestamps to be exchanged with each communication.
However, it has been shown[4] that we cannot do better than vector clocks.
Initial state of vector clocks: \([A(0, 0, 0, 0), B(0, 0, 0, 0), C(0, 0, 0, 0), D(0, 0, 0, 0)]\)
E1. \(A\) sends to \(C\): \([A(1, 0, 0, 0), B(0, 0, 0, 0), C(0, 0, 0, 0), D(0, 0, 0, 0)]\)
E2. \(C\) receives from \(A\): \([A(1, 0, 0, 0), B(0, 0, 0, 0), C(1, 0, 1, 0), D(0, 0, 0, 0)]\)
E3. \(C\) sends to \(A\): \([A(1, 0, 0, 0), B(0, 0, 0, 0), C(1, 0, 2, 0), D(0, 0, 0, 0)]\)
E4. \(A\) receives from \(C\): \([A(2, 0, 2, 0), B(0, 0, 0, 0), C(1, 0, 2, 0), D(0, 0, 0, 0)]\)
E5. \(B\) sends to \(D\): \([A(2, 0, 2, 0), B(0, 1, 0, 0), C(1, 0, 2, 0), D(0, 0, 0, 0)]\)
E6. \(D\) receives from \(B\): \([A(2, 0, 2, 0), B(0, 1, 0, 0), C(1, 0, 2, 0), D(0, 1, 0, 1)]\)
Q: Are E6 and E4 causally related?
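A minimal vector clock sketch (names are illustrative); replaying the scenario above shows that neither vector dominates the other, so E4 and E6 are concurrent:

```java
import java.util.Arrays;

// Minimal vector clock sketch for N nodes.
public class VectorClock {
    private final int[] v;
    private final int self;               // index of the owning node

    VectorClock(int n, int self) { this.v = new int[n]; this.self = self; }

    int[] send() { v[self]++; return v.clone(); }          // local/send event

    void receive(int[] other) {                            // receive event
        for (int i = 0; i < v.length; i++) v[i] = Math.max(v[i], other[i]);
        v[self]++;
    }

    int[] snapshot() { return v.clone(); }

    // a happened-before b iff a <= b element-wise and a != b.
    static boolean happenedBefore(int[] a, int[] b) {
        boolean strictlyLess = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    public static void main(String[] args) {
        VectorClock A = new VectorClock(4, 0), B = new VectorClock(4, 1),
                    C = new VectorClock(4, 2), D = new VectorClock(4, 3);
        int[] m1 = A.send();                                 // E1
        C.receive(m1);                                       // E2
        int[] m2 = C.send();                                 // E3
        A.receive(m2); int[] e4 = A.snapshot();              // E4 -> [2,0,2,0]
        int[] m3 = B.send();                                 // E5
        D.receive(m3); int[] e6 = D.snapshot();              // E6 -> [0,1,0,1]
        System.out.println("E4=" + Arrays.toString(e4) + " E6=" + Arrays.toString(e6));
        System.out.println("E4 -> E6? " + happenedBefore(e4, e6));  // false
        System.out.println("E6 -> E4? " + happenedBefore(e6, e4));  // false: concurrent
    }
}
```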
Nodes in distributed systems cannot know anything for sure
Individual nodes cannot rely on their own information
“Split-brain” scenarios: Parts of the system know a different version of the truth than the other part(s)
Reaching consensus is a fundamental problem in distributed systems.
Consensus: Providing agreement in the presence of faulty nodes.
Q: (The two generals problem) Two generals must agree on a time to attack, but can only communicate through messengers who may be captured. How can the generals decide when to attack?
A: It is impossible to make a reliable decision.
Consequences: we can only make distributed decisions using either reliable communication or more than 2 parties.
Formulated by Lamport et al. [5], the Byzantine generals problem shaped distributed systems research for the next 40 years.
With only three generals, no solution can work in the presence of a single traitor.
Fault tolerant consensus:
A consensus protocol defines a set of rules for message exchange and processing for distributed components to reach agreement.
Basic properties of crash fault tolerant consensus protocols include:
* Agreement: all non-faulty nodes decide on the same value.
* Validity: the value decided upon was proposed by some node.
* Termination: all non-faulty nodes eventually decide.
Consensus algorithms can be used in state-machine replication (SMR) systems to keep the log module of replicated state machines in sync.
The consensus protocol ensures that all replicas agree on the same order of commands/entries.
Paxos and Raft are crash fault tolerant consensus protocols for state machine replication.
Paxos defines three roles:
* Proposer: Chooses a value (or receives one from a client) and sends it to a set of acceptors to collect votes.
* Acceptor: Votes to accept or reject the values proposed by the proposer. For fault tolerance, the algorithm requires only a majority of acceptor votes.
* Learner: Adopts the value once a large enough number of acceptors have accepted it.
Every proposal consists of a value, proposed by the client, and a unique monotonically increasing proposal number, aka ballot number.
Acceptance of proposals by a majority of processes/servers provides fault tolerance.
Phase 1: Voting. The proposer sends a prepare request carrying its ballot number to the acceptors; each acceptor promises not to accept proposals with a lower ballot number and reports any value it has already accepted.
Phase 2: Replication. The proposer asks the acceptors to accept its value (or the highest-ballot value reported in phase 1); a value is chosen once a majority of acceptors accept it.
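A minimal sketch of the acceptor side of these two phases; class, field and message names are my own simplification of single-decree Paxos, not a complete protocol:

```java
// Sketch of a single-decree Paxos acceptor.
public class PaxosAcceptor {

    // Reply to a Phase 1 prepare: did we promise, and what (if anything)
    // have we already accepted? The proposer must re-propose that value.
    public record Promise(boolean ok, long acceptedBallot, String acceptedValue) {}

    private long promised = -1;        // highest ballot number we promised to honor
    private long acceptedBallot = -1;  // ballot of the value we accepted, if any
    private String acceptedValue = null;

    // Phase 1 (voting): promise to ignore lower-numbered proposals.
    public synchronized Promise prepare(long ballot) {
        if (ballot <= promised) return new Promise(false, -1, null); // reject stale ballot
        promised = ballot;
        return new Promise(true, acceptedBallot, acceptedValue);
    }

    // Phase 2 (replication): accept the value unless a higher ballot was promised.
    public synchronized boolean accept(long ballot, String value) {
        if (ballot < promised) return false;
        promised = ballot;
        acceptedBallot = ballot;
        acceptedValue = value;
        return true;
    }

    public static void main(String[] args) {
        PaxosAcceptor a = new PaxosAcceptor();
        System.out.println(a.prepare(1).ok());             // true: promise for ballot 1
        System.out.println(a.accept(1, "x"));               // true: "x" accepted with ballot 1
        System.out.println(a.prepare(0).ok());             // false: ballot 0 is stale
        System.out.println(a.prepare(2).acceptedValue());   // "x": a higher ballot must carry it forward
    }
}
```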
Raft separates leader election from log replication.
Raft ensures that logs are always consistent and that only servers with up-to-date logs can become leader.
Terms are used to identify obsolete information
Raft defines the following server states: follower, candidate, and leader.
Each server maintains its own current term value (there is no global view): terms are exchanged on every message, a server that sees a higher term adopts it (stepping down if it is a leader or candidate), and messages carrying a stale term are rejected.
Raft selects at most one leader at each term. For some terms, the election may fail and result in no leader for that term.
The leader accepts log entries from clients and replicates them across the servers, using the AppendEntries message to append entries to the followers' logs.
See more details in this visualization.
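A simplified sketch of how a follower might handle an AppendEntries message; the fields mirror the Raft paper's message layout, but the class is an illustrative model rather than a faithful implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a Raft follower handling an AppendEntries RPC.
public class RaftFollower {
    record Entry(long term, String command) {}

    private long currentTerm = 0;
    private final List<Entry> log = new ArrayList<>();

    // Returns true if the entry was appended, false if the leader's information
    // is obsolete or the logs do not match at prevLogIndex.
    public synchronized boolean appendEntries(long leaderTerm, int prevLogIndex,
                                              long prevLogTerm, Entry entry) {
        if (leaderTerm < currentTerm) return false;          // obsolete leader: reject
        currentTerm = leaderTerm;                             // adopt the newer term

        // Consistency check: our log must contain the entry the leader expects
        // at prevLogIndex, with the same term.
        if (prevLogIndex >= 0 &&
                (prevLogIndex >= log.size() || log.get(prevLogIndex).term() != prevLogTerm))
            return false;

        // Truncate any conflicting suffix, then append the new entry.
        while (log.size() > prevLogIndex + 1) log.remove(log.size() - 1);
        log.add(entry);
        return true;
    }

    public static void main(String[] args) {
        RaftFollower f = new RaftFollower();
        System.out.println(f.appendEntries(1, -1, 0, new Entry(1, "set x=1"))); // true
        System.out.println(f.appendEntries(0, 0, 1, new Entry(0, "stale")));    // false: term 0 is obsolete
        System.out.println(f.appendEntries(1, 0, 1, new Entry(1, "set y=2")));  // true: logs match at index 0
    }
}
```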
Raft and Paxos tolerate crash faults but not Byzantine faults: They assume that the exchanged messages are valid and true (i.e. non-Byzantine).
In open distributed systems (e.g. BitCoin) this assumption does not necessarily hold. Most open distributed systems use public-key cryptography: nodes register before joining and sign their messages to avoid man-in-the-middle (MITM) attacks.
Still, decisions require majority votes, so the \(n/2 + 1\) rule applies.
Example Byzantine fault tolerant protocols: Practical Byzantine Fault Tolerance (PBFT), BitCoin
The Fischer, Lynch and Paterson (FLP) theorem [9]: in an asynchronous network, no deterministic consensus protocol can be guaranteed to terminate if even a single node may fail.
A foundational result, proving the impossibility of deterministic distributed consensus. The system model the authors assume is fairly restrictive: processes are fully deterministic, and the protocol may use neither clocks (timeouts) nor randomization.
In practice, however, we can mitigate the consequences, as real systems are indeed allowed to use clocks and/or random numbers.
This work is (c) 2017, 2018, 2019, 2020, 2021 - onwards by TU Delft and Georgios Gousios and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.