Distributed filesystems
What is a filesystem?
File systems determine how data is stored and retrieved. A file
system keeps track of the following items:
- Files, which hold the data we want to store
- Directories, which group files together
- Metadata, such as file length, permissions and file types
The primary job of the filesystem is to make sure that the data is
always accessible and intact. To maintain consistency, most modern
filesystems use a log (remember databases!).
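The logging idea can be sketched in Python as a toy key-value "filesystem" (real filesystems journal at the block level; the class and its file format here are purely illustrative): every operation is forced to stable storage in the log *before* the in-memory state is mutated, so a crash can always be repaired by replaying the log.

```python
import json
import os

class JournaledStore:
    """Toy write-ahead log: operations are durably logged before being
    applied, so recovery after a crash is a replay of the log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}          # in-memory "filesystem" state
        self._replay()           # recover from any previous run

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                op = json.loads(line)
                self.state[op["key"]] = op["value"]

    def write(self, key, value):
        # 1. Log the intent and force it to stable storage...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. ...only then mutate the actual state.
        self.state[key] = value
```

If the process dies between steps 1 and 2, the next startup replays the log and the write still takes effect.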
Common filesystems are ext4, NTFS, APFS and ZFS.
Distributed file systems
A distributed filesystem is a file system which is shared by being
simultaneously mounted on multiple servers. Data in a distributed file
system is partitioned and replicated across the
network. Reads and writes can occur on any node.
Q: Windows and macOS already have network
filesystems. Why do we need extra ones?
Network file systems, such as CIFS and NFS, are not distributed
filesystems, because they rely on a centralized server to maintain
consistency. The server becomes a SPOF and a performance bottleneck.
Big data revolution and large-scale file-system organization
Big Data workloads created the need for large, distributed, highly
fault-tolerant filesystems to store data and process queries over it.
Seminal papers:
- “The Google file system.” By Sanjay Ghemawat, Howard
Gobioff, and Shun-Tak Leung (2003)
- “MapReduce: Simplified data processing on large clusters.”
By Jeffrey Dean and Sanjay Ghemawat (2004)
- “The Hadoop distributed file system” by Konstantin
Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (2010)
The Google filesystem
The Google Filesystem paper [1]
kicked off the Big Data revolution. Why did they need it though?
- Hardware failures are common (commodity hardware)
- Files are large (GB/TB) and their number is limited (millions, not
billions)
- Two main types of reads: large streaming reads and small random
reads
- Workloads with sequential writes that append data to files
- Once written, the files are seldom modified, they are read often
sequentially
- High sustained bandwidth trumps low latency
Large-scale distributed file systems:
- Google File System (GFS)
- Hadoop Distributed File System (HDFS)
- CloudStore
- Amazon Simple Storage Service (Amazon S3)
GFS storage model
- A single file can contain many objects (e.g. Web documents)
- Files are divided into fixed-size chunks (64MB) with unique 64-bit
handles, assigned by the master at chunk creation time.
- Disk seek time small compared to transfer time
- A single file can be larger than a node’s disk space
- Fixed size makes allocation computations easy
- Files are replicated (\(\ge 3\))
across chunk servers.
- The master maintains all file system metadata (e.g. mapping
between file names and chunk locations)
- Chunkservers store chunks on local disk as Linux files
- Reading & writing of data specified by the tuple
(chunk_handle, byte_range)
- Neither the client nor the chunkserver cache file data
(Why?)
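The fixed chunk size is what makes allocation computations easy: a byte offset in a file maps to a chunk index by integer division. A minimal sketch (the `chunk_request` helper and its return shape are illustrative, not from the paper; real chunk handles come from the master):

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB, as in the GFS paper

def chunk_request(file_offset, length):
    """Translate a (file offset, length) request into per-chunk
    byte ranges: which chunk indices it spans, and which bytes
    inside each chunk."""
    requests = []
    end = file_offset + length
    while file_offset < end:
        index = file_offset // CHUNK_SIZE     # which chunk
        start = file_offset % CHUNK_SIZE      # offset inside that chunk
        n = min(CHUNK_SIZE - start, end - file_offset)
        requests.append((index, (start, start + n)))
        file_offset += n
    return requests
```

A request that crosses a chunk boundary simply splits into two per-chunk ranges, one per replica lookup.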
GFS reads
- Client sends file read request to GFS master
- Master replies with chunk handle and locations
- Client caches the metadata
- Client sends a data request to one of the replicas
- The corresponding chunk server returns the requested data
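The read path above can be sketched as follows; `master` and `chunkservers` are stand-ins for the real RPC endpoints, and the metadata cache is what keeps the master off the data path:

```python
class GFSReadClient:
    """Sketch of the GFS read path: ask the master once per chunk,
    cache the (handle, locations) answer, then fetch the data
    directly from one of the replica chunkservers."""

    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers
        self.cache = {}  # (filename, chunk_index) -> (handle, locations)

    def read(self, filename, chunk_index, byte_range):
        key = (filename, chunk_index)
        if key not in self.cache:                       # steps 1-3
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, locations = self.cache[key]
        replica = self.chunkservers[locations[0]]       # step 4: pick a replica
        return replica.read(handle, byte_range)         # step 5: data transfer
```

Note that file *data* never flows through the master, only metadata does.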
GFS writes
- The client sends a request to the master to locate the
primary replica chunkserver
- The master sends the client the locations of the primary replica
and the secondary replicas
- The client pushes the write data to the buffers of all replica
chunkservers
- Once all replicas have received the data, the client tells the
primary replica to begin the write (the primary assigns serial numbers
to concurrent write requests)
- The primary replica writes the data to the appropriate chunk; the
secondary replicas then apply the same mutation in serial-number order
- The secondary replicas complete the write and report back
to the primary
- The primary sends the confirmation to the client
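The primary's role in the steps above can be sketched as follows (class and method names are hypothetical stand-ins for the real RPCs; the key point is the single serial-number order that every replica applies):

```python
class PrimaryReplica:
    """Sketch of the GFS write commit from the primary's viewpoint.
    The data has already been pushed to all replicas' buffers; the
    primary assigns serial numbers so that concurrent writes are
    applied in the same order on every replica."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.serial = 0
        self.applied = []        # (serial, write_id), in application order

    def commit(self, write_id):
        self.serial += 1                      # one global order, chosen here
        self.applied.append((self.serial, write_id))
        for s in self.secondaries:            # secondaries apply in that order
            s.apply(self.serial, write_id)
        return "ok"                           # confirmation back to the client
```

Because all replicas follow the primary's serial numbers, concurrent writers see one agreed mutation order even though no replica talks to any other.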
GFS operation
The master does not keep a persistent record of chunk locations;
instead, it queries the chunkservers at startup and is then kept up to
date through periodic heartbeat messages.
GFS is a journaled filesystem: all operations are added
to a log first, then applied. Periodically, log compaction creates
checkpoints. The log is replicated across nodes.
If a node fails:
- If it is the master, external monitoring is required to start a new
master elsewhere, which recovers by replaying the operation log
- If it is a chunkserver, it simply restarts
Chunkservers use checksums to detect data corruption
The master maintains a chunk version number to distinguish
between up-to-date and stale replicas (replicas that missed mutations
while their chunkserver was down).
- Before an operation on a chunk, the master ensures that the version
number is advanced
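The staleness check can be sketched as a comparison between the master's version number and the versions that replicas report at heartbeat time (a toy helper, not the paper's actual protocol):

```python
def split_replicas(master_version, reported):
    """Partition replicas into fresh and stale, given the master's
    current version number for a chunk. `reported` maps a chunkserver
    id to the version that chunkserver reports for its replica."""
    fresh = {cs for cs, v in reported.items() if v == master_version}
    stale = set(reported) - fresh   # missed mutations while down
    return fresh, stale
```

Stale replicas are excluded from client replies and eventually garbage-collected; only fresh replicas serve reads and writes.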
GFS consistency model
GFS has a relaxed consistency model that supports highly distributed
applications.
- File namespace mutations (e.g. file creation) are atomic, handled in
a global order at the master.
- The state of a file region after a data mutation depends on the type
of mutation, whether it succeeds or fails, and whether there are
concurrent mutations.
- Write: Changes to data are ordered as chosen by a primary
- Record append: Completes atomically at least once
Consistent: all clients always see the same data, regardless of
which replica they read from.
Defined: the region is consistent and clients see what the
mutation wrote in its entirety.
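A toy simulation of why record append completes "at least once", and how applications cope: a failed replica forces the client to retry, which duplicates the record on replicas that had already succeeded, so readers deduplicate using application-level record ids (all names here are illustrative; GFS itself also pads failed regions):

```python
class Replica:
    """Toy replica: may fail an append once; the record can still
    land in the log even when the acknowledgement is lost."""
    def __init__(self, fail_first=False):
        self.log = []
        self.fail_first = fail_first

    def append(self, record):
        self.log.append(record)       # write happens, ack may not
        if self.fail_first:
            self.fail_first = False   # fail once, then succeed
            return False
        return True

def record_append(replicas, record):
    """At-least-once append: retry until every replica acknowledges.
    A retried record therefore appears twice on some replicas.
    (Sketch only: loops forever if a replica never recovers.)"""
    while not all([r.append(record) for r in replicas]):
        pass  # list comprehension avoids short-circuiting: all replicas retry

def dedup(log):
    """Reader-side fix: discard duplicates via unique record ids."""
    seen, out = set(), []
    for rec_id, payload in log:
        if rec_id not in seen:
            seen.add(rec_id)
            out.append(payload)
    return out
```

After a retry the replicas are *consistent* (same records, same order) but contain duplicates, which is exactly why GFS applications are written to tolerate at-least-once semantics.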
HDFS – The Hadoop FileSystem
HDFS started at Yahoo as an open-source implementation of the GFS
paper, but since v2.0 it is a different system.
GFS                               | HDFS
----------------------------------|------------------------------------
Master                            | NameNode
Chunkserver                       | DataNode
Operation log                     | Journal
Chunk                             | Block
Random file writes                | Append-only
Multiple writers/readers          | Single writer, multiple readers
64MB chunks, 32-bit checksums     | 128MB blocks, separate metadata file
HDFS architecture
The main difference with GFS is that HDFS is a user-space filesystem
written in Java.
HDFS access session
# List directory
$ hadoop fs -ls /
Found 7 items
drwxr-xr-x - gousiosg sg 0 2017-10-04 08:23 /audioscrobbler
-rw-r--r-- 3 gousiosg sg 1215803135 2017-10-04 08:25 /ghtorrent-logs.txt
-rw-r--r-- 3 gousiosg sg 66773425 2017-10-04 08:23 /imdb.csv
-rw-r--r-- 3 gousiosg sg 198376 2017-10-23 12:39 /important-repos.csv
-rw-r--r-- 3 gousiosg sg 611300 2017-10-04 08:24 /odyssey.mb.txt
-rw-r--r-- 3 gousiosg sg 388422973 2017-10-03 15:40 /pullreqs.csv
# Create a new file
$ dd if=/dev/zero of=foobar bs=1M count=100
# Upload file
$ hadoop fs -put foobar /
-rw-r--r-- 3 gousiosg sg 104857600 2017-11-27 15:42 /foobar
HDFS looks like a UNIX filesystem, but does not offer the full set of
operations.
Bibliography
[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file
system,” in Proceedings of the nineteenth ACM Symposium on Operating
Systems Principles, 2003, pp. 29–43.
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop
distributed file system,” in IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST), 2010, pp. 1–10.