Distributed filesystems

What is a filesystem?

File systems determine how data is stored and retrieved. A file system keeps track of the following data items:

  • Files, which hold the data we want to store
  • Directories, which group files together
  • Metadata, such as file length, permissions, and file type (see the example below)
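
To see what this metadata looks like in practice, here is a small Python example using the standard os and stat modules (the path /etc/hosts is only an illustration; any existing file works):

import os
import stat

info = os.stat("/etc/hosts")                           # example path; substitute any file

print("size (bytes): ", info.st_size)                  # file length
print("permissions:  ", stat.filemode(info.st_mode))   # e.g. -rw-r--r--
print("is directory: ", stat.S_ISDIR(info.st_mode))    # file type
print("last modified:", info.st_mtime)                 # timestamp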

The primary job of the filesystem is to make sure that the data is always accessible and intact. To maintain consistency, most modern filesystems use a log (remember databases!).
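
To make the logging idea concrete, here is a minimal write-ahead-log sketch in Python. It models a toy key-value store rather than a real filesystem, and the file name journal.log is just an illustration: every update is appended to the log and flushed to disk before it is applied, so a crash can always be repaired by replaying the log.

import json
import os

LOG = "journal.log"  # illustrative journal file, not a real filesystem journal

def apply_update(state, key, value):
    # 1. Record the intended change in the log and force it to disk...
    with open(LOG, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())
    # 2. ...and only then modify the actual data structure.
    state[key] = value

def recover():
    # After a crash, rebuild a consistent state by replaying the log.
    state = {}
    if os.path.exists(LOG):
        with open(LOG) as log:
            for line in log:
                record = json.loads(line)
                state[record["key"]] = record["value"]
    return state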

Common filesystems are ext4, NTFS, APFS, and ZFS.

Distributed file systems

A distributed filesystem is a file system which is shared by being simultaneously mounted on multiple servers. Data in a distributed file system is partitioned and replicated across the network. Reads and writes can occur on any node.

Q: Windows and macOS already ship with network filesystems. Why do we need new ones?


Network file systems, such as CIFS and NFS, are not distributed filesystems, because they rely on a centralized server to maintain consistency. The server becomes a single point of failure (SPOF) and a performance bottleneck.

Big data revolution and large-scale file-system organization

The big data revolution created the need for large, distributed, highly fault-tolerant filesystems that can store huge datasets and support the workloads that query them.

Seminal papers:

  • “The Google file system,” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (2003)
  • “MapReduce: Simplified data processing on large clusters,” by Jeffrey Dean and Sanjay Ghemawat (2004)
  • “The Hadoop distributed file system,” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (2010)

The Google filesystem

The Google Filesystem paper[1] kicked off the Big Data revolution. Why did they need it, though?

  • Hardware failures are common (commodity hardware)
  • Files are large (GB/TB) and their number is limited (millions, not billions)
  • Two main types of reads: large streaming reads and small random reads
  • Workloads with sequential writes that append data to files
  • Once written, files are seldom modified; they are mostly read sequentially
  • High sustained bandwidth trumps low latency

Large-scale distributed file systems:

  • Google File System (GFS)
  • Hadoop Distributed File System (HDFS)
  • CloudStore
  • Amazon Simple Storage Service (Amazon S3)

GFS architecture

[Figure: GFS architecture]

GFS storage model

  • A single file can contain many objects (e.g. Web documents)
  • Files are divided into fixed-size chunks (64 MB), each identified by a unique chunk handle assigned by the master at creation time (see the sketch after this list)
    • Disk seek time is small compared to transfer time
    • A single file can be larger than a node’s disk space
    • Fixed size makes allocation computations easy
  • Chunks are replicated (≥ 3 replicas) across chunkservers.
  • The master maintains all file system metadata (e.g. mapping between file names and chunk locations)
  • Chunkservers store chunks on local disk as Linux files
    • Reading & writing of data specified by the tuple (chunk_handle, byte_range)
  • Neither the client nor the chunkserver caches file data (Why?)
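
To see why fixed-size chunks make the bookkeeping easy, here is a small sketch (illustrative code, not the actual GFS client): translating a byte offset into a chunk index is plain integer arithmetic, and that index is what the client sends to the master.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

def chunk_index(byte_offset: int) -> int:
    # Which chunk holds this offset? Just integer division.
    return byte_offset // CHUNK_SIZE

def chunk_range(byte_offset: int, length: int) -> range:
    # A read of `length` bytes starting at `byte_offset` may span
    # several consecutive chunks.
    first = chunk_index(byte_offset)
    last = chunk_index(byte_offset + length - 1)
    return range(first, last + 1)

# Example: a 200 MB read starting at offset 100 MB touches chunks 1 to 4.
print(list(chunk_range(100 * 1024**2, 200 * 1024**2)))  # [1, 2, 3, 4]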

GFS reads

  • The client sends a read request (file name, chunk index) to the GFS master
  • The master replies with the chunk handle and the locations of its replicas
  • The client caches this metadata
  • The client sends a data request (chunk handle, byte range) to one of the replicas (sketched below)
  • The corresponding chunkserver returns the requested data
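
A minimal sketch of this read path, assuming hypothetical master_lookup and chunkserver_read RPC helpers (the real GFS RPC layer is not public); note how the metadata cache lets the client keep talking to chunkservers without going back to the master:

import random

CHUNK_SIZE = 64 * 1024 * 1024
metadata_cache = {}  # (file_name, chunk_index) -> (chunk_handle, replica_locations)

def gfs_read(file_name, offset, length, master_lookup, chunkserver_read):
    # For simplicity, assume the read does not cross a chunk boundary.
    index = offset // CHUNK_SIZE
    key = (file_name, index)

    # Contact the master only on a cache miss.
    if key not in metadata_cache:
        metadata_cache[key] = master_lookup(file_name, index)
    chunk_handle, replicas = metadata_cache[key]

    # Ask any replica (e.g. a random or the closest one) for the
    # (chunk_handle, byte_range) it stores.
    replica = random.choice(replicas)
    return chunkserver_read(replica, chunk_handle, offset % CHUNK_SIZE, length)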

GFS writes

  • The client asks the master which chunkserver is the primary replica for the chunk and where the other replicas are
  • The master replies with the location of the primary and of the secondary replicas
  • The client pushes the write data to the buffers of all replica chunkservers
  • Once all replicas have received the data, the client tells the primary to apply the write; the primary assigns a serial number to the write request
  • The primary applies the write to its own chunk and forwards the request, with its serial number, to the secondary replicas
  • The secondary replicas apply the write in serial-number order and report back to the primary
  • The primary sends the confirmation to the client (see the sketch after the figure below)
[Figure: Writes on a GFS filesystem]
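
A minimal sketch of the two write phases, using hypothetical in-memory replica objects: data distribution is separated from the commit, and the primary imposes a single serial order that every replica applies.

class Chunkserver:
    def __init__(self):
        self.buffer = {}                          # write_id -> buffered data
        self.chunk = bytearray(64 * 1024 * 1024)

    def push_data(self, write_id, data):
        # Phase 1: the data flows to every replica but is only buffered.
        self.buffer[write_id] = data

    def apply(self, write_id, offset, serial):
        # Phase 2: apply buffered data in serial-number order.
        data = self.buffer.pop(write_id)
        self.chunk[offset:offset + len(data)] = data
        return serial                             # acknowledgement

class Primary(Chunkserver):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def commit(self, write_id, offset):
        # The primary picks the serial number, applies the write locally,
        # then tells the secondaries to apply it in the same order.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(write_id, offset, serial)
        acks = [s.apply(write_id, offset, serial) for s in self.secondaries]
        return all(ack == serial for ack in acks)  # confirmation to the client

# Client side: push the data everywhere, then ask the primary to commit.
secondaries = [Chunkserver(), Chunkserver()]
primary = Primary(secondaries)
for replica in [primary, *secondaries]:
    replica.push_data("w1", b"some data")
assert primary.commit("w1", offset=0)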

GFS operation

  • The master does not keep a persistent record of chunk locations, but instead queries the chunk servers at startup and then is updated by periodic polling.

  • GFS is a journaled filesystem: all operations are added to a log first, then applied. Periodically, log compaction creates checkpoints. The log is replicated across nodes.

  • If a node fails:

    • If it is the master, external monitoring infrastructure must start a new master elsewhere, which rebuilds its state by replaying the operation log
    • If it is a chunkserver, it just restarts
  • Chunkservers use checksums to detect data corruption (see the sketch after this list)

  • The master maintains a chunk version number to distinguish between up-to-date and stale replicas (replicas that missed mutations while their chunkserver was down)

    • Before an operation on a chunk, the master makes sure that the chunk’s version number has been advanced
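
As an illustration of the checksum bullet above (a sketch, not the actual GFS code): each chunk is divided into 64 KB blocks and the chunkserver keeps a 32-bit checksum per block, which it verifies before returning data to a reader. CRC32 is used here only as an example checksum.

import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksums chunks in 64 KB blocks

def block_checksums(chunk: bytes) -> list:
    # Computed when data is written; kept separately from the data itself.
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verified_read(chunk: bytes, checksums: list, block_no: int) -> bytes:
    # Re-compute and compare the checksum before handing the block to a reader.
    block = chunk[block_no * BLOCK_SIZE:(block_no + 1) * BLOCK_SIZE]
    if zlib.crc32(block) != checksums[block_no]:
        # A real chunkserver would report the corruption to the master,
        # which re-replicates the chunk from a healthy replica.
        raise IOError("checksum mismatch in block %d: corrupted data" % block_no)
    return block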

HDFS – The Hadoop FileSystem

HDFS started at Yahoo! as an open-source implementation of the design in the GFS paper, but since v2.0 it has evolved into a different system.

GFS                                   HDFS
------------------------------------  -------------------------------------------
Master                                NameNode
Chunkserver                           DataNode
operation log                         journal
chunk                                 block
random file writes                    append-only
multiple writers/readers              single writer, multiple readers
chunk: 64 MB data, 32-bit checksums   block: 128 MB data, separate metadata file

HDFS architecture

HDFS architecture diagram
HDFS architecture diagram

The main difference from GFS is that HDFS is a user-space filesystem written in Java.
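
Because HDFS runs in user space, clients do not mount it through the kernel; they go through client libraries, the hadoop fs tool, or HTTP/RPC interfaces. A minimal sketch using the WebHDFS REST interface with the requests library (the host name is an assumption; 9870 is the default NameNode HTTP port in Hadoop 3):

import requests

NAMENODE = "http://namenode.example.com:9870"  # assumed host; adjust for your cluster

# Equivalent of `hadoop fs -ls /`: the NameNode returns the directory listing.
listing = requests.get(NAMENODE + "/webhdfs/v1/?op=LISTSTATUS").json()
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["length"], entry["pathSuffix"])

# Equivalent of reading /odyssey.mb.txt: the NameNode redirects the client
# to a DataNode holding the data, and requests follows the redirect.
data = requests.get(NAMENODE + "/webhdfs/v1/odyssey.mb.txt?op=OPEN")
print(data.content[:100])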

HDFS access session

# List directory
$ hadoop fs -ls /
Found 7 items
drwxr-xr-x   - gousiosg sg          0 2017-10-04 08:23 /audioscrobbler
-rw-r--r--   3 gousiosg sg 1215803135 2017-10-04 08:25 /ghtorrent-logs.txt
-rw-r--r--   3 gousiosg sg   66773425 2017-10-04 08:23 /imdb.csv
-rw-r--r--   3 gousiosg sg     198376 2017-10-23 12:39 /important-repos.csv
-rw-r--r--   3 gousiosg sg     611300 2017-10-04 08:24 /odyssey.mb.txt
-rw-r--r--   3 gousiosg sg  388422973 2017-10-03 15:40 /pullreqs.csv

# Create a 100 MB local file filled with zeros
$ dd if=/dev/zero of=foobar bs=1M count=100

# Upload file
$ hadoop fs -put foobar /
$ hadoop fs -ls /foobar
-rw-r--r--   3 gousiosg sg  104857600 2017-11-27 15:42 /foobar

HDFS looks like a UNIX filesystem, but it does not offer the full set of POSIX operations: for example, files are append-only and cannot be modified in place.


Bibliography

[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proceedings of the nineteenth ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 29–43.
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1–10.