Distributed filesystems
What is a filesystem?
File systems determine how data is stored and retrieved. A file
system keeps track of the following data items:
- Files, which hold the data we want to store
- Directories, which group files together
- Metadata, such as file length, permissions and file
types
The primary job of the filesystem is to make sure that the data is
always accessible and intact. To maintain consistency, most modern
filesystems use a log, also called a journal (remember databases!).
Common filesystems are ext4, NTFS, APFS and ZFS.
Distributed file systems
A distributed filesystem is a file system which is shared by being
simultaneously mounted on multiple servers. Data in a distributed file
system is partitioned and replicated across the
network. Reads and writes can occur on any node.
Q: Windows and MacOSX already have network
filesystems. Why do we need extra ones?
Network file systems, such as CIFS and NFS, are not distributed
filesystems, because they rely on a centralized server to maintain
consistency. The server becomes a single point of failure (SPOF) and a
performance bottleneck.
Big data revolution and large-scale file-system organization
Big data workloads need a large, distributed, highly fault-tolerant
file system to store the data and to process queries over it.
Seminal papers:
- “The Google file system.” By Sanjay Ghemawat, Howard
Gobioff, and Shun-Tak Leung (2003)
- “MapReduce: Simplified data processing on large clusters.”
By Jeffrey Dean and Sanjay Ghemawat (2004)
- “The Hadoop distributed file system” by Konstantin
Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (2010)
The Google filesystem
The Google Filesystem paper[1]
kicked off the Big Data revolution. Why did Google need it though?
- Hardware failures are common (commodity hardware)
- Files are large (GB/TB) and their number is limited (millions, not
billions)
- Two main types of reads: large streaming reads and small random
reads
- Workloads with sequential writes that append data to files
- Once written, files are seldom modified; they are mostly read, often
sequentially
- High sustained bandwidth trumps low latency
Large-scale distributed file systems:
- Google File System (GFS)
- Hadoop Distributed File System (HDFS)
- CloudStore
- Amazon Simple Storage Service (Amazon S3)
GFS architecture
[Figure: GFS architecture]
GFS storage model
- A single file can contain many objects (e.g. Web documents)
- Files are divided into fixed-size chunks (64MB) with unique
identifiers, generated at insertion time
  - Disk seek time is small compared to transfer time
  - A single file can be larger than a node’s disk space
  - Fixed size makes allocation computations easy
- Files are replicated (\(\ge 3\))
across chunkservers
- The master maintains all file system metadata (e.g. mapping
between file names and chunk locations)
- Chunkservers store chunks on local disk as Linux files
- Reading & writing of data is specified by the tuple
(chunk_handle, byte_range)
- Neither the client nor the chunkserver caches file data
(Why?)
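The fixed chunk size makes it cheap to translate a byte range in a file into per-chunk byte ranges. The following is a minimal sketch of that arithmetic, assuming the 64MB chunk size mentioned above. The helper name `split_into_chunk_ranges` is hypothetical and not part of any GFS API; in a real system the client would additionally ask the master for the chunk handle behind each chunk index.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB fixed chunk size, as in the notes


def split_into_chunk_ranges(offset, length, chunk_size=CHUNK_SIZE):
    """Map a file byte range to (chunk_index, start, end) tuples.

    Hypothetical helper: only the offset arithmetic is shown; resolving a
    chunk index to a chunk handle is the master's job and is omitted here.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    ranges = []
    end = offset + length
    while offset < end:
        index = offset // chunk_size          # which chunk the offset falls in
        start = offset % chunk_size           # position inside that chunk
        stop = min(chunk_size, start + (end - offset))
        ranges.append((index, start, stop))
        offset += stop - start                # continue in the next chunk
    return ranges


if __name__ == "__main__":
    # A 10-byte read that straddles the boundary between chunks 0 and 1.
    print(split_into_chunk_ranges(CHUNK_SIZE - 4, 10))
```

A range that crosses a chunk boundary yields one tuple per chunk, so each piece can be requested from a different chunkserver.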
GFS reads
- The client sends a read request (file name and chunk index) to the
GFS master
- The master replies with the chunk handle and the replica locations
- The client caches this metadata
- The client sends a data request to one of the replicas
- The corresponding chunkserver returns the requested data
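The read steps above can be sketched with a small in-memory model. This is an illustration under simplifying assumptions, not GFS code: the classes `Master`, `Chunkserver` and `Client` are hypothetical, a read never crosses a chunk boundary, and the client picks a replica at random instead of preferring a nearby one.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as in the notes


class Chunkserver:
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes stored for that chunk

    def read(self, handle, start, end):
        return self.chunks[handle][start:end]


class Master:
    def __init__(self):
        # (file name, chunk index) -> (chunk handle, list of replica servers)
        self.table = {}

    def lookup(self, name, index):
        return self.table[(name, index)]


class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}          # cached metadata, keyed like Master.table
        self.master_lookups = 0  # counts how often the master was contacted

    def read(self, name, offset, length):
        key = (name, offset // CHUNK_SIZE)
        if key not in self.cache:                 # steps 1-3: ask master, cache reply
            self.cache[key] = self.master.lookup(*key)
            self.master_lookups += 1
        handle, replicas = self.cache[key]
        start = offset % CHUNK_SIZE
        replica = random.choice(replicas)         # step 4: pick one replica (assumption)
        return replica.read(handle, start, start + length)  # step 5


if __name__ == "__main__":
    servers = [Chunkserver() for _ in range(3)]
    for server in servers:
        server.chunks["h1"] = b"hello, gfs"
    master = Master()
    master.table[("/logs/a", 0)] = ("h1", servers)
    client = Client(master)
    print(client.read("/logs/a", 7, 3))
    print(client.read("/logs/a", 0, 5))
    print(client.master_lookups)
```

The second read is served from the client's metadata cache, so the master is contacted only once; file data itself always comes from a chunkserver.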
GFS writes
- The client asks the master which chunkserver holds the primary
replica of the chunk (the master designates one if none exists)
- The master replies with the locations of the primary and the
secondary replicas
- The client pushes the data to all replicas, which hold it in their
chunkserver buffers
- Once all replicas have received the data, the client tells the
primary replica to begin the write (the primary assigns serial numbers
to write requests)
- The primary replica writes the data to the appropriate chunk and
forwards the write to the secondary replicas
- The secondary replicas complete the write and report back
to the primary
- The primary sends the confirmation to the client
[Figure: Writes on a GFS filesystem]
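The write protocol separates pushing the data from ordering the writes. The sketch below illustrates that split under stated assumptions: `Replica`, `Primary` and `client_write` are hypothetical names, the master lookup is omitted, every write is modelled as an append to a single chunk, and failures are not handled.

```python
class Replica:
    def __init__(self):
        self.buffer = {}          # data id -> bytes pushed by the client, not yet applied
        self.chunk = bytearray()  # contents of the (single) chunk
        self.applied = []         # serial numbers, in the order they were applied

    def push(self, data_id, data):
        """Data flow: keep the bytes in a buffer until told to apply them."""
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        """Control flow: apply buffered data in the order chosen by the primary."""
        self.chunk += self.buffer.pop(data_id)
        self.applied.append(serial)
        return True  # acknowledgement


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial      # the primary decides the order of writes
        self.next_serial += 1
        self.apply(serial, data_id)    # apply locally first
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        return all(acks)               # confirmation sent back to the client


def client_write(primary, data_id, data):
    for replica in [primary, *primary.secondaries]:
        replica.push(data_id, data)    # push data to every replica's buffer
    return primary.write(data_id)      # then ask the primary to commit it


if __name__ == "__main__":
    secondaries = [Replica(), Replica()]
    primary = Primary(secondaries)
    client_write(primary, "w1", b"abc")
    client_write(primary, "w2", b"def")
    print(bytes(primary.chunk), [bytes(s.chunk) for s in secondaries])
```

Because every replica applies writes in the serial order chosen by the primary, all replicas end up with identical chunk contents.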
GFS operation
The master does not keep a persistent record of chunk locations.
Instead, it queries the chunkservers at startup and is then kept up to
date by periodic polling.
GFS is a journaled filesystem: all metadata operations are added
to a log first, then applied. Periodically, log compaction creates
checkpoints, so that recovery only replays the operations logged after
the latest checkpoint. The log is replicated across nodes.
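The log-then-apply discipline and checkpointing can be illustrated with a small in-memory model. This is a sketch under assumptions, not the GFS implementation: the class `MetadataStore`, its operation names and its data layout are hypothetical, and nothing is written to disk or replicated.

```python
class MetadataStore:
    def __init__(self):
        self.state = {}       # file name -> list of chunk handles
        self.log = []         # operations recorded since the last checkpoint
        self.checkpoint = {}  # snapshot of the state at the last compaction

    def mutate(self, op, name, handle=None):
        self.log.append((op, name, handle))        # 1. record the operation
        self._apply(self.state, op, name, handle)  # 2. only then apply it

    @staticmethod
    def _apply(state, op, name, handle):
        if op == "create":
            state[name] = []
        elif op == "add_chunk":
            state[name].append(handle)
        elif op == "delete":
            state.pop(name, None)
        else:
            raise ValueError(f"unknown operation: {op}")

    def compact(self):
        """Take a checkpoint of the current state and truncate the log."""
        self.checkpoint = {k: list(v) for k, v in self.state.items()}
        self.log.clear()

    def recover(self):
        """Rebuild the state from the checkpoint plus the remaining log."""
        state = {k: list(v) for k, v in self.checkpoint.items()}
        for op, name, handle in self.log:
            self._apply(state, op, name, handle)
        return state


if __name__ == "__main__":
    store = MetadataStore()
    store.mutate("create", "/a")
    store.mutate("add_chunk", "/a", "h1")
    store.compact()
    store.mutate("add_chunk", "/a", "h2")
    store.mutate("create", "/b")
    print(store.recover() == store.state)
```

After compaction the log holds only the two later operations, yet `recover()` still reproduces the full state, which is why checkpoints keep recovery time bounded.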
If a node fails:
- If it is a master, external instrumentation is required to start it
somewhere else, by rerunning the operation log
- If it is a chunkserver, it just restarts
Chunkservers use checksums to detect data corruption
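As an illustration of checksum-based corruption detection, the sketch below keeps one 32-bit checksum per 64KB block of a chunk, which is the granularity described in the GFS paper. Using CRC32 via `zlib.crc32` is an assumption made for this example, and the function names are hypothetical.

```python
import zlib
from itertools import zip_longest

BLOCK_SIZE = 64 * 1024  # checksum granularity: one checksum per 64KB block


def block_checksums(data, block_size=BLOCK_SIZE):
    """Return one 32-bit checksum per block (CRC32 is an assumption here)."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def find_corrupt_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return indices of blocks whose checksum differs from the stored one.

    Missing or extra blocks (e.g. after truncation) are reported as corrupt.
    """
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip_longest(actual, expected)) if a != e]


if __name__ == "__main__":
    data = bytearray(200_000)              # four blocks: 3 full, 1 partial
    stored = block_checksums(bytes(data))  # checksums computed at write time
    data[70_000] ^= 0xFF                   # simulate corruption inside block 1
    print(find_corrupt_blocks(bytes(data), stored))
```

Checking per block rather than per chunk means a read only needs to verify the blocks it touches, and a corrupt block can be located without rereading the whole 64MB chunk.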
The master maintains a chunk version number to distinguish
between up-to-date and stale replicas (those which missed some mutations
while their chunkserver was down).
- Before a mutation on a chunk, the master ensures that the version
number is advanced on the master and on all up-to-date replicas
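Stale-replica detection with version numbers can be sketched as follows. This is a simplified model under assumptions: the class `ChunkVersionTable` and its method names are hypothetical, and the version is advanced whenever the master authorizes a new round of mutations on a chunk.

```python
class ChunkVersionTable:
    def __init__(self):
        self.master_version = {}   # chunk handle -> latest version known to the master
        self.replica_version = {}  # (chunk handle, server) -> version held by that replica

    def add_replica(self, handle, server):
        self.master_version.setdefault(handle, 0)
        self.replica_version[(handle, server)] = self.master_version[handle]

    def advance(self, handle, live_servers):
        """Advance the version before a mutation; only reachable replicas follow."""
        self.master_version[handle] += 1
        for server in live_servers:
            if (handle, server) in self.replica_version:
                self.replica_version[(handle, server)] = self.master_version[handle]

    def stale_replicas(self, handle):
        """Replicas whose version lags behind the master's are stale."""
        current = self.master_version[handle]
        return sorted(server
                      for (h, server), version in self.replica_version.items()
                      if h == handle and version < current)


if __name__ == "__main__":
    table = ChunkVersionTable()
    for server in ("cs1", "cs2", "cs3"):
        table.add_replica("h1", server)
    table.advance("h1", live_servers=["cs1", "cs2"])  # cs3 is down and misses this
    print(table.stale_replicas("h1"))
```

When the chunkserver that was down comes back and reports its old version, the master can tell that its replica missed mutations and must not be served to clients.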
HDFS – The Hadoop FileSystem
HDFS started at Yahoo as an open-source implementation of the design in
the GFS paper, but since v2.0 it is a different system.
GFS | HDFS
Master | NameNode
Chunkserver | DataNode
operation log | journal
chunk | block
random file writes | append-only
multiple writer/reader | single writer, multiple readers
chunk: 64MB data, 32bit checksums | block: 128MB data, separate metadata file
HDFS architecture
[Figure: HDFS architecture diagram]
Like GFS, HDFS runs in user space on top of each node's local
filesystem. A notable implementation difference is that HDFS is written
in Java.
HDFS access session
# List directory
$ hadoop fs -ls /
Found 7 items
drwxr-xr-x - gousiosg sg 0 2017-10-04 08:23 /audioscrobbler
-rw-r--r-- 3 gousiosg sg 1215803135 2017-10-04 08:25 /ghtorrent-logs.txt
-rw-r--r-- 3 gousiosg sg 66773425 2017-10-04 08:23 /imdb.csv
-rw-r--r-- 3 gousiosg sg 198376 2017-10-23 12:39 /important-repos.csv
-rw-r--r-- 3 gousiosg sg 611300 2017-10-04 08:24 /odyssey.mb.txt
-rw-r--r-- 3 gousiosg sg 388422973 2017-10-03 15:40 /pullreqs.csv
# Create a new file
$ dd if=/dev/zero of=foobar bs=1M count=100
# Upload file
$ hadoop fs -put foobar /
-rw-r--r-- 3 gousiosg sg 104857600 2017-11-27 15:42 /foobar
HDFS looks like a UNIX filesystem, but does not offer the full set of
operations.
Bibliography
[1] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proceedings of the nineteenth ACM symposium on operating systems principles, 2003, pp. 29–43.
[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in IEEE 26th symposium on mass storage systems and technologies (MSST), 2010, pp. 1–10.