The Hadoop Distributed File System (HDFS), inspired by Google's GFS, is a distributed file system that runs on low-cost commodity hardware in a fault-tolerant manner to redundantly store terabyte-scale and larger data sets. Its architecture is quite similar to that of other distributed file systems. The main objectives of HDFS include high availability, fault tolerance, and high-speed I/O access to data. Another key goal is high reliability, since hardware failure in a cluster composed of a large number of nodes is the norm rather than the exception. HDFS differs from GFS in many fundamental ways but shares many of its goals, the most important of which is to provide reliable distributed access to large amounts of data.
Typically, this file system runs on the same nodes as the MapReduce framework so that tasks can execute locally to the data. Data is stored in 64 MB or larger blocks on a set of DataNodes, which serve the data to applications; there is typically one DataNode per cluster node. A master NameNode maintains all of the metadata associated with the data stored on the DataNodes and is responsible for coordinating access to the data and coordinating its replication. The NameNode stores all of this information in RAM for fast access, but keeps an image on disk in case of failure; if the NameNode fails, this image can be used to restart it. The DataNodes themselves store no metadata about the data they contain: they simply store each block of data as a separate file on the local file system. When the NameNode splits and distributes data, it does not respect record boundaries, so it is often the case that the first and last records in a split are truncated. In the case of truncated records, DataNodes transfer the remaining portion of the truncated record to the appropriate machine at runtime.
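Because HDFS splits files at fixed byte offsets, a record that happens to straddle a block boundary is cut in two. The following is a minimal illustrative sketch, not the actual HDFS implementation; the newline-delimited record format and the tiny block size are hypothetical, chosen only to make the truncation visible:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split a byte sequence at fixed offsets, as HDFS does,
// without regard to the newline-delimited records inside it.
public class BlockSplitDemo {
    static List<String> split(String data, int blockSize) {
        List<String> blocks = new ArrayList<>();
        for (int i = 0; i < data.length(); i += blockSize) {
            blocks.add(data.substring(i, Math.min(i + blockSize, data.length())));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Three 10-byte records split into 16-byte "blocks": the second
        // record is truncated across the first block boundary.
        List<String> blocks = split("record-01\nrecord-02\nrecord-03\n", 16);
        System.out.println(blocks.get(0)); // ends mid-record
    }
}
```

The first block here ends in the middle of the second record; at runtime the framework must reunite the two halves before the record can be processed.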
The HDFS architecture is designed for write-once, read-many access. Once a file is closed, it is replicated across the cluster and cannot be written to again. The system is optimized for high sustained-throughput streaming reads. This means it is faster to read an entire 64 MB block, filtering out the information that is not needed, than it is to perform low-latency seeks to specific locations in that block to read the same information.
The Apache team identifies three types of failures: NameNode failures, DataNode failures, and network partitions. The NameNode is a single point of failure; if the NameNode fails, it must be manually restarted using the most recent image. The image may be replicated on multiple machines to guard against disk failure on the NameNode. In the case of a network partition or a DataNode failure, data on the affected DataNodes becomes unavailable and is re-replicated on the surviving DataNodes. For this reason, a heartbeat message is repeated across the network during execution to track the availability of the DataNodes. Data in HDFS can be accessed using the command line, a Java API, or a C-language wrapper for that same API. From the user's perspective, data access is similar to that of a normal file system; the details of HDFS data storage discussed in this section are hidden from the user. The Java API for accessing HDFS data is very similar to the APIs used for normal file access, providing all of the familiar operations, including implementations of input and output streams that allow buffered reads and writes using the standard Java API.
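The heartbeat-based failure detection described above amounts to a timeout check on each DataNode's last report. A minimal sketch of that idea follows; the class, its method names, and the timeout value are all hypothetical, not Hadoop's actual internals:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of NameNode-style liveness tracking: a DataNode is
// considered dead when no heartbeat has arrived within the timeout window,
// at which point its blocks would be re-replicated on surviving nodes.
public class HeartbeatTracker {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMillis;

    HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void recordHeartbeat(String dataNodeId, long nowMillis) {
        lastHeartbeat.put(dataNodeId, nowMillis);
    }

    // Alive only if the node has reported at least once and its most
    // recent heartbeat is within the timeout window.
    boolean isAlive(String dataNodeId, long nowMillis) {
        Long last = lastHeartbeat.get(dataNodeId);
        return last != null && nowMillis - last <= timeoutMillis;
    }
}
```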
Concepts and Design
HDFS is a distributed file system designed to store very large files reliably across cluster nodes built from commodity hardware. HDFS uses a master/slave architecture, which simplifies the design to a great extent. The following subsections describe the key HDFS features needed for a detailed understanding of how it works.
HDFS provides applications with a distributed file system capable of storing and processing very large files; it is not uncommon for a Hadoop cluster to store files that are terabytes in size. HDFS is similar to other file systems in that it offers the familiar concepts of directories and files. Moreover, it provides various features that optimize operations on extremely large data sets.
Hadoop uses commodity hardware to scale out, which is a cost-effective way to add more storage space and computational power. However, as the number of cluster nodes increases, so does the chance of node failure. Fortunately, HDFS is designed with these hardware failures in mind, so there is no need to bring the cluster down for recovery. HDFS takes care of the consistency and availability of data in the event of a failure where a cluster node needs to be replaced.
The concept of block storage is found in many file systems. Usually, file system blocks are a few kilobytes in size, a block being the smallest unit that can be loaded into memory in a single read operation. However, since HDFS must handle very large files, its default block size is 128 MB. A file is split into multiple blocks during write operations, and the blocks are distributed across the cluster nodes. A block is the unit of data on which a task executes, which plays an important role in our proposed block placement approach: by placing blocks in the Hadoop cluster strategically, we can guide the execution of the MapReduce layer.
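The arithmetic of fixed-size blocks is simple but worth making concrete. The sketch below (illustrative helper methods, not Hadoop code) computes how many 128 MB blocks a file occupies and how large the final, possibly partial, block is:

```java
// Illustrative block arithmetic using HDFS's default 128 MB block size.
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB default

    // Ceiling division: a 300 MB file occupies 3 blocks.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Size of the last block: the remainder, or a full block if the
    // file size is an exact multiple of the block size.
    static long lastBlockSize(long fileSizeBytes) {
        long rem = fileSizeBytes % BLOCK_SIZE;
        return rem == 0 ? BLOCK_SIZE : rem;
    }
}
```

For example, a 300 MB file is stored as two full 128 MB blocks plus one 44 MB block; note that the final partial block consumes only 44 MB on disk, not a full block.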
Streaming Data Access
Although HDFS provides both read and write operations, it is built around the concept of write once, read many times. Once a file has been written it cannot be modified, only appended to. This is because data is usually copied into Hadoop once and then used for analysis many times, so read throughput is what matters most. Users generally want to analyze the complete data set; thus, delivering high throughput is more important than low data access latency.
Data can become corrupted because of storage hardware faults, network faults, or data degradation. To protect data against such failures, HDFS maintains a checksum for every block stored in the cluster. On every block read request, HDFS compares the block's recomputed checksum against the saved value. If the values do not match, HDFS marks the block as invalid and fetches a replica from another node. Eventually, the corrupted block is replaced by a correct replica.
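The verify-on-read step can be sketched with a standard CRC32 from the Java standard library. This is a simplification: HDFS actually checksums small chunks within a block rather than one checksum per whole block, and the class below is hypothetical:

```java
import java.util.zip.CRC32;

// Illustrative sketch of per-block checksum verification. HDFS stores a
// checksum alongside each block and re-checks it on every read.
public class BlockChecksum {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block);
        return crc.getValue();
    }

    // On read, recompute and compare against the stored value; a mismatch
    // marks the block invalid so a replica is fetched from another node.
    static boolean isValid(byte[] block, long storedChecksum) {
        return checksum(block) == storedChecksum;
    }
}
```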
File Read & Write
Applications can add data to the Hadoop Distributed File System by creating a file and writing data to it. Once the file is closed, the bytes already written cannot be changed or deleted; new data can only be added by reopening the file for append. HDFS works on the principle of a single writer and multiple readers.
The HDFS client that opens a file for writing is granted a fixed-duration lease on the file. During this period, no other client can write to the file. The client holding the lease periodically renews it by sending a heartbeat signal to the NameNode. Once the file is closed, the lease is revoked.
The duration of the lease is bound by a soft limit and a hard limit. The client writing data to the file is guaranteed exclusive access until the soft limit expires. If the soft limit expires and the client has been unable to close the file or renew the lease, another client can preempt the lease to write data to the file. If the hard limit expires and the writing client has still failed to renew the lease, HDFS assumes that the client has quit, automatically closes the file on the writer's behalf, and revokes the lease.
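The two-limit lease logic above can be sketched as a small state check. The class and the limit values below are hypothetical; HDFS's actual limits are configured on the server side:

```java
// Illustrative sketch of the soft/hard lease limits. The limit values are
// assumptions for illustration, not HDFS's configured defaults.
public class Lease {
    static final long SOFT_LIMIT_MS = 60_000L;     // assumed soft limit: 1 minute
    static final long HARD_LIMIT_MS = 3_600_000L;  // assumed hard limit: 1 hour

    private long lastRenewed;

    Lease(long nowMillis) { this.lastRenewed = nowMillis; }

    // The writer's heartbeat to the NameNode resets both clocks.
    void renew(long nowMillis) { this.lastRenewed = nowMillis; }

    // Before the soft limit: the holder has exclusive write access.
    boolean holderHasExclusiveAccess(long nowMillis) {
        return nowMillis - lastRenewed < SOFT_LIMIT_MS;
    }

    // After the soft limit: another client may preempt the lease.
    boolean canBePreempted(long nowMillis) {
        return nowMillis - lastRenewed >= SOFT_LIMIT_MS;
    }

    // After the hard limit: HDFS closes the file on the writer's behalf.
    boolean expiredHard(long nowMillis) {
        return nowMillis - lastRenewed >= HARD_LIMIT_MS;
    }
}
```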
The DataNodes form a pipeline in the cluster, whose order minimizes the total network distance from the client to the last DataNode. Bytes are pushed into the pipeline as a sequence of packets. The bytes that the client writes are first buffered; after the application buffer is filled, the data is pushed into the pipeline. The next packet can be pushed into the pipeline before the acknowledgments for the previous packets have been received.
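The buffer-then-push behavior can be sketched as follows. This is an illustrative model, not the HDFS client code: the packet size here is tiny for readability (real packets are on the order of tens of kilobytes), and the in-flight queue stands in for the packets awaiting acknowledgment:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative client-side write path: bytes are buffered, cut into
// fixed-size packets, and pushed into the pipeline without waiting for
// per-packet acknowledgments; unacknowledged packets are tracked in order.
public class PacketPipeline {
    static final int PACKET_SIZE = 4; // tiny for illustration only

    private final StringBuilder buffer = new StringBuilder();
    final Deque<String> inFlight = new ArrayDeque<>(); // pushed, not yet acked

    void write(String bytes) {
        buffer.append(bytes);
        // Push every full packet immediately; a partial packet stays buffered.
        while (buffer.length() >= PACKET_SIZE) {
            inFlight.add(buffer.substring(0, PACKET_SIZE));
            buffer.delete(0, PACKET_SIZE);
        }
    }

    // The pipeline acknowledged the oldest outstanding packet.
    void ack() { inFlight.poll(); }
}
```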
HDFS only guarantees that data is visible to readers once the file being written has been closed. If a user application needs a visibility guarantee before that point, it can use the hflush operation.