Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS), inspired by Google's GFS, is a distributed filesystem that runs on low-cost commodity hardware and redundantly stores terabyte-scale and larger data sets in a fault-tolerant manner. Its architecture is broadly similar to that of other distributed file systems. The main objectives of HDFS are high availability, fault tolerance, and high-throughput I/O access to data. Reliability is another key goal, since in a cluster composed of a large number of nodes, hardware failure is the norm rather than the exception. HDFS differs from GFS in many fundamental ways but shares many of its goals, the most important of which is to provide reliable distributed access to large amounts of data.

Typically, this file system runs on the same nodes as the MapReduce framework so that tasks can execute local to the data. Data is stored in blocks of 64 MB or larger on a set of DataNodes, which serve the data to applications; there is typically one DataNode per node in the cluster. A master NameNode maintains all of the metadata for the data stored on the DataNodes and coordinates both access to the data and its replication. The NameNode keeps this metadata in RAM for fast access but also persists an image to disk; if the NameNode fails, this image can be used to restart it. The DataNodes themselves store no metadata about the data they contain; each simply stores every block as a separate file on its local file system. When the NameNode splits and distributes data, it does not respect record boundaries, so the first and last records in a split are often truncated. During job execution, the remaining portion of a truncated record is fetched from the DataNode holding the adjacent block.
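The handling of truncated records can be sketched as follows. This is a simplified model of how a MapReduce record reader (in the style of Hadoop's LineRecordReader) deals with records that straddle split boundaries: a split that does not start at offset 0 skips its partial first record, because that record belongs to the previous split, and the reader finishes any record that runs past the split's end. The class and method names here are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of record-boundary handling across split boundaries.
public class SplitReader {

    // Return the newline-delimited records owned by the split [start, end).
    public static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // A split not beginning at offset 0 skips to the byte after the next
        // newline: the truncated record there belongs to the previous split.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline itself
        }
        // Read whole records as long as the record *starts* inside the split;
        // the final record may run past 'end' (a remote fetch in real HDFS).
        while (pos < end && pos < data.length) {
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, recStart, pos - recStart));
            pos++; // skip the trailing newline
        }
        return records;
    }
}
```

Note that every record is processed by exactly one split: the record cut in half at a boundary is read in full by the split in which it starts.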


The HDFS architecture is designed for write-once, read-many access. Once a file is closed, it is replicated across the cluster and cannot be written to again. The system is optimized for high sustained-throughput streaming reads: it is faster to read an entire 64 MB block sequentially, filtering out information that isn't needed, than to perform low-latency seeks to specific locations in that block to read the same information.

The Apache team identifies three types of failures: NameNode failures, DataNode failures, and network partitions. The NameNode is a single point of failure; if it fails, it must be manually restarted from the most recent image. The image may be replicated on multiple machines to guard against disk failure on the NameNode. In the case of a network partition or a DataNode failure, data on the affected DataNodes becomes unavailable and is re-replicated on the surviving DataNodes. For this reason, a heartbeat message is exchanged across the network during execution to track the availability of the DataNodes. Data in HDFS can be accessed using the command line, a Java API, or a C-language wrapper around that same API. From the user's perspective, data access is similar to that of a normal file system; the storage details discussed in this section are hidden from the user. The Java API for accessing HDFS data closely resembles the APIs used for ordinary file access, including implementations of input and output streams that allow buffered reads and writes in the standard Java style.
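As a sketch of that familiarity, the buffered read/write pattern an HDFS client uses is the same one used against a local file system with plain java.io/java.nio streams. The local-filesystem analogue below illustrates the pattern; with Hadoop on the classpath, `FileSystem.open(...)` and `FileSystem.create(...)` (returning `FSDataInputStream`/`FSDataOutputStream`) would take the place of the local stream factories.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-filesystem analogue of the buffered stream pattern an HDFS client uses.
public class StreamDemo {

    public static String writeAndReadBack(String text) {
        try {
            Path tmp = Files.createTempFile("hdfs-demo", ".txt");
            // Buffered write, as with an output stream from FileSystem.create(...)
            try (BufferedWriter w = Files.newBufferedWriter(tmp)) {
                w.write(text);
            }
            // Buffered read, as with an input stream from FileSystem.open(...)
            StringBuilder sb = new StringBuilder();
            try (BufferedReader r = Files.newBufferedReader(tmp)) {
                String line;
                while ((line = r.readLine()) != null) sb.append(line);
            }
            Files.delete(tmp);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```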

Concepts and Design

HDFS is a distributed file system designed to store very large files reliably across cluster nodes built from commodity hardware. It uses a master/slave architecture, which simplifies the design considerably. The following features give a more detailed picture of how HDFS works.

File System

HDFS provides applications with a distributed file system able to store and process very large files; it is not uncommon for a Hadoop cluster to store files that are terabytes in size. HDFS resembles other file systems in that it offers the familiar directory-and-file structure. Moreover, it provides several features that optimize operation on extremely large data sets.

Commodity Hardware

Hadoop scales out on commodity hardware, which is a cost-effective way to add storage space and computational power. However, as the number of cluster nodes grows, so does the chance of node failure. Fortunately, HDFS is designed with these hardware failures in mind, so there is no need to bring the cluster down for recovery: HDFS takes care of the consistency and availability of the data when a failed cluster node has to be replaced.

Block Size

The concept of block storage appears in many file systems. Usually, file system blocks are a few kilobytes in size, the smallest unit that can be loaded into memory in a single read operation. Because HDFS must deal with very large files, however, its default block size is 128 MB. A file is split into multiple blocks during the write operation, and the blocks are distributed across the cluster nodes. The block is the unit of data on which a job executes, which makes it central to block placement: by placing blocks in the Hadoop cluster strategically, one can guide the execution of the MapReduce layer.
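The splitting step can be sketched as follows: a file is cut into fixed-size blocks, and each (offset, length) pair stands for one block that the NameNode would assign to a set of DataNodes. The class name is illustrative; 128 MB is the real HDFS default block size.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of cutting a file into fixed-size HDFS blocks on write.
public class BlockPlanner {
    public static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

    // Return the (offset, length) of every block for a file of the given size.
    public static List<long[]> planBlocks(long fileSize, long blockSize) {
        List<long[]> blocks = new ArrayList<>();
        for (long off = 0; off < fileSize; off += blockSize) {
            // Every block is full-size except, possibly, the last one.
            blocks.add(new long[] { off, Math.min(blockSize, fileSize - off) });
        }
        return blocks;
    }
}
```

For example, a 300 MB file yields two 128 MB blocks plus one 44 MB tail block, each of which can then be replicated and placed independently.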

Streaming Data Access

Although HDFS provides both read and write operations, it is built around the concept of write once, read many times. Once a file has been written it cannot be modified, only appended to. This is because data is usually copied into Hadoop once and then used for analysis many times, so read throughput is very important. Users generally want to analyze the complete data set, which makes sustained high throughput more important than low data-access latency.

Data Integrity

Data can get corrupted because of storage hardware faults, network faults, or data degradation. To protect data against such failures, HDFS maintains a checksum for every block stored in the cluster and compares the recomputed checksum against the saved value on every block read request. If the values do not match, HDFS marks the block as corrupt and fetches a replica from another node; eventually, the corrupted block is replaced by a correct replica.
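The verification step can be sketched like this. HDFS actually checksums data per chunk (512 bytes by default, `dfs.bytes-per-checksum`) using CRC32C; the sketch below uses the JDK's own `CRC32C` to show the compute-on-write, verify-on-read pattern. The class and method names are illustrative.

```java
import java.util.zip.CRC32C;

// Sketch of HDFS-style checksum verification: compute on write, recompute
// and compare on read; a mismatch marks the replica as corrupt.
public class ChecksumVerifier {

    // Checksum computed when a chunk is first written.
    public static long checksum(byte[] chunk) {
        CRC32C crc = new CRC32C();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    // On read: recompute and compare against the stored value. A 'false'
    // result would trigger a fetch of another replica in real HDFS.
    public static boolean isValid(byte[] chunk, long storedChecksum) {
        return checksum(chunk) == storedChecksum;
    }
}
```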

File Read & Write

Applications add data to the Hadoop Distributed File System by creating a file and writing data to it. Once the file is closed, the bytes already written cannot be changed or deleted; new data can only be added by reopening the file for append. HDFS works on the principle of a single writer and multiple readers.

The HDFS client that opens a file for writing is granted a fixed-duration lease on the file; while the lease is held, no other client can write to the file. The client holding the lease renews it periodically by sending a heartbeat signal to the NameNode, and the lease is revoked once the file is closed.

The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writing client is assured exclusive access to the file. If the soft limit expires and the client has neither closed the file nor renewed the lease, another client may preempt the lease in order to write to the file. If the hard limit expires and the writing client has still not renewed the lease, HDFS assumes the client has quit, closes the file on the writer's behalf, and revokes the lease.
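The soft/hard-limit decision above can be sketched as a small state check. The actual HDFS defaults are a soft limit of 1 minute and a hard limit of 1 hour; both are treated as parameters here, and the class and enum names are illustrative.

```java
// Sketch of the NameNode's lease decision, driven by the time since the
// writer's last heartbeat-driven renewal.
public class LeaseMonitor {
    public enum Action { WRITER_KEEPS_LEASE, ANOTHER_CLIENT_MAY_PREEMPT, NAMENODE_CLOSES_FILE }

    public static Action check(long nowMillis, long lastRenewalMillis,
                               long softLimitMillis, long hardLimitMillis) {
        long age = nowMillis - lastRenewalMillis;
        if (age < softLimitMillis) {
            return Action.WRITER_KEEPS_LEASE;          // exclusive access assured
        }
        if (age < hardLimitMillis) {
            return Action.ANOTHER_CLIENT_MAY_PREEMPT;  // soft limit expired
        }
        // Hard limit expired: close the file on the writer's behalf.
        return Action.NAMENODE_CLOSES_FILE;
    }
}
```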

The DataNodes form a pipeline in the cluster whose order minimizes the total network distance from the client to the last DataNode. Bytes are pushed into the pipeline as a sequence of packets: the bytes the client writes are first buffered, and once the application buffer is full, the data is pushed into the pipeline. The next packet may be pushed into the pipeline before acknowledgments for the previous packets have been received.
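A minimal sketch of that pipeline, assuming the default client write packet size of 64 KB: the buffered bytes are cut into packets, and each packet is forwarded hop by hop through the ordered DataNodes, which store and relay it. The class name and the string-based "delivery" model are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the HDFS write pipeline: buffered data is packetized and each
// packet is forwarded node-to-node through the ordered DataNodes.
public class WritePipeline {
    public static final int DEFAULT_PACKET_SIZE = 64 * 1024; // 64 KB

    // Number of packets needed to carry 'bytes' of buffered data.
    public static int packetCount(long bytes, int packetSize) {
        return (int) ((bytes + packetSize - 1) / packetSize); // ceiling division
    }

    // Simulate pushing one packet through the pipeline; each DataNode stores
    // the packet locally and forwards it to the next node in the chain.
    public static List<String> forward(String packet, List<String> dataNodes) {
        List<String> deliveries = new ArrayList<>();
        for (String node : dataNodes) {
            deliveries.add(packet + "->" + node); // store-and-forward at each hop
        }
        return deliveries;
    }
}
```

Because acknowledgments flow back along the same pipeline, several packets can be in flight at once, which keeps the links busy and sustains write throughput.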

HDFS only guarantees that data is visible to readers once the file being written has been closed. If an application needs the data to be visible before then, it can call hflush.

Written by Varun Kumar

Varun works with Microsoft as a Cloud Consultant. He has 10+ years of experience in consultant, solution architect, and delivery management roles. As a consultant at Microsoft, his job is to design, develop, and deploy enterprise-level solutions on Azure to help organizations achieve more.
