Apache Hadoop

Apache Hadoop was initially developed by “Doug Cutting” in 2005 because he needed a faster data processing framework for the web crawler project called Nutch. Based on the MapReduce paper which was published by Google in 2004, he replaced the existing data processing system with Hadoop. Yahoo! decided to invest in the Hadoop development  project and agreed to make it an open source platform. Today Hadoop is expanding under the umbrella of the Apache Software Foundation and is used by many giants across the globe.

The Apache Hadoop MapReduce project is a distributed computing framework that enables developers to write applications using simple programming models, which run reliably on a large number of unreliable machines with the goal of processing Terabyte and larger data sets in parallel clusters consisting of thousands of nodes. Instead of using large scale server hardware it uses commodity hardware to process and store data. Hadoop is an open source implementation of the MapReduce framework inspired by Google MapReduce, and the Google File System (GFS), although the two systems are very different. A major part of Hadoop is written in Java language. However, for performance reasons, critical code has been written in C. Hadoop can be run on various Linux versions and since version 2.2.0 also supported on Windows.

In Hadoop, a MapReduce job is a unit work that user wants to be performed consisting of the input data, the MapReduce program and the configuration details. Hadoop performs the job by dividing the job into tasks, which can be classified as map tasks or reduce tasks. There are two types of nodes that control the job execution process: a job tracker and a number of task tracker. Task trackers run tasks and send progress reports to the JobTracker, which maintains the overall progress of each job.

Hadoop divides the input to a MapReduce job into fixed-size pieces called splits or chunks. Hadoop creates one map task for each split, on which the user defined map function is run for each record in the split.The advantage of having many splits is that the time taken to process each split is much smaller compared to the time to process the whole input. So if the splits are processed in parallel, the processing is better load-balanced if the splits are small, since a faster machine will be able to process proportionally more splits over the course of the job than a slower machine. Even if the machines are identical, failed processes or other jobs running concurrently make load balancing desirable, and the quality of the load balancing increases as the splits become more fine-grained. On the other hand, if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files), or specified when each file is created. Sometimes, all three (default number of replicas of a split) nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks so the job scheduler has to look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.

Instead of HDFS , map tasks write their output to local disk because map output is intermediate output. It is further processed by the reduce tasks to generate the final output, and once the job is completed the output of map tasks can be discarded. Thus storing it in HDFS , with replication, would be an unused overhead. If due to some reasons the node that is running the map task got fail before the output is been consumed by the reduce task, then Hadoop automatically reruns the map task on some other available node in the cluster to generate the map output.

Reduce tasks don’t have advantage of data locality. The input to single reduce task is normally the output from multiple mappers. The output of the reduce is normally stored in HDFS for reliability. For each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on nodes on the other racks. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.

The number of reduce tasks is not governed by the size of the input, but is specified independently. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner works very well.

As mentioned earlier, Hadoop cluster is consists of two types of nodes: NameNodes and DataNodes. Usually there is one NameNode and several DataNodes spanning across multiple racks. The following section gives an overview of their role. disk, and serving them to clients. It manages its locally connected storage and is responsible for serving read and write requests and managing creation, deletion and replication with the supervision of the NameNode. NameNode instructs the nodes on what to do for keeping the file system coherent to the desired constraints. DataNode stores two files for each block on its native local file system, that has been configured for HDFS : one contains the data itself and the second contains block metadata, checksum and block’s generation stamp. In a cluster there are many such DataNodes, usually one for each member node. This permits to use all the local space available in all the nodes of the Hadoop cluster. Each DataNode joins the cluster register- ing to the NameNode the first time. After the registration, it is allocated a unique permanent id. A block report is sent to the NamNode once the stored metadata is analyzed. When online, DataNode periodically sends a heartbeat signal to NameNode. If NameNode gets no updates from a DataNode within a stipulated time, considers it failed and start actions for preserving the service level of each block that were contained in the failed DataNodes.

NameNode

The role of NameNode is to keep the file system namespace in the cluster. It contains the file and directory hierarchy and the locations of file, stored in tree structure. Since in HDFS files are divided into blocks and these are replicated multiple times, NameNode contains also the list of DataNodes containing each single block. Similar to classic file systems, HDFS files are represented by nodes that store permissions and other metadata such as access times and namespace quota.

Another relevant task of the NameNode is to keep a track of the state of HDFS cluster, verifying if the file system is consistent, receiving the heartbeats of the DataNodes, checking which blocks need to be replicated and initiating replication when as and when needed.

There is only one NameNode in Hadoop and data never flow through it. This is however a single point of failure. Thus, NameNode should be never overloaded and should be the most reliable node of the cluster as without NameNode, HDFS is totally unserviceable. In recent releases of Hadoop, there is also a backup node, which is always up to date with latest namespace status. It receives all the operations done by NameNode and stores them in local memory. This allows to have the latest, up to date, namespace status when the NameNode fails. This node is so called a read only NameNode, since it perform all the operations of the NameNode that don’t require knowledge of the block location (that are known only by NameNode due to block report messages by DataNodes).

DataNode

A DataNode is a member of cluster with the role of containing file blocks on its local disk, and serving them to clients when requested. It manages its locally connected storage and is responsible for serving read and write requests and managing creation, deletion and replication under the supervision of the NameNode. NameNode instructs the data nodes on what to do to keep the file system coherent to the desired constraints. DataNode stores two files for each block on its native local file system, that has been configured for HDFS : one contains the data itself and the second contains block metadata, checksum and block’s generation stamp. In a cluster there are many such DataNodes, usually one for each member node. This permits to use all the local space available in all the nodes of the Hadoop cluster. Each DataNode joins the cluster register- ing to the NameNode the first time. After the registration, it is allocated a unique permanent id. A block report is sent to the NamNode once the stored metadata is analyzed. When online, DataNode periodically sends a heartbeat signal to NameNode. If NameNode gets no updates from a DataNode within a stipulated time, considers it failed and start actions for preserving the service level of each block that were contained in the failed DataNodes.

DataNodes are also contacted directly by clients for reading and writing blocks, so their behavior is critical for the overall performances of the system.

Advertisements

Written by Varun Kumar

Varun works with Microsoft as a Cloud Consultant. He comes with 10+ years of experience into Consultant, Solution Architect, and Delivery Management roles. As a Consultant in Microsoft, his job is to design, develop and deploy enterprise level solutions using Azure, to help organizations to achieve more.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s