What is a Data Lake? A data lake is a storage repository that holds a vast amount of raw data in its original format so that analytics and big data analysis can be run on it. A data lake handles the three Vs of big data (Volume,…

HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is assumed to be one. The shorter the distance between two nodes, the greater the bandwidth they can utilize to…
Read More: Performance improvement of map reduce through new Hadoop block placement algorithm
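As a rough illustration of that distance rule, the sketch below computes the hop count between two nodes in a rack topology tree: zero for the same node, two for the same rack, four across racks. The location strings and the class name are hypothetical; the real logic lives in Hadoop's org.apache.hadoop.net.NetworkTopology.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of HDFS-style network distance, assuming node locations
// are path strings such as "/rack1/node1" in a topology tree.
public class NetworkDistanceSketch {

    // Distance = (hops from a up to the closest common ancestor)
    //          + (hops from b up to the same ancestor), one unit per parent hop.
    static int distance(String locationA, String locationB) {
        List<String> a = Arrays.asList(locationA.substring(1).split("/"));
        List<String> b = Arrays.asList(locationB.substring(1).split("/"));
        int common = 0;
        while (common < a.size() && common < b.size()
                && a.get(common).equals(b.get(common))) {
            common++;
        }
        return (a.size() - common) + (b.size() - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/rack1/node1", "/rack1/node1")); // 0: same node
        System.out.println(distance("/rack1/node1", "/rack1/node2")); // 2: same rack
        System.out.println(distance("/rack1/node1", "/rack2/node3")); // 4: different racks
    }
}
```

A shorter distance is treated as a proxy for higher available bandwidth, which is why the block placement algorithm prefers nearby replicas for reads and writes.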
The Hadoop Distributed File System (HDFS), inspired by Google's GFS, is a distributed filesystem that runs on low-cost commodity hardware in a fault-tolerant manner to redundantly store terabyte and larger data sets. The architecture of this filesystem is…
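To make the "distributed filesystem" part concrete, here is a minimal client sketch using Hadoop's FileSystem API: the client asks the NameNode for metadata while the blocks themselves live on DataNodes. The NameNode address is a placeholder; in a real deployment it comes from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of writing and reading a file on HDFS through the standard Java API.
public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally configured in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write: metadata goes to the NameNode, block data to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read the data back through the same API.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```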
While placing new data, the HDFS block placement policy does not take DataNode disk space utilization into account. Hence, this might cause non-uniform data placement across the cluster. Also, unevenness might occur when new nodes are being…
Read More: HDFS Balancer to balance disk space usage on the cluster
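The sketch below illustrates the balancing criterion in simplified form: a DataNode is a candidate source or target for block moves when its disk usage deviates from the cluster average by more than a threshold (10% by default). The node names and usage figures are made up for illustration; this is not the Balancer's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of the HDFS Balancer's over/under-utilization check.
public class BalancerCriterionSketch {

    public static void main(String[] args) {
        double thresholdPercent = 10.0;

        // Per-DataNode disk usage as a percentage of capacity (hypothetical values).
        Map<String, Double> usedPercent = new LinkedHashMap<>();
        usedPercent.put("datanode1", 85.0);
        usedPercent.put("datanode2", 60.0);
        usedPercent.put("datanode3", 35.0);

        double average = usedPercent.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);

        for (Map.Entry<String, Double> e : usedPercent.entrySet()) {
            double deviation = e.getValue() - average;
            String state;
            if (deviation > thresholdPercent) {
                state = "over-utilized (candidate source for block moves)";
            } else if (deviation < -thresholdPercent) {
                state = "under-utilized (candidate target for block moves)";
            } else {
                state = "within threshold";
            }
            System.out.printf("%s: %.1f%% used, average %.1f%% -> %s%n",
                    e.getKey(), e.getValue(), average, state);
        }
    }
}
```

In practice the balancer is started from the command line, e.g. hdfs balancer -threshold 10, and keeps moving blocks until every DataNode falls within the threshold or no more blocks can be moved.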
Pipelining is the technique used by HDFS to minimize inter-node network traffic. Whenever the first block replica is written by a client to a node, it is then the responsibility of that node to write the second replica to a random off-rack…
Read More: HDFS Pipelining to minimize inter-node network traffic
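A toy sketch of that write pipeline is shown below: the client sends each packet of a block only to the first DataNode, and every node forwards it to the next one, so the client's outbound traffic stays at one copy regardless of the replication factor. The node names and rack assignments are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy simulation of an HDFS write pipeline with replication factor 3.
public class WritePipelineSketch {

    // Forward a packet along the pipeline, one hop at a time.
    static void sendPacket(List<String> pipeline, String packet) {
        String sender = "client";
        for (String dataNode : pipeline) {
            System.out.printf("%s -> %s : %s%n", sender, dataNode, packet);
            sender = dataNode; // the receiving node becomes the next sender
        }
        // In real HDFS, acknowledgements travel back up the pipeline in reverse order.
    }

    public static void main(String[] args) {
        // Hypothetical placement: local node, an off-rack node,
        // and a second node on that remote rack.
        List<String> pipeline = Arrays.asList(
                "dn1 (rack1, local)", "dn5 (rack2)", "dn6 (rack2)");

        sendPacket(pipeline, "packet#1 of block_0001");
        sendPacket(pipeline, "packet#2 of block_0001");
    }
}
```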
Apache Hadoop was initially developed by Doug Cutting in 2005 because he needed a faster data processing framework for the web crawler project called Nutch. Based on the MapReduce paper published by Google in 2004, he replaced the…