What is a Data Lake? A data lake is a storage repository that holds a vast amount of raw data in its original format so that analytics and big data analysis can be run on it. A data lake handles the three Vs of big data (Volume,…

HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is assumed to be one. The shorter the distance between two nodes, the greater the bandwidth they can utilize to…
Read More: Performance improvement of map reduce through new Hadoop block placement algorithm
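As a rough illustration of that distance rule, the sketch below computes the hop count between two nodes in a rack topology tree: zero for the same node, two for the same rack, four across racks. The location strings and the class name are hypothetical; the real logic lives in Hadoop's org.apache.hadoop.net.NetworkTopology.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of HDFS-style network distance, assuming node locations
// are path strings such as "/rack1/node1" in a topology tree.
public class NetworkDistanceSketch {

    // Distance = (hops from a up to the closest common ancestor)
    //          + (hops from b up to the same ancestor), one unit per parent hop.
    static int distance(String locationA, String locationB) {
        List<String> a = Arrays.asList(locationA.substring(1).split("/"));
        List<String> b = Arrays.asList(locationB.substring(1).split("/"));
        int common = 0;
        while (common < a.size() && common < b.size()
                && a.get(common).equals(b.get(common))) {
            common++;
        }
        return (a.size() - common) + (b.size() - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/rack1/node1", "/rack1/node1")); // 0: same node
        System.out.println(distance("/rack1/node1", "/rack1/node2")); // 2: same rack
        System.out.println(distance("/rack1/node1", "/rack2/node3")); // 4: different racks
    }
}
```

A shorter distance is treated as a proxy for higher available bandwidth, which is why the block placement algorithm prefers nearby replicas for reads and writes.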
The Hadoop Distributed File System (HDFS), inspired by Google's GFS, is a distributed filesystem that runs on low-cost commodity hardware in a fault-tolerant manner to redundantly store terabyte and larger data sets. The architecture of this filesystem is…
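To make the "distributed filesystem" part concrete, here is a minimal client sketch using Hadoop's FileSystem API: the client asks the NameNode for metadata while the blocks themselves live on DataNodes. The NameNode address is a placeholder; in a real deployment it comes from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of writing and reading a file on HDFS through the standard Java API.
public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally configured in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write: metadata goes to the NameNode, block data to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read the data back through the same API.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```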
While placing new data, the HDFS block placement policy does not take DataNode disk space utilization into account. Hence, this might cause non-uniform data placement across the cluster. Also, unevenness might occur when new nodes are being…
Read More: HDFS Balancer to balance disk space usage on the cluster
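The sketch below illustrates the balancing criterion in simplified form: a DataNode is a candidate source or target for block moves when its disk usage deviates from the cluster average by more than a threshold (10% by default). The node names and usage figures are made up for illustration; this is not the Balancer's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of the HDFS Balancer's over/under-utilization check.
public class BalancerCriterionSketch {

    public static void main(String[] args) {
        double thresholdPercent = 10.0;

        // Per-DataNode disk usage as a percentage of capacity (hypothetical values).
        Map<String, Double> usedPercent = new LinkedHashMap<>();
        usedPercent.put("datanode1", 85.0);
        usedPercent.put("datanode2", 60.0);
        usedPercent.put("datanode3", 35.0);

        double average = usedPercent.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);

        for (Map.Entry<String, Double> e : usedPercent.entrySet()) {
            double deviation = e.getValue() - average;
            String state;
            if (deviation > thresholdPercent) {
                state = "over-utilized (candidate source for block moves)";
            } else if (deviation < -thresholdPercent) {
                state = "under-utilized (candidate target for block moves)";
            } else {
                state = "within threshold";
            }
            System.out.printf("%s: %.1f%% used, average %.1f%% -> %s%n",
                    e.getKey(), e.getValue(), average, state);
        }
    }
}
```

In practice the balancer is started from the command line, e.g. hdfs balancer -threshold 10, and keeps moving blocks until every DataNode falls within the threshold or no more blocks can be moved.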
Pipelining is the technique used by HDFS to minimize inter-node network traffic. Whenever the first block replica is written by a client to a node, it is then the responsibility of that node to write the second replica to a random off-rack…
Read More: HDFS Pipelining to minimize inter-node network traffic
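A toy sketch of that write pipeline is shown below: the client sends each packet of a block only to the first DataNode, and every node forwards it to the next one, so the client's outbound traffic stays at one copy regardless of the replication factor. The node names and rack assignments are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy simulation of an HDFS write pipeline with replication factor 3.
public class WritePipelineSketch {

    // Forward a packet along the pipeline, one hop at a time.
    static void sendPacket(List<String> pipeline, String packet) {
        String sender = "client";
        for (String dataNode : pipeline) {
            System.out.printf("%s -> %s : %s%n", sender, dataNode, packet);
            sender = dataNode; // the receiving node becomes the next sender
        }
        // In real HDFS, acknowledgements travel back up the pipeline in reverse order.
    }

    public static void main(String[] args) {
        // Hypothetical placement: local node, an off-rack node,
        // and a second node on that remote rack.
        List<String> pipeline = Arrays.asList(
                "dn1 (rack1, local)", "dn5 (rack2)", "dn6 (rack2)");

        sendPacket(pipeline, "packet#1 of block_0001");
        sendPacket(pipeline, "packet#2 of block_0001");
    }
}
```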
Apache Hadoop was initially developed by Doug Cutting in 2005 because he needed a faster data processing framework for the web crawler project called Nutch. Based on the MapReduce paper published by Google in 2004, he replaced the…