The HDFS block placement policy does not take DataNode disk space utilization into account; this avoids concentrating new data on a small subset of DataNodes. As a consequence, data may be placed non-uniformly across the cluster. Imbalance can also arise when new nodes are added to the cluster.
The Hadoop balancer is a tool that balances disk space usage across the cluster. It takes a threshold value as an input parameter, a fraction in the range (0, 1). A cluster is balanced if, for each DataNode, the utilization of the node (the ratio of used space to the total capacity of the node) differs from the utilization of the whole cluster (the ratio of used space to the total capacity of the cluster) by no more than the threshold value.
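The balance criterion above can be sketched in a few lines. This is an illustrative check only; the function and variable names are hypothetical and do not correspond to the Hadoop balancer's actual API.

```python
def is_balanced(node_used, node_capacity, threshold):
    """Return True if every node's utilization is within `threshold`
    of the cluster-wide utilization (hypothetical helper)."""
    cluster_util = sum(node_used) / sum(node_capacity)
    return all(
        abs(used / cap - cluster_util) <= threshold
        for used, cap in zip(node_used, node_capacity)
    )

# Three DataNodes with utilizations 0.5, 0.4, 0.6; cluster utilization 0.5.
used = [100, 80, 120]
cap = [200, 200, 200]
print(is_balanced(used, cap, 0.10))  # True: each node within 0.1 of 0.5
print(is_balanced(used, cap, 0.05))  # False: 0.4 and 0.6 differ by 0.1
```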
The tool runs as an application program that can be invoked by the cluster administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization. A key requirement for the balancer is to maintain data availability: when choosing a replica to move and deciding its destination, the balancer guarantees that the decision reduces neither the number of replicas of a block nor the number of racks the block is stored on.
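A first step of such an iteration is to classify DataNodes against the threshold: nodes above the band are candidate sources, nodes below it are candidate targets. The sketch below illustrates only this classification step, under assumed names; it is not the actual Balancer implementation.

```python
def classify_nodes(node_utilization, cluster_utilization, threshold):
    """Split DataNodes into over-utilized sources and under-utilized
    targets relative to cluster-wide utilization (hypothetical helper)."""
    sources = sorted(n for n, u in node_utilization.items()
                     if u > cluster_utilization + threshold)
    targets = sorted(n for n, u in node_utilization.items()
                     if u < cluster_utilization - threshold)
    return sources, targets

# dn1 sits above 0.5 + 0.1, dn3 below 0.5 - 0.1, dn2 is already balanced.
utils = {"dn1": 0.90, "dn2": 0.50, "dn3": 0.10}
print(classify_nodes(utils, 0.50, 0.10))  # (['dn1'], ['dn3'])
```

The real balancer would then repeat this classification after each round of moves until every node falls inside the threshold band.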
The balancing tool also optimizes the balancing process by minimizing inter-rack data copying. For example, if the balancer decides that a replica X needs to be moved to a different rack and the destination rack happens to already hold a replica Y of the same block, the data will be copied from replica Y instead of replica X.
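The copy-source choice described above can be sketched as follows. The replica records and function name are hypothetical, assuming each replica knows which rack it resides on.

```python
def pick_copy_source(replicas, moving_replica, dest_rack):
    """Prefer a replica already on the destination rack as the copy
    source (an intra-rack transfer); otherwise copy from the replica
    being moved (hypothetical helper)."""
    for r in replicas:
        if r is not moving_replica and r["rack"] == dest_rack:
            return r
    return moving_replica

# Replica X (rack r1) is to be moved to rack r2, which holds replica Y.
X = {"name": "X", "rack": "r1"}
Y = {"name": "Y", "rack": "r2"}
src = pick_copy_source([X, Y], moving_replica=X, dest_rack="r2")
print(src["name"])  # Y: data is copied within the destination rack
```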