In my previous blog post I gave an introduction to Azure Data Lake and its offerings. In this post I am going to show how to create a data lake in Azure using the Azure portal. To create a Data…

What is a data lake? A data lake is a storage repository that holds a vast amount of raw data in its original format so that analytics and big data analysis can be run on it. A data lake handles the three Vs of big data (Volume,…
HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is assumed to be one. A shorter distance between two nodes means that they can utilize greater bandwidth to…
Read More Performance improvement of map reduce through new Hadoop block placement algorithm
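The distance rule in the excerpt above can be sketched in a few lines of Python. This is only an illustration, not Hadoop's own code: the function name `network_distance` and the path strings such as `/dc1/rack1/node1` are assumptions made for the example. It treats the cluster as a tree, counts one hop from each node to its parent, and adds up the hops from both nodes to their closest common ancestor.

```python
def network_distance(path_a: str, path_b: str) -> int:
    """Tree distance between two nodes given as slash-separated paths.

    Each step from a node to its parent counts as 1, so the result is the
    number of hops from both nodes up to their closest common ancestor.
    The path layout (/datacenter/rack/node) is an assumption of this sketch.
    """
    a = [part for part in path_a.split("/") if part]
    b = [part for part in path_b.split("/") if part]

    # Count how many leading path components the two nodes share.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    # Hops from each node up to the shared ancestor.
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    print(network_distance("/dc1/rack1/node1", "/dc1/rack1/node1"))  # same node
    print(network_distance("/dc1/rack1/node1", "/dc1/rack1/node2"))  # same rack
    print(network_distance("/dc1/rack1/node1", "/dc1/rack2/node3"))  # same data center
    print(network_distance("/dc1/rack1/node1", "/dc2/rack3/node4"))  # different data centers
```

Under this sketch, two nodes on the same rack are closer than two nodes on different racks, which matches the idea that a shorter distance implies more usable bandwidth.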
The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value. When a data set has outliers or extreme…
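As a small illustration of the definition above, here is a minimal Python sketch that computes the interquartile range as the upper quartile minus the lower quartile. The helper name `interquartile_range` and the sample values are made up for the example, and the quartile convention (the standard library's `inclusive` method) is an assumption; other conventions can give slightly different quartiles on small samples.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return (lower quartile, upper quartile, IQR) for a sequence of numbers.

    Uses statistics.quantiles with n=4, which returns the three cut points
    Q1, Q2 (median) and Q3. The 'inclusive' method is a choice made for
    this sketch, not the only valid convention.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1


if __name__ == "__main__":
    # Hypothetical sample containing one extreme value (100).
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    q1, q3, iqr = interquartile_range(sample)
    print("lower quartile:", q1)
    print("upper quartile:", q3)
    print("interquartile range:", iqr)
    print("full range:", max(sample) - min(sample))
```

In this made-up sample the full range is dominated by the single extreme value, while the interquartile range only reflects the spread of the middle portion of the data.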
The lack of communication between statisticians and managers is a major roadblock to using statistics. Below is some statistical terminology to ease precise communication. Population: A population is any entire collection of people, animals, plants or things on which…
We can break up any statistical problem into three steps: Data collection and Sampling. Data analysis. Decision making. It is well understood that step 1 typically requires some thought of steps 2 and 3. It is only when you have…
Read More Relevance of decision-based thinking in statistical analysis
One important aspect of data analytics is the relationship between models and data. Think of data as inputs to models, which generate outputs (predictions, trends, etc.). Most of the articles in the data science community revolve around models, or algorithms that implement…