Common Statistical Terminology with Applications

A lack of communication between statisticians and managers is a major roadblock to using statistics. Below is some statistical terminology to make that communication more precise.

Population: A population is any entire collection of people, animals, plants or things on which we may collect data. It is the entire group of interest, which we wish to describe or about which we wish to draw conclusions. In the above figure, the lifetimes of the light bulbs manufactured by, say, GE form the population of concern.

Sample: A subset of a population or universe.

Variable: A characteristic or phenomenon that may take different values, such as weight or gender, since these differ from individual to individual.

Qualitative and Quantitative Variables: Any object or event that can vary in successive observations, either in quantity or quality, is called a “variable.” Variables are classified accordingly as quantitative or qualitative. A qualitative variable, unlike a quantitative variable, does not vary in magnitude in successive observations. The values of quantitative and qualitative variables are called “Variates” and “Attributes”, respectively.

Randomness: Randomness means unpredictability. The fascinating fact about inferential statistics is that, although each random observation may not be predictable when taken alone, collectively the observations follow a predictable pattern called their distribution function. For example, the distribution of a sample average is approximately normal when the sample size is large enough (a common rule of thumb is a sample size over 30). In other words, an extreme value of the sample mean is less likely than an extreme value among a few raw data points.
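The claim that sample averages are less extreme than raw observations can be checked with a small simulation. The sketch below is only an illustration: the exponential population, the sample size of 30, the number of repetitions and the function name simulate_sample_means are all assumptions chosen for the example, not part of the original text.

```python
import random
import statistics

def simulate_sample_means(sample_size=30, num_samples=2000, seed=0):
    """Draw many samples from a skewed (exponential) population
    and return the mean of each sample."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

# Means of 2000 samples, each of size 30 (assumed values).
means = simulate_sample_means()

# For comparison: 2000 raw observations from the same population.
raw_rng = random.Random(1)
raw = [raw_rng.expovariate(1.0) for _ in range(2000)]

spread_of_means = statistics.stdev(means)
spread_of_raw = statistics.stdev(raw)

print("spread of raw observations:", round(spread_of_raw, 3))
print("spread of sample means:    ", round(spread_of_means, 3))
```

The sample means cluster much more tightly around the population mean than the raw observations do, even though the underlying population is far from normal.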

An Experiment: An experiment is a process whose outcome is not known in advance with certainty.

Statistical Experiment: An experiment in general is an operation in which one chooses the values of some variables and measures the values of other variables, as in physics. A statistical experiment, in contrast, is an operation in which one takes a random sample from a population and infers the values of some variables. For example, in a survey, we “survey”, i.e. “look at”, the situation without aiming to change it, such as in a survey of political opinions. A random sample from the relevant population provides information about the voting intentions.

In order to make any generalization about a population, a random sample that is meant to be representative of the entire population is often studied. For each population, there are many possible samples. A sample statistic gives information about a corresponding population parameter. For example, the sample mean for a set of data would give information about the overall population mean µ.

It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.

Example: The population for a study of infant health might be all children born in the U.S.A. in the 1980s. The sample might be all babies born on the 7th of May in any of those years.

An experiment is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.

Example: Before introducing a new drug treatment to reduce high blood pressure, the manufacturer carries out an experiment to compare the effectiveness of the new drug with that of one currently prescribed. Newly diagnosed subjects are recruited from a group of local general practices. Half of them are chosen at random to receive the new drug, the remainder receives the present one. So, the researcher has control over the subjects recruited and the way in which they are allocated to treatment.
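The random allocation step in this example can be sketched in a few lines. This is a minimal illustration under assumptions: the subject identifiers, the group size of 20 and the function name allocate_at_random are hypothetical and not taken from any real trial.

```python
import random

def allocate_at_random(subject_ids, seed=None):
    """Shuffle the subjects and split them into two halves:
    the first half gets the new drug, the rest the current one."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)  # copy so the input is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical list of 20 newly diagnosed subjects.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
new_drug_group, current_drug_group = allocate_at_random(subjects, seed=42)

print("new drug:    ", sorted(new_drug_group))
print("current drug:", sorted(current_drug_group))
```

Because chance alone decides who goes into which group, systematic differences between the groups (age, severity, and so on) are unlikely to explain any observed difference in outcomes.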

Design of experiments is a key tool for increasing the rate of acquiring new knowledge. Knowledge in turn can be used to gain competitive advantage, shorten the product development cycle, and produce new products and processes which will meet and exceed your customer’s expectations.

Primary data and Secondary data sets: If the data are from a planned experiment relevant to the objective(s) of the statistical investigation, collected by the analyst, it is called a Primary Data set. However, if some condensed records are given to the analyst, it is called a Secondary Data set.

Random Variable: A random variable is a real function (yes, it is called a “variable”, but in reality it is a function) that assigns a numerical value to each simple event. For example, in sampling for quality control an item could be defective or non-defective; therefore, one may assign X = 1 and X = 0 for a defective and a non-defective item, respectively. You may assign any other two distinct real numbers, as you wish; however, non-negative integer random variables are easy to work with. Random variables are needed since one cannot do arithmetic operations on words; the random variable enables us to compute statistics, such as the average and variance. Any random variable has a distribution of probabilities associated with it.
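The quality-control example can be written out directly: the random variable is literally a function from outcomes to numbers. The inspection results below are made-up data for illustration, and the function name defect_indicator is hypothetical.

```python
import statistics

def defect_indicator(outcome):
    """Random variable X: maps an outcome (a word) to a number."""
    return 1 if outcome == "defective" else 0

# Hypothetical inspection results for eight items.
inspected_items = [
    "non-defective", "defective", "non-defective", "non-defective",
    "defective", "non-defective", "non-defective", "non-defective",
]

# Once outcomes are numbers, arithmetic becomes possible.
x_values = [defect_indicator(item) for item in inspected_items]
proportion_defective = statistics.fmean(x_values)  # average of X
sample_variance = statistics.variance(x_values)    # divides by n - 1

print("X values:", x_values)
print("proportion defective:", proportion_defective)
print("sample variance:", sample_variance)
```

With the 0/1 coding, the average of X is simply the proportion of defective items, which is one reason this coding is convenient.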

Probability: Probability (i.e., probing for the unknown) is the tool used for anticipating what the distribution of data should look like under a given model. Random phenomena are not haphazard: they display an order that emerges only in the long run and is described by a distribution. The mathematical description of variation is central to statistics. The probability required for statistical inference is not primarily axiomatic or combinatorial, but is oriented toward describing data distributions.

Sampling Unit: A unit is a person, animal, plant or thing which is actually studied by a researcher; the basic objects upon which the study or experiment is executed. For example, a person; a sample of soil; a pot of seedlings; a zip code area; a doctor’s practice.

Parameter: A parameter is an unknown value, and therefore it has to be estimated. Parameters are used to represent a certain population characteristic. For example, the population mean µ is a parameter that is often used to indicate the average value of a quantity.

Within a population, a parameter is a fixed value that does not vary. Each sample drawn from the population has its own value of any statistic that is used to estimate this parameter. For example, the mean of the data in a sample is used to give information about the overall mean µ in the population from which that sample was drawn.

Statistic: A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.

A statistic is a function of an observable random sample. It is therefore an observable random variable. Notice that, while a statistic is a “function” of observations, unfortunately, it is commonly called a random “variable” not a function.

It is possible to draw more than one sample from the same population, and the value of a statistic will in general vary from sample to sample. For example, the average value in a sample is a statistic. The average values in more than one sample, drawn from the same population, will not necessarily be equal.
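A short simulation makes the contrast between a fixed parameter and a varying statistic concrete. The population below is synthetic; its size, mean and spread, and the sample size of 50, are assumptions made only for this sketch.

```python
import random
import statistics

rng = random.Random(11)

# Hypothetical population of 1000 values; its mean is the parameter.
population = [rng.gauss(25, 8) for _ in range(1000)]
population_mean = statistics.fmean(population)

# Draw three independent samples of 50 and compute the statistic for each.
sample_means = []
for _ in range(3):
    sample = rng.sample(population, 50)  # sampling without replacement
    sample_means.append(statistics.fmean(sample))

print("population mean (fixed):", round(population_mean, 2))
for i, m in enumerate(sample_means, start=1):
    print(f"mean of sample {i}:", round(m, 2))
```

The population mean is computed once and never changes, while each sample produces its own, slightly different, sample mean.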

Statistics are often assigned Roman letters (e.g., x̄ and s), whereas the equivalent unknown values in the population (parameters) are assigned Greek letters (e.g., µ and σ).

The word estimate means to esteem, that is giving a value to something. A statistical estimate is an indication of the value of an unknown quantity based on observed data.

More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.

Example: Suppose the manager of a shop wanted to know µ, the mean expenditure of customers in her shop in the last year. She could calculate the average expenditure of the hundreds (or perhaps thousands) of customers who bought goods in her shop; that is, the population mean µ. Instead she could use an estimate of this population mean µ by calculating the mean of a representative sample of customers. If this value were found to be $25, then $25 would be her estimate.

There are two broad subdivisions of statistics: Descriptive Statistics and Inferential Statistics as described below.

Descriptive Statistics: Numerical statistical data should be presented clearly, concisely, and in such a way that the decision maker can quickly obtain the essential characteristics of the data in order to incorporate them into the decision process.

The principal descriptive quantity derived from sample data is the mean (x̄), which is the arithmetic average of the sample data. It serves as the most reliable single measure of the value of a typical member of the sample. If the sample contains a few values that are so large or so small that they have an exaggerated effect on the value of the mean, the sample is more accurately represented by the median, the value where half the sample values fall below and half above.

The quantities most commonly used to measure the dispersion of the values about their mean are the variance s² and its square root, the standard deviation s. The variance is calculated by determining the mean, subtracting it from each of the sample values (yielding the deviations of the sample values), and then averaging the squares of these deviations (for a sample, the sum is conventionally divided by n − 1 rather than n). The mean and standard deviation of the sample are used as estimates of the corresponding characteristics of the entire group from which the sample was drawn. They do not, in general, completely describe the distribution (Fx) of values within either the sample or the parent group; indeed, different distributions may have the same mean and standard deviation. They do, however, provide a complete description of the normal distribution, in which positive and negative deviations from the mean are equally common, and small deviations are much more common than large ones. For a normally distributed set of values, a graph showing the dependence of the frequency of the deviations upon their magnitudes is a bell-shaped curve. About 68 percent of the values will differ from the mean by less than the standard deviation, and almost 100 percent (about 99.7 percent) will differ by less than three times the standard deviation.
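The variance recipe and the 68 percent figure can both be checked numerically. The sketch below uses simulated, normally distributed values; the mean of 100, standard deviation of 15 and sample size of 10,000 are assumptions for illustration. It divides by n − 1, the usual convention for a sample variance, which is a choice made here since the text only speaks of averaging.

```python
import math
import random
import statistics

rng = random.Random(3)
values = [rng.gauss(100, 15) for _ in range(10_000)]  # simulated data

# Step 1: the mean (and, for comparison, the median).
mean = statistics.fmean(values)
median = statistics.median(values)

# Step 2: deviations from the mean, squared.
squared_deviations = [(v - mean) ** 2 for v in values]

# Step 3: average the squared deviations (n - 1 in the denominator),
# then take the square root to get the standard deviation.
variance = sum(squared_deviations) / (len(values) - 1)
std_dev = math.sqrt(variance)

def fraction_within(data, center, radius):
    """Fraction of data values that differ from center by less than radius."""
    return sum(1 for v in data if abs(v - center) < radius) / len(data)

within_one_sd = fraction_within(values, mean, std_dev)
within_three_sd = fraction_within(values, mean, 3 * std_dev)

print("mean:", round(mean, 2), "median:", round(median, 2))
print("variance:", round(variance, 2), "std dev:", round(std_dev, 2))
print("within 1 sd:", within_one_sd, "within 3 sd:", within_three_sd)
```

For data like these, roughly 68 percent of the values fall within one standard deviation of the mean and nearly all fall within three.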

Inferential Statistics: Inferential statistics is concerned with making inferences from samples about the populations from which they have been drawn. In other words, if we find a difference between two samples, we would like to know whether this is a “real” difference (i.e., it is present in the population) or just a “chance” difference (i.e., it could just be the result of random sampling error). That’s what tests of statistical significance are all about. Any conclusion inferred from sample data about the population from which the sample is drawn must be expressed in probabilistic terms. Probability is the language and the measuring tool for uncertainty in our statistical conclusions.

Inferential statistics could be used for explaining a phenomenon or checking for validity of a claim. In these instances, inferential statistics is called Exploratory Data Analysis or Confirmatory Data Analysis, respectively.

Statistical Inference: Statistical inference refers to extending the knowledge obtained from a random sample to the whole population from which it was drawn. This is known in mathematics as Inductive Reasoning, that is, knowledge of the whole from a particular. Its main application is in testing hypotheses about a given population. Statistical inference guides the selection of appropriate statistical models. Models and data interact in statistical work. Inference from data can be thought of as the process of selecting a reasonable model, including a statement in probability language of how confident one can be about the selection.

Normal Distribution Condition: The normal or Gaussian distribution is a continuous symmetric distribution that follows the familiar bell-shaped curve. One of its nice features is that the mean and variance uniquely and independently determine the distribution. It has been noted empirically that many measurement variables have distributions that are at least approximately normal. Even when a distribution is non-normal, the distribution of the mean of many independent observations from the same distribution becomes arbitrarily close to a normal distribution as the number of observations grows large. Many frequently used statistical tests assume that the data come from a normal distribution.

Estimation and Hypothesis Testing: Inference in statistics is of two types. The first is estimation, which involves the determination, with a possible error due to sampling, of the unknown value of a population characteristic, such as the proportion having a specific attribute or the average value µ of some numerical measurement. To express the accuracy of the estimates of population characteristics, one must also compute the standard errors of the estimates. The second type of inference is hypothesis testing. It involves the definition of a hypothesis as one set of possible population values and an alternative as a different set. There are many statistical procedures for determining, on the basis of a sample, whether the true population characteristic belongs to the set of values in the hypothesis or the alternative.
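Both types of inference can be sketched on one sample. The example below is a rough illustration, not a recommended procedure: the data are simulated, the function name estimate_and_test is hypothetical, and the interval and p-value rely on a normal approximation to the sample mean, which is an assumption that is only reasonable for fairly large samples.

```python
import math
import random
import statistics

def estimate_and_test(sample, hypothesized_mean):
    """Return the sample mean, its standard error, an approximate 95%
    interval, and a two-sided p-value (normal approximation assumed)."""
    n = len(sample)
    sample_mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)

    # Estimation: point estimate plus or minus about 1.96 standard errors.
    z_critical = statistics.NormalDist().inv_cdf(0.975)
    interval = (sample_mean - z_critical * standard_error,
                sample_mean + z_critical * standard_error)

    # Hypothesis testing: how surprising is the sample mean if the
    # population mean really equals hypothesized_mean?
    z_score = (sample_mean - hypothesized_mean) / standard_error
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z_score)))
    return sample_mean, standard_error, interval, p_value

# Simulated sample of 100 measurements (assumed values).
rng = random.Random(7)
expenditures = [rng.gauss(25, 8) for _ in range(100)]

sample_mean, standard_error, interval, p_value = estimate_and_test(
    expenditures, hypothesized_mean=25
)
print("estimate:", round(sample_mean, 2))
print("standard error:", round(standard_error, 2))
print("approx. 95% interval:", tuple(round(x, 2) for x in interval))
print("p-value for mean = 25:", round(p_value, 3))
```

The standard error expresses the accuracy of the estimate, while the p-value addresses whether the hypothesized population value is plausible given the sample.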

Statistical inference is grounded in probability, idealized concepts of the group under study, called the population, and the sample. The statistician may view the population as a set of balls from which the sample is selected at random, that is, in such a way that each ball has the same chance as every other one for inclusion in the sample.

 


Written by Varun Kumar

Varun works with Microsoft as a Cloud Consultant. He comes with 10+ years of experience into Consultant, Solution Architect, and Delivery Management roles. As a Consultant in Microsoft, his job is to design, develop and deploy enterprise level solutions using Azure, to help organizations to achieve more.
