One important aspect of data analytics is the relationship between models and data. Think of data as the input to a model, which generates outputs (predictions, trends, etc.). Most articles in the data science community revolve around models, or the algorithms that implement them (random forests, deep learning, etc.). However, there are countless examples of good models applied to bad data, resulting in bad inference.
The 1936 Election – A Polling Catastrophe
By the United States presidential election of 1936, the Great Depression was seven years old, and the incumbent President, Franklin D. Roosevelt, had taken bold steps, including his “New Deal.” The “New Deal” included many programs designed to assist Americans struggling under the Depression, arguably at the expense of those who were doing better financially.
The Literary Digest, an influential weekly magazine of the time, had begun political polling. For the 1936 election, it polled a sample of over 2 million people drawn from telephone and car registrations. The results predicted that Roosevelt's Republican challenger, Alf Landon, would win in a landslide with over 57% of the popular vote.
However, there was a problem with the sample frame. During the Depression, not everyone could afford a car or a telephone. Those who could were usually wealthier, and therefore less likely to be directly helped by “New Deal” programs. As a result, this group was more likely to disapprove of Roosevelt than the general population.
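The effect of a skewed sample frame can be illustrated with a small simulation. The sketch below is not a reconstruction of the Digest's method: the ownership rate and the support levels in each group are hypothetical numbers chosen only to make the mechanism visible, and `simulate_poll` is a name invented for this example. It compares a large sample drawn only from car or telephone owners with a much smaller sample drawn at random from the whole electorate.

```python
import random


def simulate_poll(population_size=200_000, frame_sample_size=20_000,
                  random_sample_size=1_000, seed=0):
    """Compare a large sample from a biased frame with a small random sample."""
    rng = random.Random(seed)

    # Hypothetical parameters, for illustration only (not historical estimates).
    p_owner = 0.30          # share of voters who own a car or telephone
    p_fdr_owner = 0.40      # Roosevelt support among owners
    p_fdr_non_owner = 0.70  # Roosevelt support among everyone else

    # Each voter is a pair: (is_owner, votes_for_roosevelt).
    population = []
    for _ in range(population_size):
        owner = rng.random() < p_owner
        support = p_fdr_owner if owner else p_fdr_non_owner
        population.append((owner, rng.random() < support))

    true_share = sum(vote for _, vote in population) / population_size

    # Biased frame: only owners can be reached, however many of them are asked.
    frame = [vote for owner, vote in population if owner]
    biased_sample = rng.sample(frame, frame_sample_size)
    biased_share = sum(biased_sample) / frame_sample_size

    # Small simple random sample drawn from the whole population.
    random_sample = rng.sample(population, random_sample_size)
    random_share = sum(vote for _, vote in random_sample) / random_sample_size

    return true_share, biased_share, random_share


true_share, biased_share, random_share = simulate_poll()
print(f"True Roosevelt share:          {true_share:.1%}")
print(f"Biased frame, 20,000 sampled:  {biased_share:.1%}")
print(f"Random sample, 1,000 sampled:  {random_share:.1%}")
```

Under these assumed numbers, the owner-only sample settles near the owners' own level of support, far from the population value, while the small random sample lands close to it. Making the biased sample larger only produces a more precise estimate of the wrong group.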
Discrepancy in Inference
The Prediction: Landon in a Landslide
|Landon 57.1%, Roosevelt 42.9%|
Instead, the actual results gave a very different picture.
Actual Result: Roosevelt Runs Away With It
|Roosevelt 60.8%, Landon 36.5%|
An incorrect sample frame can destroy a study, regardless of the sample size. The Literary Digest surveyed over 2 million people (a typical political survey today has between 500 and 1,000 respondents), yet it missed about as badly as possible.
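To put the sample size in perspective, the short calculation below uses the standard 95% margin-of-error formula for a proportion, z * sqrt(p(1 - p) / n), with the worst case p = 0.5. This formula assumes a simple random sample, which the Digest's sample was not, so the figure for 2 million respondents is only what that sample size would have delivered had the frame been sound. The function name `margin_of_error` is introduced here for illustration.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)


# Sampling error shrinks as the sample grows.
for n in (500, 1_000, 2_000_000):
    print(f"n = {n:>9,}: +/- {100 * margin_of_error(n):.2f} percentage points")

# The Digest's actual miss on Landon's vote share, from the figures quoted above.
landon_miss = 57.1 - 36.5
print(f"Miss on Landon's share: {landon_miss:.1f} percentage points")
```

With 1,000 respondents the margin is roughly 3 percentage points; with 2 million it falls below a tenth of a point. The Digest's estimate of Landon's share was off by about 20.6 points, which no amount of sampling error can explain. The error came from bias in the frame, and bias does not shrink as n grows.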