Common Mistakes: Data Analysis
Overview
Understanding common mistakes in data analysis is just as important as understanding the concepts themselves. The SAT and ACT frequently include answer choices designed to trap students who make these errors. By learning to recognize these pitfalls, you will improve your accuracy and confidence.
Categories of Common Mistakes
- Confusing correlation with causation
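- Misinterpreting statistical measures (for example, confusing r with r-squared)
- Calculation errors with formulas (for example, the wrong sign on a residual or the wrong units)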
- Extrapolation beyond the data range
- Misunderstanding confidence intervals
Common Mistakes Explained
Mistake 1: Correlation Implies Causation
The Error: Concluding that because two variables are correlated, one must cause the other.
Example: "Ice cream sales and drowning deaths are positively correlated. Therefore, eating ice cream causes drowning."
The Truth: Both are caused by a third variable (hot weather). This is called a lurking or confounding variable. Correlation shows association, not causation.
How to Avoid: Watch for causal language such as "causes," "leads to," or "results in." Claims like these require evidence from a controlled experiment, not just a correlation.
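To see how a lurking variable can produce a strong correlation, here is a minimal Python sketch. All of the numbers are made up for illustration, and pearson_r is a helper written for this example, not something from the lesson. Both ice cream sales and drownings are built to rise with temperature, and neither one influences the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up daily data for illustration only (not real measurements).
temperature = [60, 65, 70, 75, 80, 85, 90, 95]              # degrees
ice_cream_sales = [123, 128, 141, 146, 162, 169, 184, 187]  # cones sold
drownings = [2, 3, 3, 4, 5, 5, 6, 7]                        # incidents

r_sales_drownings = pearson_r(ice_cream_sales, drownings)
r_temp_sales = pearson_r(temperature, ice_cream_sales)
r_temp_drownings = pearson_r(temperature, drownings)

print(f"ice cream vs. drownings:   r = {r_sales_drownings:.2f}")
print(f"temperature vs. ice cream: r = {r_temp_sales:.2f}")
print(f"temperature vs. drownings: r = {r_temp_drownings:.2f}")
```

All three correlations come out strong, yet the only thing driving the data is temperature. The high r between ice cream and drownings says nothing about one causing the other.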
Mistake 2: Confusing r and r-squared
The Error: Interpreting r = 0.7 as meaning 70% of variation is explained.
Example: If r = 0.7, some students say "70% of the variation in y is explained by x."
The Truth: r-squared = 0.7 × 0.7 = 0.49, so only 49% of the variation is explained. You must square r to get the coefficient of determination.
How to Avoid: Always square the correlation before discussing "percent of variation explained."
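The squaring step can be checked with a short Python sketch. The function name percent_variation_explained is a hypothetical helper chosen for this example.

```python
def percent_variation_explained(r):
    """Return r-squared as a percentage (the coefficient of determination)."""
    return (r ** 2) * 100

for r in (0.7, 0.6, -0.8):
    pct = percent_variation_explained(r)
    print(f"r = {r:+.2f}  ->  about {pct:.0f}% of the variation explained")
```

Because r is squared, a negative correlation explains just as much variation as a positive one of the same size.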
Mistake 3: Extrapolating Beyond the Data
The Error: Using a regression equation to predict values far outside the range of the original data.
Example: Using a model based on temperatures from 50 to 90 degrees to predict behavior at 20 degrees or 120 degrees.
The Truth: The linear relationship may not hold outside the observed range. The prediction becomes unreliable.
How to Avoid: Check if your x-value falls within the range of the original data before trusting the prediction.
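The range check can be written as a tiny Python sketch. It uses the 50 to 90 degree range from the example above; the function name is_extrapolation is a hypothetical helper for illustration.

```python
def is_extrapolation(x, x_min, x_max):
    """True when x lies outside the range of the data used to fit the model."""
    return x < x_min or x > x_max

# Range of the original data, taken from the temperature example.
X_MIN, X_MAX = 50, 90

for temp in (20, 70, 120):
    if is_extrapolation(temp, X_MIN, X_MAX):
        print(f"x = {temp}: outside {X_MIN} to {X_MAX}, prediction is unreliable")
    else:
        print(f"x = {temp}: inside {X_MIN} to {X_MAX}, prediction is reasonable")
```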
Mistake 4: Wrong Sign on Residuals
The Error: Calculating Predicted minus Actual instead of Actual minus Predicted.
Example: If Actual = 80 and Predicted = 75, computing Predicted minus Actual gives -5. The correct residual is 80 - 75 = +5.
The Truth: Residual = Actual - Predicted. A positive residual means the actual value exceeded the prediction.
How to Avoid: Remember "Actual minus Predicted" and think: "Was the actual higher or lower than expected?"
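As a quick check of the sign convention, here is a short Python sketch using the numbers from the example above. The function name residual is chosen for this illustration.

```python
def residual(actual, predicted):
    """Residual = Actual - Predicted."""
    return actual - predicted

correct = residual(actual=80, predicted=75)  # actual was higher than predicted
wrong = 75 - 80                              # Predicted - Actual: sign is flipped

print(f"correct residual: {correct:+d}")
print(f"common mistake:   {wrong:+d}")
```

Naming the arguments (actual, predicted) in the call makes it harder to swap them by accident.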
Mistake 5: Misinterpreting Confidence Intervals
The Error: Saying "There is a 95% chance the true value is in this interval."
Example: "There is a 95% probability that the true proportion is between 0.45 and 0.55."
The Truth: The true value either is or is not in the interval. The 95% refers to the method: if we repeated this process many times, about 95% of our intervals would contain the true value.
How to Avoid: Say "We are 95% confident" rather than "There is a 95% probability."
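The "repeat the process many times" idea can be simulated. The Python sketch below rests on assumptions made only for illustration: a true proportion of 0.50, polls of 400 people, and the standard large-sample interval (sample proportion plus or minus 1.96 standard errors). In a real study the true proportion is unknown.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_P = 0.50      # assumed true population proportion (unknown in real life)
SAMPLE_SIZE = 400  # people surveyed in each poll
NUM_POLLS = 1000   # how many times the whole poll is repeated
Z_95 = 1.96        # critical value for a 95% interval

def interval_from_poll():
    """Run one simulated poll and return its 95% confidence interval."""
    yes = sum(1 for _ in range(SAMPLE_SIZE) if random.random() < TRUE_P)
    p_hat = yes / SAMPLE_SIZE
    margin = Z_95 * math.sqrt(p_hat * (1 - p_hat) / SAMPLE_SIZE)
    return p_hat - margin, p_hat + margin

hits = 0
for _ in range(NUM_POLLS):
    low, high = interval_from_poll()
    if low <= TRUE_P <= high:
        hits += 1

coverage = hits / NUM_POLLS
print(f"{hits} of {NUM_POLLS} intervals contained the true proportion "
      f"({coverage:.1%})")
```

Each individual interval either contains 0.50 or it does not. The 95% describes how often the method succeeds across many repeated polls.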
Mistake 6: Ignoring Units
The Error: Forgetting to check whether values are in thousands, percentages, or other units.
Example: Using x = 15000 in a regression when the data was measured in thousands (should use x = 15).
The Truth: The input is 1000 times too large, so the slope term is inflated by a factor of 1000 and the prediction is wildly off.
How to Avoid: Always read the problem carefully and note the units of each variable.
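To see the size of a units error, consider the Python sketch below. The regression equation y = 20 + 5x is a hypothetical model invented for this example, with x measured in thousands.

```python
# Hypothetical model for illustration: y = 20 + 5x, with x in thousands.
INTERCEPT = 20
SLOPE = 5

def predict(x_in_thousands):
    """Predicted y for an x-value expressed in thousands."""
    return INTERCEPT + SLOPE * x_in_thousands

correct = predict(15)     # 15 thousand entered as 15
wrong = predict(15000)    # raw value entered without converting units

print(f"correct prediction: {correct}")
print(f"units mistake:      {wrong}")
```

The mistaken prediction is hundreds of times larger than the correct one. When an answer looks wildly out of scale, check the units first.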
Practice: Spot the Error
For each problem, identify the mistake in the reasoning.
Problem 1
Student says: "The correlation between shoe size and reading ability in children is r = 0.75, so larger shoes help children read better."
Problem 2
Student says: "Since r = 0.6, the regression model explains 60% of the variation in the response variable."
Problem 3
The regression equation y = 20 + 5x was developed using x values from 10 to 50. A student uses it to predict y when x = 100 and gets y = 520.
Problem 4
Predicted value = 85, Actual value = 78. Student calculates residual as 85 - 78 = 7.
Problem 5
A 95% confidence interval is (0.42, 0.58). Student says: "There is a 95% chance that the population proportion is in this interval."
Problem 6
The regression equation uses income in thousands of dollars. Student enters x = 45000 instead of x = 45 to predict for someone earning $45,000.
Problem 7
Student says: "The correlation is r = 0.95, which proves that variable X causes changes in variable Y."
Problem 8
A poll shows 52% support with a margin of error of 3%. Student says: "Most people definitely support this because 52% is more than 50%."
Problem 9
Student interprets r = -0.80 as a weak relationship because the number is "less than 1."
Problem 10
Data shows countries with more chocolate consumption have more Nobel Prize winners. Student concludes: "To win more Nobel Prizes, a country should encourage chocolate consumption."
Answer Key
1. Correlation vs. causation error. Age is a lurking variable - older children have larger feet AND better reading ability. Shoes do not cause reading improvement.
2. Confusing r with r-squared. Here r-squared = 0.6 × 0.6 = 0.36, so the model explains only 36% of the variation, not 60%.
3. Extrapolation error. The value x = 100 is far outside the range of the original data (10 to 50), so this prediction is unreliable.
4. Wrong residual formula. Residual = Actual - Predicted = 78 - 85 = -7 (not +7). The actual was 7 below the prediction.
5. Misinterpreting confidence interval. The correct statement is "We are 95% confident that the interval contains the true proportion." The true value is fixed; it is either in or not in this specific interval.
6. Unit error. The student should use x = 45, not 45000. Entering 45000 makes the input 1000 times too large, so the prediction will be wildly incorrect.
7. Correlation vs. causation. Even a very high correlation (0.95) does not prove causation. A controlled experiment is needed to establish causation.
8. Ignoring margin of error. The interval is 52% ± 3%, or (49%, 55%), which includes values below 50%. We cannot say definitively that a majority supports this.
9. Misinterpreting negative correlation. The strength is determined by the absolute value. |r| = 0.80 indicates a strong relationship. The negative sign only indicates direction, not strength.
10. Correlation vs. causation with a lurking variable. Wealthy countries can afford more chocolate AND more research institutions. National wealth is the confounding variable.
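The margin-of-error arithmetic in answer 8 can be checked with a short Python sketch. The function names are hypothetical helpers for illustration, and the 56% case is an invented comparison, not part of the problem set.

```python
def support_interval(estimate, margin):
    """Return (low, high) for a poll result, all values in percent."""
    return estimate - margin, estimate + margin

def majority_is_clear(estimate, margin):
    """True only when the whole interval lies above 50%."""
    low, _ = support_interval(estimate, margin)
    return low > 50

low, high = support_interval(52, 3)
print(f"52% with a 3% margin: interval is ({low}%, {high}%)")
print("clear majority?", majority_is_clear(52, 3))

# Invented comparison: the whole interval now sits above 50%.
print("clear majority at 56%?", majority_is_clear(56, 3))
```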
Next Steps
- Create a personal checklist of mistakes to watch for
- When practicing, ask yourself: "What trap might the test-makers set here?"
- Take the Unit Quiz to test your mastery
- Review this lesson before any standardized test