Grade: 11 | Subject: Mathematics | Unit: Advanced Data Analysis | SAT: Problem Solving and Data Analysis | ACT: Math

Regression Analysis

📖 Learn


Regression analysis is a statistical method used to examine the relationship between one or more independent variables (predictors) and a dependent variable (response). The goal is to find a mathematical model that best fits the data and can be used to make predictions.

Types of Regression

Type        | Equation Form     | Best Used When
Linear      | ŷ = a + bx        | Data shows a straight-line pattern
Quadratic   | ŷ = ax² + bx + c  | Data shows a parabolic curve
Exponential | ŷ = a · bˣ        | Data shows rapid growth or decay
Power       | ŷ = axᵇ           | Data shows varying rates of change
Logarithmic | ŷ = a + b·ln(x)   | Data levels off (diminishing returns)

Linear Regression: The Foundation

Line of Best Fit (Least Squares Regression Line)

The least squares regression line is the line that minimizes the sum of the squared vertical distances (residuals) between the data points and the line.

ŷ = a + bx

where:

  • ŷ (y-hat) = predicted value of y
  • a = y-intercept (value of ŷ when x = 0)
  • b = slope (change in ŷ for each unit increase in x)
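The "least squares" idea can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the data values are made up, and the helper names `sum_squared_residuals` and `least_squares_line` are hypothetical. The slope is computed from deviations about the means, which is algebraically equivalent to the formula b = r · (sᵧ / sₓ).

```python
def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances between the points and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)

# Shifting the line up by 0.5 gives a larger sum of squared residuals.
shifted = sum_squared_residuals(xs, ys, a + 0.5, b)
print(best < shifted)  # True
```

Any other line you try (a different slope, a different intercept, or both) will also give a sum at least as large as `best`.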

Calculating the Regression Coefficients

Slope Formula

b = r · (sᵧ / sₓ)

where r is the correlation coefficient, sᵧ is the standard deviation of y, and sₓ is the standard deviation of x.

Y-Intercept Formula

a = ȳ - b · x̄

where x̄ is the mean of x and ȳ is the mean of y.
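As a quick sketch, the two formulas translate directly into code. The function name `regression_coefficients` and the summary statistics below are hypothetical, chosen only to show the arithmetic.

```python
def regression_coefficients(x_bar, y_bar, s_x, s_y, r):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    b = r * (s_y / s_x)    # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b


# Hypothetical summary statistics, for illustration only.
a, b = regression_coefficients(x_bar=4, y_bar=20, s_x=2, s_y=6, r=0.5)
print(f"y-hat = {a} + {b}x")  # y-hat = 14.0 + 1.5x
```

Note the order: the slope must be found first, because the intercept formula uses it.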

Correlation Coefficient (r)

Correlation Coefficient

The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables.

  • r = 1: Perfect positive linear relationship
  • r = -1: Perfect negative linear relationship
  • r = 0: No linear relationship
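  • |r| ≥ 0.8: Strong correlation
  • 0.5 ≤ |r| < 0.8: Moderate correlation
  • |r| < 0.5: Weak correlation

These cutoffs are common rules of thumb, not strict definitions; what counts as "strong" can vary by field.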
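The sketch below shows one way r could be computed from raw data and then labeled using the cutoffs listed here. The data are made up, the function names are hypothetical, and placing the boundary values 0.8 and 0.5 in the stronger category is an assumption.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe_strength(r):
    """Label |r| using the rule-of-thumb cutoffs (boundary handling assumed)."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate"
    return "weak"


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 3), describe_strength(r))  # 0.775 moderate
```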

Coefficient of Determination (r²)

R-Squared

r² (R-squared) represents the proportion of the variance in y that is explained by the regression model.

For example, if r² = 0.85, then 85% of the variation in y can be explained by the linear relationship with x.
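To see where "proportion of variance explained" comes from, here is a minimal sketch that computes r² as 1 minus (unexplained variation) / (total variation). The data are made up, `r_squared` is a hypothetical helper name, and the line ŷ = 2.2 + 0.6x is the least squares line for this particular data set, worked out separately.

```python
def r_squared(ys, predictions):
    """1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least squares line for this data set (computed separately).
a, b = 2.2, 0.6
predictions = [a + b * x for x in xs]

print(round(r_squared(ys, predictions), 2))  # 0.6
```

For this data r ≈ 0.775, and 0.775² ≈ 0.6, so squaring the correlation coefficient gives the same value.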

Residuals and Residual Plots

Residual

Residual = Observed value - Predicted value = y - ŷ

A residual plot shows residuals vs. x-values. If the regression model is appropriate:

  • Residuals should be randomly scattered around zero
  • No clear patterns should be visible
  • Spread should be roughly constant (homoscedasticity)

Warning: Patterns in Residual Plots

Curved pattern: A linear model is not appropriate; try quadratic or exponential.

Fan shape: Variance is not constant; transformation may be needed.

Outliers: Points whose residuals are far from zero fit the model poorly and may be influential, so examine them before trusting the line.
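A small sketch of how a curved pattern shows up in the residuals. The data below are hypothetical (exactly y = x²), and ŷ = -7 + 6x is the least squares line for those five points, worked out separately; `residuals` is a hypothetical helper name.

```python
def residuals(xs, ys, a, b):
    """Observed minus predicted for each point, using y-hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical curved data (y = x squared), for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

# Least squares line for these points (computed separately).
a, b = -7.0, 6.0

res = residuals(xs, ys, a, b)
print(res)       # [2.0, -1.0, -2.0, -1.0, 2.0]
print(sum(res))  # 0.0
```

The residuals go positive, negative, then positive again as x increases. Plotted against x they form a U shape, which signals that a straight line is the wrong model even though the residuals sum to zero.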

Interpreting Regression in Context

When interpreting regression results:

  • Slope interpretation: "For each additional [unit of x], the predicted [y] increases/decreases by [b] units."
  • Intercept interpretation: "When [x] is 0, the predicted [y] is [a]." (Only meaningful if x = 0 is in the data range)
  • Interpolation vs. Extrapolation: Predictions within the data range are more reliable than those outside it.
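One way to keep interpolation and extrapolation straight is to check every prediction against the range of the observed x-values. The sketch below assumes a hypothetical line ŷ = 10 + 2x fitted to data with x between 1 and 8; the function name `predict` is also hypothetical.

```python
def predict(a, b, x, x_min, x_max):
    """Return (prediction, is_extrapolation) for y-hat = a + b*x."""
    y_hat = a + b * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation


# Hypothetical model and data range, for illustration only.
print(predict(10, 2, 5, x_min=1, x_max=8))   # (20, False)
print(predict(10, 2, 20, x_min=1, x_max=8))  # (50, True)
```

The second prediction is still computed, but the flag is a reminder that the model has no data to support it.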

💡 Examples

Example 1: Finding the Regression Equation

Problem: Given the following statistics for study hours (x) and test scores (y): x̄ = 5, ȳ = 78, sₓ = 2, sᵧ = 10, r = 0.9. Find the regression equation.

Step 1: Calculate the slope b:

b = r · (sᵧ / sₓ) = 0.9 · (10/2) = 0.9 · 5 = 4.5

Step 2: Calculate the y-intercept a:

a = ȳ - b · x̄ = 78 - 4.5 · 5 = 78 - 22.5 = 55.5

Step 3: Write the equation:

ŷ = 55.5 + 4.5x

Answer: The regression equation is ŷ = 55.5 + 4.5x

Example 2: Making Predictions

Problem: Using the equation ŷ = 55.5 + 4.5x from Example 1, predict the test score for a student who studies 7 hours. Also calculate the residual if the student actually scored 85.

Prediction:

ŷ = 55.5 + 4.5(7) = 55.5 + 31.5 = 87

Residual:

Residual = y - ŷ = 85 - 87 = -2

Answer: Predicted score is 87; residual is -2 (the student scored 2 points below predicted).
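The arithmetic in Examples 1 and 2 can be checked with a short script. This is only a verification sketch; the variable names are chosen for illustration.

```python
# Summary statistics from Example 1.
x_bar, y_bar = 5, 78
s_x, s_y = 2, 10
r = 0.9

b = r * (s_y / s_x)    # slope
a = y_bar - b * x_bar  # y-intercept

# Example 2: a student studies 7 hours and actually scores 85.
hours = 7
actual = 85
predicted = a + b * hours
residual = actual - predicted

print(predicted, residual)  # 87.0 -2.0
```

A negative residual means the observed value fell below the regression line.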

Example 3: Interpreting r²

Problem: A regression model for predicting house prices from square footage has r² = 0.72. Interpret this value.

Interpretation:

72% of the variation in house prices can be explained by the linear relationship with square footage.

The remaining 28% of variation is due to other factors not included in the model (location, age, features, etc.).

Example 4: Interpreting Slope in Context

Problem: A regression equation relating advertising spending (in thousands of dollars) to sales (in units) is ŷ = 120 + 8.5x. Interpret the slope.

Interpretation:

For each additional $1,000 spent on advertising, the predicted number of units sold increases by 8.5 units.

Note: This describes association, not necessarily causation. Other factors may influence sales.

Example 5: Choosing the Best Model

Problem: For the same data set, linear regression gives r² = 0.75 and exponential regression gives r² = 0.94. Which model is better?

Analysis:

The exponential model has a higher r² (0.94 vs 0.75), meaning it explains more of the variation in the data.

However, before choosing, also check:

  • The residual plots for both models
  • Whether the relationship makes sense in context
  • Whether the pattern in a scatterplot appears exponential

Answer: Based on r² alone, the exponential model is better, but residual analysis should confirm this choice.
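A comparison like this can be sketched in code. The data below are made up to grow roughly exponentially, and the helper names are hypothetical. The exponential model ŷ = a · bˣ is fitted here by running a linear regression on ln(y), which is one common approach; both r² values are then computed on the original y scale, which is an assumption and may differ slightly from the r² a calculator reports for its exponential fit.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(ys, predictions):
    """1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.0, 4.5, 6.8, 10.1, 15.2, 22.8]

# Linear model: y-hat = a + b*x
a_lin, b_lin = fit_line(xs, ys)
lin_pred = [a_lin + b_lin * x for x in xs]

# Exponential model: ln(y) = ln(a) + x*ln(b), so fit a line to (x, ln y).
c0, c1 = fit_line(xs, [math.log(y) for y in ys])
a_exp, b_exp = math.exp(c0), math.exp(c1)
exp_pred = [a_exp * b_exp ** x for x in xs]

r2_lin = r_squared(ys, lin_pred)
r2_exp = r_squared(ys, exp_pred)
print(f"linear r^2 = {r2_lin:.3f}, exponential r^2 = {r2_exp:.3f}")
```

For this data the exponential model has the higher r², but the residuals of each model should still be inspected before settling on one.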

✏️ Practice

Apply your regression analysis skills to these problems.

Problem 1: Given x̄ = 10, ȳ = 25, sₓ = 3, sᵧ = 6, and r = 0.8, find the regression equation ŷ = a + bx.

Problem 2: A regression equation is ŷ = 12 + 2.5x. If x = 8 and the actual y value is 35, find the residual.

Problem 3: For a data set, r = -0.85. (a) Describe the relationship. (b) Calculate r². (c) Interpret r² in context if x represents temperature and y represents heating costs.

Problem 4: A regression model predicts that ŷ = 50 - 3x. The data range for x is 2 to 15. (a) Should you use this model to predict y when x = 10? (b) When x = 25?

Problem 5: The equation for predicting GPA (y) from hours of sleep per night (x) is ŷ = 1.5 + 0.25x with r² = 0.64. Interpret the slope and r² in context.

Problem 6: A residual plot shows a clear U-shaped pattern. What does this indicate about the linear regression model?

Problem 7: Two students fit different models: Student A gets ŷ = 3 + 2x with r² = 0.89; Student B gets ŷ = 5 + 1.8x with r² = 0.89. Can both be correct? Explain.

Problem 8: For the data (1, 3), (2, 5), (3, 8), (4, 10), (5, 13), use the line ŷ = 0.5 + 2.5x (an approximate fit; the exact least-squares line is ŷ = 0.3 + 2.5x) to calculate the predicted value and residual at x = 3.

Problem 9: A scientist finds r = 0.95 between ice cream sales and drowning deaths. Should we conclude that eating ice cream causes drowning? Explain.

Problem 10: If removing one data point changes r from 0.3 to 0.9, what might that point represent?

Answers
  1. b = 0.8(6/3) = 1.6; a = 25 - 1.6(10) = 9; ŷ = 9 + 1.6x
  2. ŷ = 12 + 2.5(8) = 32; Residual = 35 - 32 = 3
  3. (a) Strong negative linear relationship: as temperature increases, heating costs tend to decrease (b) r² = (-0.85)² = 0.7225 (c) 72.25% of the variation in heating costs is explained by the linear relationship with temperature
  4. (a) Yes, x = 10 is within the data range (interpolation) (b) No, x = 25 is outside the range (extrapolation is unreliable)
  5. Slope: For each additional hour of sleep per night, predicted GPA increases by 0.25 points. r²: 64% of the variation in GPA is explained by the linear relationship with hours of sleep.
  6. The U-shaped pattern indicates a linear model is not appropriate; a quadratic model should be considered.
  7. Not for the same data set. The least squares regression line for a given data set is unique, so at most one of these equations can be that line. Both results are possible only if the students used different data sets.
  8. ŷ = 0.5 + 2.5(3) = 8; Residual = 8 - 8 = 0
  9. No. Correlation does not imply causation. Both variables are likely related to a third variable (hot weather/summer).
  10. The removed point was likely an influential outlier that was masking the true relationship in the data.

✅ Check Your Understanding

Question 1: What is the difference between r and r²? What does each measure?

Answer:

r (correlation coefficient) measures the strength AND direction of a linear relationship. It ranges from -1 to 1, where the sign indicates direction (positive or negative slope).

r² (coefficient of determination) measures only the strength of the relationship, as the proportion of variance explained. It ranges from 0 to 1 and is never negative, so it carries no information about direction.

Question 2: Why is extrapolation potentially dangerous when using regression models?

Answer:

Extrapolation (predicting beyond the range of the data) is risky because the relationship observed within the data range may not continue outside that range. The pattern could change, level off, or reverse. The model has no information about behavior outside the observed data, making predictions unreliable and potentially misleading.

Question 3: How do you use a residual plot to determine if a linear model is appropriate?

Answer:

A linear model is appropriate if the residual plot shows:

  • Random scatter of points above and below zero (no pattern)
  • Roughly constant spread (no fan or funnel shape)
  • No obvious outliers or influential points

If you see curves, patterns, or uneven spread, a different model (quadratic, exponential, etc.) may be more appropriate.

Question 4: Explain why "correlation does not imply causation."

Answer:

A strong correlation between two variables shows they are related, but it doesn't prove one causes the other. There are several alternatives:

  • Lurking variable: A third variable may cause both
  • Reverse causation: The effect may cause the supposed cause
  • Coincidence: Especially with small samples or many comparisons

To establish causation, you typically need controlled experiments or very strong theoretical and empirical evidence.

🚀 Next Steps

  • Review any concepts that felt challenging
  • Move on to the next lesson when ready
  • Return to practice problems periodically for review