Hypothesis Testing
Learn
Introduction to Hypothesis Testing
Hypothesis testing is a statistical method for making decisions about populations based on sample data. It's used in science, medicine, business, and policy to determine whether observed effects are real or could have occurred by chance.
Hypothesis Testing
Hypothesis testing is a formal procedure that uses sample data to evaluate a claim about a population parameter. We test whether our data provides enough evidence to reject a default assumption.
The Two Hypotheses
Null and Alternative Hypotheses
Null Hypothesis (H₀): The default assumption - usually states "no effect" or "no difference." We assume it is true unless the sample provides strong evidence against it.
Alternative Hypothesis (H₁ or Hₐ): What we're trying to find evidence for - usually states there IS an effect or difference.
H₀: parameter = hypothesized value
H₁: parameter ≠, <, or > hypothesized value
Types of Tests
| Test Type | Alternative Hypothesis | When to Use |
|---|---|---|
| Two-tailed | H₁: μ ≠ μ₀ | Testing for any difference (larger or smaller) |
| Right-tailed (upper) | H₁: μ > μ₀ | Testing if parameter is greater than claimed |
| Left-tailed (lower) | H₁: μ < μ₀ | Testing if parameter is less than claimed |
The Testing Process
Steps of Hypothesis Testing
- State the hypotheses: Write H₀ and H₁ using symbols
- Choose significance level: Select α (commonly 0.05 or 0.01)
- Collect data: Gather a random sample
- Calculate test statistic: Measure how far the sample statistic falls from the value claimed in H₀
- Find p-value: Probability of getting this result if H₀ is true
- Make decision: Compare p-value to α
- State conclusion: In context of the original question (a short code sketch of these steps follows this list)
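To make these steps concrete, here is a minimal sketch in Python using scipy; the hypothesized mean, α, and the sample values are placeholders invented for illustration.

```python
import numpy as np
from scipy import stats

# Steps 1-2: H0: mu = 50, H1: mu != 50, alpha = 0.05 (illustrative choices)
mu0, alpha = 50, 0.05

# Step 3: a hypothetical random sample (placeholder data)
sample = np.array([52.1, 49.8, 53.4, 51.0, 48.7, 54.2, 50.9, 52.6, 49.5, 51.8])

# Steps 4-5: a one-sample t-test returns the test statistic and two-tailed p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Steps 6-7: compare the p-value to alpha and state the conclusion in context
if p_value <= alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject H0 (mean appears to differ from {mu0})")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: fail to reject H0")
```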
Test Statistics
Z-Test (when σ is known or n is large)
z = (x̄ - μ₀) / (σ / √n)
- x̄ = sample mean
- μ₀ = hypothesized population mean
- σ = population standard deviation
- n = sample size
T-Test (when σ is unknown)
t = (x̄ - μ₀) / (s / √n)
- s = sample standard deviation
- df = n - 1 (degrees of freedom)
Use t-distribution tables or calculator for p-values.
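As a quick sketch of both formulas, the z and t statistics and their two-tailed p-values can be computed directly from summary statistics; the numbers below are placeholders, not from any real study.

```python
from scipy import stats
import math

# Placeholder summary statistics (illustrative only)
x_bar, mu0, n = 104.0, 100.0, 36

# Z-test: population standard deviation sigma assumed known
sigma = 12.0
z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_z = 2 * stats.norm.sf(abs(z))          # two-tailed p-value from the normal distribution

# T-test: sigma unknown, so the sample standard deviation s is used with n - 1 df
s = 13.5
t = (x_bar - mu0) / (s / math.sqrt(n))
p_t = 2 * stats.t.sf(abs(t), n - 1)      # two-tailed p-value from the t distribution

print(f"z = {z:.2f} (p = {p_z:.4f}),  t = {t:.2f} (p = {p_t:.4f})")
```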
Significance Level and P-Values
Key Concepts
Significance Level (α): The threshold for rejecting H₀. Common values: 0.05 (5%) or 0.01 (1%).
P-value: The probability of obtaining results at least as extreme as observed, assuming H₀ is true.
If p-value ≤ α → Reject H₀ (significant result)
If p-value > α → Fail to reject H₀ (not significant)
Interpreting P-Values
| P-value | Evidence Against H₀ |
|---|---|
| > 0.10 | Little or no evidence |
| 0.05 - 0.10 | Weak evidence |
| 0.01 - 0.05 | Moderate evidence |
| 0.001 - 0.01 | Strong evidence |
| < 0.001 | Very strong evidence |
Types of Errors
| | H₀ is True | H₀ is False |
|---|---|---|
| Reject H₀ | Type I Error (α) - False Positive | Correct Decision (Power) |
| Fail to Reject H₀ | Correct Decision | Type II Error (β) - False Negative |
Understanding Errors
Type I Error: Rejecting H₀ when it's actually true (false alarm). Probability = α
Type II Error: Failing to reject H₀ when it's actually false (missed detection). Probability = β
Power: Probability of correctly rejecting a false H₀. Power = 1 - β
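One way to build intuition for α, β, and power is simulation. The sketch below repeatedly draws samples and counts how often a two-tailed z-test rejects H₀; the particular means, σ, and sample size are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(true_mu, mu0=100, sigma=15, n=30, alpha=0.05, trials=10_000):
    """Fraction of simulated samples in which a two-tailed z-test rejects H0: mu = mu0."""
    samples = rng.normal(true_mu, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
    p_values = 2 * stats.norm.sf(np.abs(z))
    return np.mean(p_values <= alpha)

# When H0 is true, the rejection rate estimates alpha (the Type I error rate)
print("Estimated Type I error rate:", rejection_rate(true_mu=100))

# When H0 is false, the rejection rate estimates power (1 - beta)
print("Estimated power when mu = 106:", rejection_rate(true_mu=106))
```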
Confidence Intervals and Hypothesis Testing
Connection
There's a direct relationship between confidence intervals and two-tailed hypothesis tests (a small code sketch follows this list):
- A 95% confidence interval corresponds to α = 0.05
- If the hypothesized value falls outside the CI, reject H₀
- If the hypothesized value falls inside the CI, fail to reject H₀
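Here is a minimal sketch of that equivalence, assuming σ is known so a z-based interval applies; the summary statistics are placeholder values.

```python
from scipy import stats
import math

# Hypothetical summary statistics (illustrative only)
x_bar, sigma, n = 103.0, 12.0, 50
mu0, alpha = 100.0, 0.05

# Two-sided (1 - alpha) confidence interval for mu
z_crit = stats.norm.ppf(1 - alpha / 2)           # about 1.96 for alpha = 0.05
margin = z_crit * sigma / math.sqrt(n)
ci_low, ci_high = x_bar - margin, x_bar + margin

# Equivalent two-tailed test decision: reject H0 exactly when mu0 falls outside the CI
decision = "fail to reject H0" if ci_low <= mu0 <= ci_high else "reject H0"
print(f"95% CI = ({ci_low:.1f}, {ci_high:.1f}) -> {decision}")
```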
Examples
Example 1: Two-Tailed Z-Test
Problem: A company claims batteries last 500 hours on average. A sample of 36 batteries has mean life 490 hours. Population σ = 30 hours. Test at α = 0.05.
Solution:
Step 1: State hypotheses
H₀: μ = 500 (batteries last 500 hours as claimed)
H₁: μ ≠ 500 (batteries don't last 500 hours)
Step 2: Calculate test statistic
z = (x̄ - μ₀) / (σ / √n) = (490 - 500) / (30 / √36)
z = -10 / (30/6) = -10 / 5 = -2.0
Step 3: Find p-value
For z = -2.0, P(Z < -2.0) = 0.0228
Two-tailed p-value = 2(0.0228) = 0.0456
Step 4: Decision
Since 0.0456 < 0.05, we reject H₀
Conclusion: There is sufficient evidence at α = 0.05 that the true mean battery life differs from 500 hours.
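The same calculation can be checked with a few lines of Python, a sketch using scipy for the normal tail probability:

```python
from scipy import stats
import math

# Summary statistics from the battery example
x_bar, mu0, sigma, n = 490, 500, 30, 36

z = (x_bar - mu0) / (sigma / math.sqrt(n))       # -2.0
p_value = 2 * stats.norm.sf(abs(z))              # two-tailed p-value, about 0.046

print(f"z = {z:.2f}, p-value = {p_value:.4f}")   # reject H0 at alpha = 0.05
```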
Example 2: One-Tailed T-Test
Problem: A teacher claims a new method improves test scores above 75. A sample of 25 students has mean 78 with s = 10. Test at α = 0.05.
Solution:
Step 1: State hypotheses
H₀: μ ≤ 75 (mean is at most 75)
H₁: μ > 75 (mean is greater than 75)
Step 2: Calculate test statistic
t = (78 - 75) / (10 / √25) = 3 / 2 = 1.5
df = 25 - 1 = 24
Step 3: Find p-value
Using t-table with df = 24: P(t > 1.5) ≈ 0.073
Step 4: Decision
Since 0.073 > 0.05, we fail to reject H₀
Conclusion: There is not sufficient evidence at α = 0.05 to conclude that the new method improves scores above 75.
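A sketch of the same computation using scipy's t distribution:

```python
from scipy import stats
import math

# Summary statistics from the test-score example
x_bar, mu0, s, n = 78, 75, 10, 25
df = n - 1

t_stat = (x_bar - mu0) / (s / math.sqrt(n))      # 1.5
p_value = stats.t.sf(t_stat, df)                 # right-tailed p-value, about 0.073

print(f"t = {t_stat:.2f}, df = {df}, p-value = {p_value:.3f}")  # fail to reject H0
```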
Example 3: Proportion Test
Problem: A company claims 80% of customers are satisfied. In a survey of 200 customers, 150 were satisfied. Test at α = 0.05.
Solution:
Step 1: State hypotheses
H₀: p = 0.80
H₁: p ≠ 0.80
Step 2: Calculate test statistic
p̂ = 150/200 = 0.75
z = (p̂ - p₀) / √(p₀(1-p₀)/n)
z = (0.75 - 0.80) / √(0.80 × 0.20 / 200)
z = -0.05 / √(0.0008) = -0.05 / 0.0283 = -1.77
Step 3: Find p-value
Two-tailed p-value = 2 × P(Z < -1.77) = 2(0.0384) = 0.077
Step 4: Decision
Since 0.077 > 0.05, we fail to reject H₀
Conclusion: There is not sufficient evidence to conclude the satisfaction rate differs from 80%.
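A sketch of the proportion test in Python, using the same normal approximation as the worked solution:

```python
from scipy import stats
import math

# Survey results from the satisfaction example
x, n, p0 = 150, 200, 0.80
p_hat = x / n                                    # 0.75

se = math.sqrt(p0 * (1 - p0) / n)                # standard error under H0
z = (p_hat - p0) / se                            # about -1.77
p_value = 2 * stats.norm.sf(abs(z))              # two-tailed p-value, about 0.077

print(f"z = {z:.2f}, p-value = {p_value:.3f}")   # fail to reject H0 at alpha = 0.05
```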
Example 4: Understanding Type I and Type II Errors
Problem: A drug company tests whether a new medication lowers blood pressure. Describe the Type I and Type II errors in context.
Solution:
H₀: The drug has no effect on blood pressure
H₁: The drug lowers blood pressure
Type I Error: Concluding the drug works when it actually doesn't. Consequence: Patients take an ineffective medication, possibly instead of treatments that work.
Type II Error: Concluding the drug doesn't work when it actually does. Consequence: An effective treatment is abandoned, patients miss beneficial medication.
Which is worse? It depends on context. If the drug has side effects, a Type I error might be worse. If the condition is serious and alternatives are limited, a Type II error might be worse.
Example 5: Using Confidence Intervals
Problem: A 95% CI for mean weight loss is (2.1, 5.3) pounds. Test H₀: μ = 0 vs H₁: μ ≠ 0 at α = 0.05.
Solution:
The hypothesized value μ = 0 falls outside the 95% CI (2.1, 5.3)
Therefore, we reject H₀ at α = 0.05
Conclusion: There is statistically significant evidence that the mean weight loss is different from zero. Since the entire CI is positive, there's evidence of actual weight loss (2.1 to 5.3 pounds on average).
Practice
Apply your understanding of hypothesis testing.
1. Which hypothesis contains the "=" sign?
A) Alternative hypothesis B) Null hypothesis C) Research hypothesis D) None of them
2. A p-value of 0.03 means:
A) 3% chance H₀ is true B) 3% chance of getting a result at least this extreme if H₀ is true C) 97% chance H₁ is true D) 3% chance of Type II error
3. At α = 0.05, which p-value leads to rejecting H₀?
A) 0.06 B) 0.10 C) 0.04 D) 0.08
4. Sample: n = 49, x̄ = 82, σ = 14. Test H₀: μ = 80 vs H₁: μ > 80. Find z.
A) 0.5 B) 1.0 C) 1.5 D) 2.0
5. A Type I error occurs when:
A) H₀ is rejected when false B) H₀ is rejected when true C) H₀ is not rejected when true D) H₀ is not rejected when false
6. The power of a test is:
A) Probability of Type I error B) Probability of Type II error C) 1 minus probability of Type II error D) The significance level
7. A 99% CI for μ is (12, 18). Which conclusion follows for testing H₀: μ = 15 at α = 0.01?
A) Reject H₀ B) Fail to reject H₀ C) Cannot determine D) Test is invalid
8. Increasing sample size generally:
A) Increases power B) Increases Type I error C) Increases Type II error D) Has no effect
9. For a left-tailed test with z = -2.1, the p-value is approximately:
A) 0.018 B) 0.036 C) 0.964 D) 0.982
10. "Statistically significant" means:
A) The result is practically important B) The p-value is less than α C) The effect size is large D) The null hypothesis is true
Click to reveal answers
- B) Null hypothesis - H₀ always contains equality
- B) 3% chance of getting a result at least this extreme if H₀ is true
- C) 0.04 - only p-value less than 0.05
- B) 1.0 - z = (82-80)/(14/7) = 2/2 = 1.0
- B) H₀ is rejected when true (false positive)
- C) 1 minus probability of Type II error (1 - β)
- B) Fail to reject H₀ - 15 is inside the CI
- A) Increases power - more data means better detection
- A) 0.018 - left tail area for z = -2.1
- B) The p-value is less than α
Check Your Understanding
1. Why do we "fail to reject H₀" rather than "accept H₀"?
Show answer
We say "fail to reject" because not finding evidence against H₀ doesn't prove H₀ is true - it just means we don't have enough evidence to conclude it's false. Absence of evidence isn't evidence of absence. Multiple reasons could explain insufficient evidence: small sample size, high variability, or the true effect might be small but real. The burden of proof is on rejecting H₀, not proving it.
2. Why might a statistically significant result not be practically significant?
Show answer
Statistical significance only tells us an effect is unlikely due to chance - not that it's meaningful. With large samples, even tiny differences become statistically significant. For example, a study with n=10,000 might find that a drug reduces blood pressure by 0.5 mmHg (p < 0.001), but this effect is too small to matter clinically. Always consider effect size and practical implications alongside p-values.
3. How do you choose between a one-tailed and two-tailed test?
Show answer
Use a one-tailed test when: (1) you have a directional hypothesis BEFORE seeing data (e.g., "the new drug will LOWER blood pressure"), (2) only one direction matters for your decision, (3) you have theoretical reason to expect a specific direction. Use a two-tailed test when: (1) any difference (higher or lower) is of interest, (2) you're being conservative, (3) you're exploring without prior expectations. Two-tailed is more common and more conservative.
4. Explain the trade-off between Type I and Type II errors. How does α affect both?
Show answer
There's an inverse relationship: decreasing α (being stricter) reduces Type I errors but increases Type II errors (more false negatives). Increasing α does the opposite. Choosing α depends on which error is worse in your context. In drug testing, for example, the cost of a Type I error (approving an ineffective drug) must be weighed against the cost of a Type II error (rejecting an effective treatment). The only way to reduce both errors simultaneously is to increase the sample size or reduce variability.
🚀 Next Steps
- Review any concepts that felt challenging
- Move on to the next lesson when ready
- Return to practice problems periodically for review