Data-Driven Decisions
Learn
Introduction to Data-Driven Decision Making
In today's world, decisions in business, science, healthcare, and policy are increasingly based on data analysis rather than intuition alone. Understanding how to interpret data, identify patterns, and draw valid conclusions is an essential skill for college and career success.
Data-Driven Decision Making
Data-driven decision making (DDDM) is the practice of basing decisions on data analysis and interpretation rather than intuition or observation alone. It involves collecting relevant data, analyzing it using appropriate methods, and using the insights to guide actions.
The Data Analysis Process
Steps in Data Analysis
- Define the Question: What decision needs to be made? What information would help?
- Collect Data: Gather relevant, reliable data from appropriate sources
- Clean and Organize: Remove errors, handle missing values, format consistently
- Analyze: Apply statistical methods to find patterns and relationships
- Interpret: Draw conclusions in context of the original question
- Communicate: Present findings clearly to support decision-making
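As a rough illustration of the "Clean and Organize" and "Analyze" steps, the sketch below uses only Python's standard library. The raw entries are made up for illustration, and the `clean` helper is a hypothetical name, not part of any library.

```python
import statistics

# Hypothetical raw entries (invented for illustration): wait times in minutes,
# typed inconsistently, with a blank, a text placeholder, and an impossible value.
raw = ["12", " 15 ", "", "18.5", "n/a", "-3", "20", "22"]

def clean(values):
    """Keep entries that parse as non-negative numbers; drop everything else."""
    cleaned = []
    for value in values:
        try:
            number = float(value.strip())
        except ValueError:
            continue  # blank or non-numeric entry: treat as missing
        if number >= 0:  # a negative wait time can only be an entry error
            cleaned.append(number)
    return cleaned

wait_times = clean(raw)
print(wait_times)                   # [12.0, 15.0, 18.5, 20.0, 22.0]
print(statistics.mean(wait_times))  # 17.5
```

Dropping bad entries is only one possible choice. Depending on the question, it may be better to go back to the source and correct them.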
Types of Data
| Data Type | Description | Examples | Analysis Methods |
|---|---|---|---|
| Quantitative Discrete | Countable numbers | Number of customers, defects | Counting, frequencies |
| Quantitative Continuous | Measurable quantities | Temperature, weight, time | Mean, std deviation, regression |
| Categorical Nominal | Categories without order | Color, brand, gender | Mode, frequency tables |
| Categorical Ordinal | Ordered categories | Rating scales, education level | Median, percentiles |
Descriptive Statistics for Decision Making
Measures of Center
- Mean: Average value; sensitive to outliers
- Median: Middle value when sorted; robust to outliers
- Mode: Most frequent value; useful for categorical data
Mean = (Sum of all values) / (Number of values)
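A minimal sketch of the three measures of center, using Python's built-in `statistics` module on the data set from Practice question 7. The extra value of 100 is a hypothetical outlier added only to show how differently the mean and median react.

```python
import statistics

data = [5, 7, 8, 8, 9, 10, 12]

mean = statistics.mean(data)      # 59 / 7, about 8.43
median = statistics.median(data)  # 8 (middle of the sorted values)
mode = statistics.mode(data)      # 8 (appears twice)
print(round(mean, 2), median, mode)

# Add one hypothetical extreme value and compare again.
with_outlier = data + [100]
print(statistics.mean(with_outlier))    # 19.875: the mean jumps
print(statistics.median(with_outlier))  # 8.5: the median barely moves
```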
Measures of Spread
- Range: Maximum - Minimum (simple but sensitive to outliers)
- Interquartile Range (IQR): Q3 - Q1 (middle 50% of data)
- Standard Deviation: Typical distance of the values from the mean (the square root of the variance)
- Variance: Average squared deviation from the mean (standard deviation squared)
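The measures of spread above can be sketched with the standard library as follows. Two assumptions are worth flagging: `statistics.quantiles` uses its default "exclusive" quartile method (other tools may give slightly different quartiles), and `stdev`/`variance` are the sample versions that divide by n - 1.

```python
import statistics

data = [5, 7, 8, 8, 9, 10, 12]

data_range = max(data) - min(data)           # 12 - 5 = 7
q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                # 10.0 - 7.0 = 3.0

sample_sd = statistics.stdev(data)       # divides by n - 1, about 2.23
sample_var = statistics.variance(data)   # sample_sd squared, about 4.95
population_sd = statistics.pstdev(data)  # divides by n, about 2.06

print(data_range, iqr)
print(round(sample_sd, 2), round(sample_var, 2), round(population_sd, 2))
```

Use the sample versions when the data is a sample from a larger group, and the population versions when the data covers the whole group.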
Comparing Data Sets
When making decisions between options, compare both center and spread:
- Higher mean might indicate better overall performance
- Lower standard deviation indicates more consistency/reliability
- Consider the context: Is consistency or peak performance more important?
The Five-Number Summary
A comprehensive snapshot of data distribution:
- Minimum
- First Quartile (Q1) - 25th percentile
- Median (Q2) - 50th percentile
- Third Quartile (Q3) - 75th percentile
- Maximum
This forms the basis for box plots, which visually compare distributions.
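One way to compute a five-number summary, sketched with the standard library on the data set from Practice question 2. The function name `five_number_summary` is a hypothetical helper, and the quartiles follow the default "exclusive" method of `statistics.quantiles`; calculators and spreadsheets sometimes use other conventions.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, median, q3, ordered[-1]

summary = five_number_summary([12, 15, 18, 20, 22, 25, 28])
print(summary)  # (12, 15.0, 20.0, 25.0, 28)
```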
Correlation vs. Causation
Critical Distinction
Correlation: Two variables move together (positive or negative relationship)
Causation: One variable directly causes changes in another
Correlation does NOT imply causation!
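A third variable (a confounder) may influence both variables and explain the observed correlation. Well-designed controlled experiments, ideally with random assignment, are the most reliable way to establish causation.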
Using Data Visualizations
| Chart Type | Best For | Key Features to Examine |
|---|---|---|
| Bar Chart | Comparing categories | Relative heights, ordering |
| Histogram | Distribution of continuous data | Shape, center, spread, outliers |
| Box Plot | Comparing distributions | Median, IQR, outliers |
| Scatter Plot | Relationship between variables | Direction, strength, form |
| Line Graph | Trends over time | Increases, decreases, patterns |
Making Predictions from Data
Linear Regression for Prediction
When two variables have a linear relationship, use the regression equation:
y = mx + b
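Where m (slope) = change in y per unit change in x, and b (y-intercept) = predicted value of y when x = 0
R-squared (R²): The proportion of the variation in y that is explained by x (0 to 1). A higher R² means the line fits the observed data more closely, which generally makes predictions within the data range more reliable.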
Warning: Don't extrapolate far beyond your data range!
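The lesson does not say how m and b are found. Assuming the usual least-squares line, where x̄ and ȳ denote the means of the x and y values, the standard formulas are:

```latex
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
\]
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \,\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
R^2 = r^2
\]
```

The relation R² = r² holds for a regression line with a single predictor x, which is the case covered here.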
Examples
Example 1: Comparing Two Options
Problem: A company is choosing between two suppliers. Quality scores (1-100) from samples:
Supplier A: 85, 88, 82, 90, 85, 87, 84, 89
Supplier B: 95, 75, 88, 72, 92, 78, 90, 70
Which supplier should they choose?
Solution:
Supplier A:
Mean = (85+88+82+90+85+87+84+89)/8 = 86.25
Standard Deviation ≈ 2.71 (sample standard deviation, dividing by n - 1)
Supplier B:
Mean = (95+75+88+72+92+78+90+70)/8 = 82.5
Standard Deviation ≈ 9.83 (sample standard deviation, dividing by n - 1)
Decision: Supplier A has both a higher mean (86.25 vs 82.5) AND lower variability (SD 2.71 vs 9.83). Supplier A is more reliable and consistently delivers higher quality.
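The calculation in Example 1 can be checked with a short script. This sketch assumes the sample standard deviation (n - 1 divisor, `statistics.stdev`); the population version (`statistics.pstdev`) would give slightly smaller values. The `summarize` helper is a hypothetical name.

```python
import statistics

supplier_a = [85, 88, 82, 90, 85, 87, 84, 89]
supplier_b = [95, 75, 88, 72, 92, 78, 90, 70]

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)

mean_a, sd_a = summarize(supplier_a)
mean_b, sd_b = summarize(supplier_b)

print(f"Supplier A: mean = {mean_a:.2f}, SD = {sd_a:.2f}")  # mean = 86.25, SD = 2.71
print(f"Supplier B: mean = {mean_b:.2f}, SD = {sd_b:.2f}")  # mean = 82.50, SD = 9.83
```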
Example 2: Identifying Correlation
Problem: Data shows hours studied and test scores for 6 students:
| Hours | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| Score | 65 | 70 | 78 | 82 | 88 | 92 |
Find the correlation and make predictions.
Solution:
Calculate correlation coefficient (r):
Using the formula or a calculator: r ≈ 0.996 (very strong positive correlation)
Find regression equation:
Slope m ≈ 5.514 (about 5.5 additional points, on average, for each extra hour studied)
y-intercept b ≈ 54.352
Equation: Score ≈ 5.514(Hours) + 54.352
Predict score for 8 hours:
Score ≈ 5.514(8) + 54.352 ≈ 98.5 (8 hours lies just outside the observed range of 2 to 7 hours, so treat this as a mild extrapolation)
Note: This is a correlation. We cannot conclude studying causes higher scores - motivated students might both study more AND perform better.
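The numbers in Example 2 can be reproduced from scratch. The sketch below assumes an ordinary least-squares fit and the Pearson correlation coefficient; `fit_line` is a hypothetical helper written for this illustration.

```python
hours = [2, 3, 4, 5, 6, 7]
scores = [65, 70, 78, 82, 88, 92]

def fit_line(xs, ys):
    """Return least-squares slope, intercept, and Pearson r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

slope, intercept, r = fit_line(hours, scores)
predicted = slope * 8 + intercept

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
# slope = 5.514, intercept = 54.352, r = 0.996
print(f"predicted score for 8 hours = {predicted:.1f}")  # 98.5
```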
Example 3: Interpreting Box Plots
Problem: Box plots show customer wait times (minutes) at two stores:
Store A: Min=2, Q1=5, Median=8, Q3=12, Max=18
Store B: Min=4, Q1=6, Median=7, Q3=9, Max=25 (with outlier at 25)
Compare the stores and advise customers.
Solution:
Store A:
Median wait: 8 minutes
IQR: 12 - 5 = 7 minutes
The middle 50% of customers wait between 5 and 12 minutes
Store B:
Median wait: 7 minutes (slightly lower)
IQR: 9 - 6 = 3 minutes (more consistent)
The middle 50% of customers wait between 6 and 9 minutes
Outlier at 25 minutes indicates occasional long waits
Advice: Store B typically has shorter and more predictable wait times. However, there's a small chance of a very long wait (outlier). For most customers, Store B is the better choice.
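The example does not state how the 25-minute wait was identified as an outlier. A common convention, assumed here, flags any value more than 1.5 × IQR above Q3 (or below Q1). The sketch applies that rule to both stores; `upper_fence` is a hypothetical helper name.

```python
def upper_fence(q1, q3, k=1.5):
    """Upper cutoff for flagging outliers under the 1.5 * IQR convention."""
    return q3 + k * (q3 - q1)

stores = {
    "A": {"q1": 5, "q3": 12, "max": 18},
    "B": {"q1": 6, "q3": 9, "max": 25},
}

# True means the store's maximum wait lies beyond its upper fence.
flagged = {
    name: s["max"] > upper_fence(s["q1"], s["q3"])
    for name, s in stores.items()
}

print(upper_fence(5, 12))  # 22.5, so Store A's maximum of 18 is not an outlier
print(upper_fence(6, 9))   # 13.5, so Store B's maximum of 25 is an outlier
print(flagged)             # {'A': False, 'B': True}
```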
Example 4: Using Percentiles for Decisions
Problem: A college considers SAT scores. The scores are approximately normally distributed with mean 1050 and standard deviation 200. What minimum score puts a student in the top 16%?
Solution:
Top 16% means the 84th percentile (100% - 16% = 84%)
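Using the empirical rule (68-95-99.7 rule) for approximately normal data: about 68% of scores lie within 1 standard deviation of the mean, which leaves about 32% in the two tails, or 16% in each tail
So the 84th percentile is approximately 1 standard deviation above the mean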
Score = Mean + 1(SD) = 1050 + 1(200) = 1250
Students scoring 1250 or above are in the top 16%.
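The empirical rule is an approximation. Assuming the scores follow a normal distribution exactly, the cutoff can be checked with `statistics.NormalDist` from Python's standard library:

```python
from statistics import NormalDist

scores = NormalDist(mu=1050, sigma=200)

cutoff = scores.inv_cdf(0.84)            # score at the 84th percentile
share_above_1250 = 1 - scores.cdf(1250)  # fraction scoring above mean + 1 SD

print(round(cutoff, 1))            # about 1248.9
print(round(share_above_1250, 3))  # about 0.159
```

The exact cutoff is within a couple of points of 1250, so the empirical-rule estimate is adequate for this decision.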
Example 5: Detecting Misleading Statistics
Problem: A company claims "Average salary: $120,000!" but employees complain about low pay. The salaries are: $40K, $45K, $50K, $55K, $60K, and one executive at $470K. Analyze the claim.
Solution:
Mean: (40+45+50+55+60+470)/6 = 720/6 = $120,000 (True but misleading!)
Median: Middle of 40, 45, 50, 55, 60, 470 = (50+55)/2 = $52,500
Analysis: The mean is heavily skewed by one outlier (the executive). The median of $52,500 better represents what a typical employee earns.
Conclusion: The claim is technically true but misleading. When data has outliers, the median is a more appropriate measure of center. Most employees earn around $50K, not $120K.
Practice
Apply your data analysis skills to make informed decisions.
1. Which measure of center is most appropriate for home prices in a neighborhood with a few mansions?
A) Mean B) Median C) Mode D) Range
2. Dataset: 12, 15, 18, 20, 22, 25, 28. What is the IQR?
A) 7 B) 10 C) 13 D) 16
3. A scatter plot shows points clustered tightly around a downward-sloping line. The correlation is:
A) Strong positive B) Weak positive C) Strong negative D) No correlation
4. R² = 0.81 for a regression model. What does this mean?
A) 81% of data points are on the line B) 81% of variation in y is explained by x C) Correlation is 0.81 D) Slope is 0.81
5. Ice cream sales and drowning deaths both increase in summer. This is an example of:
A) Causation B) Correlation with confounding variable C) Random chance D) Negative correlation
6. A box plot shows the median line very close to Q1. The distribution is:
A) Symmetric B) Skewed left C) Skewed right D) Bimodal
7. Data set: 5, 7, 8, 8, 9, 10, 12. The standard deviation is closest to:
A) 2 B) 4 C) 7 D) 8.4
8. A prediction using a regression line for x = 100 when data ranged from x = 10 to x = 30 is called:
A) Interpolation B) Extrapolation C) Correlation D) Residual analysis
9. In a normal distribution with mean 50 and SD 10, approximately what percent of data falls between 40 and 60?
A) 50% B) 68% C) 95% D) 99.7%
10. A study finds r = 0.95 between shoe size and reading ability in elementary students. We should conclude:
A) Bigger feet cause better reading B) Better reading causes bigger feet C) Age is likely a confounding variable D) The correlation is wrong
Answers
1. B) Median - robust to outliers (the mansions)
2. B) 10 - Q1 = 15, Q3 = 25, IQR = 25 - 15 = 10
3. C) Strong negative - tight cluster, downward slope
4. B) 81% of variation in y is explained by x
5. B) Correlation with confounding variable (summer/hot weather)
6. C) Skewed right - data bunched at the low end, stretched toward the high end
7. A) 2 - the mean is about 8.4 and the sample standard deviation is about 2.2
8. B) Extrapolation - predicting outside the data range (risky!)
9. B) 68% - within one standard deviation of the mean
10. C) Age is likely a confounding variable - older students have bigger feet AND read better
Check Your Understanding
1. A company wants to predict sales based on advertising spending. What type of analysis should they use, and what are the limitations?
Answer:
They should use regression analysis to find the relationship between advertising (independent variable) and sales (dependent variable). Limitations include: (1) correlation doesn't prove causation - other factors affect sales, (2) the relationship may not be linear, (3) predictions are unreliable if extrapolating beyond the data range, (4) past relationships may not hold in the future, (5) other variables (confounders) like seasonality or competition should be considered.
2. When would you choose the median over the mean to describe a data set?
Answer:
Use median when: (1) data has outliers that would skew the mean (like income or home prices), (2) data is skewed rather than symmetric, (3) you want to describe what a "typical" value is, (4) data has extreme values that don't represent the population, (5) dealing with ordinal data (rankings). The median is resistant to extreme values and better represents the center of skewed distributions.
3. Why is it dangerous to use correlation to imply causation? Give an example.
Answer:
Correlation only shows that two variables move together, not that one causes the other. Dangers: (1) A third variable (confounder) might cause both, (2) the relationship could be coincidental, (3) the direction of causation might be reversed. Example: Countries with more cell phones have higher life expectancy - but cell phones don't cause longevity. Wealth is the confounding variable: wealthier countries have both more cell phones AND better healthcare.
4. How can data visualizations be misleading, and how should you critically evaluate them?
Answer:
Common misleading tactics: (1) truncated y-axis makes small differences look large, (2) inappropriate scale or aspect ratio distorts trends, (3) cherry-picked time periods hide unfavorable data, (4) 3D effects distort proportions, (5) no labels or misleading labels. To evaluate critically: always check the axis scales and starting points, look for missing data or gaps, consider the source's potential bias, ask what data might have been excluded, and calculate actual percentages rather than relying on visual impressions.