Data-Driven Decisions | Open Textbooks

Learn

Introduction to Data-Driven Decision Making

In today's world, decisions in business, science, healthcare, and policy are increasingly based on data analysis rather than intuition alone. Understanding how to interpret data, identify patterns, and draw valid conclusions is an essential skill for college and career success.

Data-Driven Decision Making

Data-driven decision making (DDDM) is the practice of basing decisions on data analysis and interpretation rather than intuition or observation alone. It involves collecting relevant data, analyzing it using appropriate methods, and using the insights to guide actions.

The Data Analysis Process

Steps in Data Analysis

1. Define the Question: What decision needs to be made? What information would help?
2. Collect Data: Gather relevant, reliable data from appropriate sources
3. Clean and Organize: Remove errors, handle missing values, format consistently
4. Analyze: Apply statistical methods to find patterns and relationships
5. Interpret: Draw conclusions in context of the original question
6. Communicate: Present findings clearly to support decision-making
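
As a rough illustration of the "Clean and Organize" step, the sketch below drops missing and impossible entries and converts everything to one numeric format. The raw values, the function name clean_scores, and the valid range of 0 to 100 are assumptions made up for this example; real cleaning rules depend on the data being collected.

```python
def clean_scores(raw_values, low=0.0, high=100.0):
    """Return valid numeric scores, dropping missing or impossible entries."""
    cleaned = []
    for value in raw_values:
        if value is None:
            continue  # missing value: skip it
        try:
            number = float(str(value).strip())  # format consistently as float
        except ValueError:
            continue  # not a number (e.g. "n/a"): treat as an error
        if low <= number <= high:
            cleaned.append(number)  # keep only values in the plausible range
    return cleaned


# Hypothetical raw survey data with typical problems
raw = ["85", " 90 ", None, "n/a", "250", 78, "88.5"]
scores = clean_scores(raw)
print(scores)  # [85.0, 90.0, 78.0, 88.5]
```

Silently dropping bad entries is only one option; in practice it is often worth counting how many values were removed and why.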

Types of Data

Data Type	Description	Examples	Analysis Methods
Quantitative Discrete	Countable numbers	Number of customers, defects	Counting, frequencies
Quantitative Continuous	Measurable quantities	Temperature, weight, time	Mean, std deviation, regression
Categorical Nominal	Categories without order	Color, brand, gender	Mode, frequency tables
Categorical Ordinal	Ordered categories	Rating scales, education level	Median, percentiles

Descriptive Statistics for Decision Making

Measures of Center

Mean: Average value; sensitive to outliers
Median: Middle value when sorted; robust to outliers
Mode: Most frequent value; useful for categorical data

Mean = (Sum of all values) / (Number of values)
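
The three measures of center can be computed with Python's standard statistics module. The sales figures and color list below are hypothetical, chosen so that one unusually large value shows how the mean and median react differently.

```python
import statistics

# Hypothetical daily sales counts; the last value is an unusually busy day
sales = [12, 15, 15, 18, 20, 95]

mean_sales = statistics.mean(sales)      # pulled upward by the outlier
median_sales = statistics.median(sales)  # middle of the sorted values
mode_sales = statistics.mode(sales)      # most frequent value

print(mean_sales, median_sales, mode_sales)

# Mode also works for categorical data, where mean and median do not apply
colors = ["red", "blue", "red", "green"]
print(statistics.mode(colors))
```

Here the single busy day lifts the mean to about 29, while the median stays at 16.5, close to a typical day.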

Measures of Spread

Range: Maximum - Minimum (simple but sensitive to outliers)
Interquartile Range (IQR): Q3 - Q1 (middle 50% of data)
Standard Deviation: Typical distance of values from the mean (the square root of the variance)
Variance: Standard deviation squared
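
For reference, the sample variance and sample standard deviation are commonly written as follows. The symbols x_i (the individual values), x̄ (the mean), and n (the number of values) are notation introduced here; dividing by n instead of n - 1 gives the population versions.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```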

Comparing Data Sets

When making decisions between options, compare both center and spread:

Higher mean might indicate better overall performance
Lower standard deviation indicates more consistency/reliability
Consider the context: Is consistency or peak performance more important?

The Five-Number Summary

A comprehensive snapshot of data distribution:

Minimum
First Quartile (Q1) - 25th percentile
Median (Q2) - 50th percentile
Third Quartile (Q3) - 75th percentile
Maximum

This forms the basis for box plots, which visually compare distributions.
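
A minimal sketch of computing the five-number summary is shown below. It assumes the "median of halves" convention for quartiles (the median itself is left out of both halves when the count is odd) and at least two data values; calculators and software sometimes use other quartile conventions and can give slightly different Q1 and Q3. The function name is illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]              # values below the median position
    upper = data[len(data) - half:]  # values above the median position
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )


summary = five_number_summary([12, 15, 18, 20, 22, 25, 28])
print(summary)  # (12, 15, 20, 25, 28)
```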

Correlation vs. Causation

Critical Distinction

Correlation: Two variables move together (positive or negative relationship)

Causation: One variable directly causes changes in another

Correlation does NOT imply causation!

Third variables (confounders) may explain observed correlations. Well-designed controlled experiments, in which the treatment is assigned at random, are the most reliable way to establish causation.
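
The effect of a confounder can be illustrated with a small simulation. In the made-up data below, temperature determines both ice cream sales and pool visits; neither of those two variables appears in the other's formula, yet they end up strongly correlated. All numbers and formulas are assumptions for illustration only.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up monthly data: temperature drives both of the other variables
temperature = [5, 10, 15, 20, 25, 30]
ice_cream_sales = [20 + 4 * t for t in temperature]   # depends only on temperature
pool_visits = [2 * t + (t % 10) for t in temperature]  # depends only on temperature

r = pearson_r(ice_cream_sales, pool_visits)
print(round(r, 3))  # close to 1 even though neither variable causes the other
```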

Using Data Visualizations

Chart Type	Best For	Key Features to Examine
Bar Chart	Comparing categories	Relative heights, ordering
Histogram	Distribution of continuous data	Shape, center, spread, outliers
Box Plot	Comparing distributions	Median, IQR, outliers
Scatter Plot	Relationship between variables	Direction, strength, form
Line Graph	Trends over time	Increases, decreases, patterns

Making Predictions from Data

Linear Regression for Prediction

When two variables have a linear relationship, use the regression equation:

y = mx + b

Where m (slope) = change in y per unit change in x, and b (y-intercept) = predicted value of y when x = 0

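R-squared (R²): The proportion of the variation in y that is explained by the line (0 to 1). Values closer to 1 mean the line fits the data more closely, which generally gives more reliable predictions within the data range.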

Warning: Don't extrapolate far beyond your data range!
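
For readers who want to see where m, b, and R² come from, the usual least-squares formulas are sketched below. The notation is introduced here and is not from the lesson: x̄ and ȳ are the means of x and y, n is the number of data points, and ŷ_i = m x_i + b is the value the line predicts for the i-th point.

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```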

Examples

Example 1: Comparing Two Options

Problem: A company is choosing between two suppliers. Quality scores (1-100) from samples:

Supplier A: 85, 88, 82, 90, 85, 87, 84, 89

Supplier B: 95, 75, 88, 72, 92, 78, 90, 70

Which supplier should they choose?

Solution:

Supplier A:

Mean = (85+88+82+90+85+87+84+89)/8 = 690/8 = 86.25

Sample standard deviation ≈ 2.71 (calculated)

Supplier B:

Mean = (95+75+88+72+92+78+90+70)/8 = 660/8 = 82.5

Sample standard deviation ≈ 9.83 (calculated)

Decision: Supplier A has both a higher mean (86.25 vs 82.5) and lower variability (SD 2.71 vs 9.83). Supplier A is more reliable and consistently delivers higher quality.
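
The supplier statistics can be recomputed with Python's statistics module, as sketched below. The exact standard deviation depends on whether the sample formula (divide by n - 1) or the population formula (divide by n) is used, so the sketch prints both; either way, Supplier B's spread is several times larger than Supplier A's.

```python
import statistics

supplier_a = [85, 88, 82, 90, 85, 87, 84, 89]
supplier_b = [95, 75, 88, 72, 92, 78, 90, 70]

for name, scores in (("A", supplier_a), ("B", supplier_b)):
    mean = statistics.mean(scores)
    sample_sd = statistics.stdev(scores)       # divides by n - 1
    population_sd = statistics.pstdev(scores)  # divides by n
    print(f"Supplier {name}: mean={mean:.2f}, "
          f"sample SD={sample_sd:.2f}, population SD={population_sd:.2f}")
```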

Example 2: Identifying Correlation

Problem: Data shows hours studied and test scores for 6 students:

Hours	2	3	4	5	6	7
Score	65	70	78	82	88	92

Find the correlation and make predictions.

Solution:

Calculate correlation coefficient (r):

Using the formula or calculator: r ≈ 0.996 (very strong positive correlation)

Find regression equation:

Slope m ≈ 5.514 (about 5.5 points gained per additional hour studied)

y-intercept b ≈ 54.352

Equation: Score = 5.514(Hours) + 54.352

Predict score for 8 hours:

Score = 5.514(8) + 54.352 ≈ 98.5

Because 8 hours lies just outside the observed range (2 to 7 hours), treat this prediction as a mild extrapolation.

Note: This is a correlation. We cannot conclude studying causes higher scores - motivated students might both study more AND perform better.
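
The correlation and regression line for this example can be computed directly from the six data points, as in the sketch below. The function name fit_line is illustrative; it applies the standard least-squares formulas, and results rounded by hand may differ slightly in the last digit.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept, and correlation r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


hours = [2, 3, 4, 5, 6, 7]
scores = [65, 70, 78, 82, 88, 92]

slope, intercept, r = fit_line(hours, scores)
predicted_8 = slope * 8 + intercept  # 8 hours is just outside the observed 2-7 range

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
print(f"predicted score for 8 hours: {predicted_8:.1f}")
```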

Example 3: Interpreting Box Plots

Problem: Box plots show customer wait times (minutes) at two stores:

Store A: Min=2, Q1=5, Median=8, Q3=12, Max=18

Store B: Min=4, Q1=6, Median=7, Q3=9, Max=25 (with outlier at 25)

Compare the stores and advise customers.

Solution:

Store A:

Median wait: 8 minutes

IQR: 12 - 5 = 7 minutes

The middle 50% of wait times fall between 5 and 12 minutes

Store B:

Median wait: 7 minutes (slightly lower)

IQR: 9 - 6 = 3 minutes (more consistent)

The middle 50% of wait times fall between 6 and 9 minutes

Outlier at 25 minutes indicates occasional long waits

Advice: Store B typically has shorter and more predictable wait times. However, there's a small chance of a very long wait (outlier). For most customers, Store B is the better choice.
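
One common convention for flagging outliers in a box plot, assumed here because the lesson does not state its rule, is to mark any value more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule to the two stores' summaries.

```python
def outlier_fences(q1, q3):
    """Return (IQR, lower fence, upper fence) using the common 1.5 x IQR rule."""
    iqr = q3 - q1
    return iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr


stores = {
    "A": {"min": 2, "q1": 5, "median": 8, "q3": 12, "max": 18},
    "B": {"min": 4, "q1": 6, "median": 7, "q3": 9, "max": 25},
}

for name, s in stores.items():
    iqr, low, high = outlier_fences(s["q1"], s["q3"])
    has_high_outlier = s["max"] > high
    print(f"Store {name}: IQR={iqr}, upper fence={high}, "
          f"max={s['max']}, max is an outlier: {has_high_outlier}")
```

Under this rule, Store B's 25-minute wait lies above its upper fence of 13.5 minutes, while Store A's maximum of 18 minutes stays below its fence of 22.5 minutes.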

Example 4: Using Percentiles for Decisions

Problem: A college considers SAT scores. The score distribution has mean 1050 and standard deviation 200. What minimum score puts a student in the top 16%?

Solution:

Top 16% means the 84th percentile (100% - 16% = 84%)

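Using the empirical rule (68-95-99.7 rule), which assumes the scores are approximately normally distributed:

About 68% of scores fall within 1 standard deviation of the mean, leaving about 32% in the two tails, or about 16% above the mean plus 1 standard deviation

So the 84th percentile is approximately 1 standard deviation above the mean

Score = Mean + 1(SD) = 1050 + 1(200) = 1250

Students scoring about 1250 or above are in the top 16%.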
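
The empirical-rule estimate can be checked against an exact normal model, assuming the scores really are normally distributed with the stated mean and standard deviation. The sketch uses NormalDist from Python's standard statistics module.

```python
from statistics import NormalDist

# Assumes scores are approximately normal with the stated mean and SD
sat = NormalDist(mu=1050, sigma=200)

cutoff = sat.inv_cdf(0.84)            # score at the 84th percentile
share_above_1250 = 1 - sat.cdf(1250)  # fraction scoring above mean + 1 SD

print(round(cutoff), round(share_above_1250, 3))
```

The exact cutoff comes out within a couple of points of 1250, which is why the empirical rule is a reasonable shortcut here.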

Example 5: Detecting Misleading Statistics

Problem: A company claims "Average salary: $120,000!" but employees complain about low pay. The salaries are: $40K, $45K, $50K, $55K, $60K, and one executive at $470K. Analyze the claim.

Solution:

Mean: (40+45+50+55+60+470)/6 = 720/6 = $120,000 (True but misleading!)

Median: Middle of 40, 45, 50, 55, 60, 470 = (50+55)/2 = $52,500

Analysis: The mean is heavily skewed by one outlier (the executive). The median of $52,500 better represents what a typical employee earns.

Conclusion: The claim is technically true but misleading. When data has outliers, the median is a more appropriate measure of center. Most employees earn around $50K, not $120K.

Practice

Apply your data analysis skills to make informed decisions.

1. Which measure of center is most appropriate for home prices in a neighborhood with a few mansions?

A) Mean B) Median C) Mode D) Range

2. Dataset: 12, 15, 18, 20, 22, 25, 28. What is the IQR?

A) 7 B) 10 C) 13 D) 16

3. A scatter plot shows points clustered tightly around a downward-sloping line. The correlation is:

A) Strong positive B) Weak positive C) Strong negative D) No correlation

4. R² = 0.81 for a regression model. What does this mean?

A) 81% of data points are on the line B) 81% of variation in y is explained by x C) Correlation is 0.81 D) Slope is 0.81

5. Ice cream sales and drowning deaths both increase in summer. This is an example of:

A) Causation B) Correlation with confounding variable C) Random chance D) Negative correlation

6. A box plot shows the median line very close to Q1. The distribution is:

A) Symmetric B) Skewed left C) Skewed right D) Bimodal

7. Data set: 5, 7, 8, 8, 9, 10, 12. The standard deviation is closest to:

A) 2 B) 4 C) 7 D) 8.4

8. A prediction using a regression line for x = 100 when data ranged from x = 10 to x = 30 is called:

A) Interpolation B) Extrapolation C) Correlation D) Residual analysis

9. In a normal distribution with mean 50 and SD 10, approximately what percent of data falls between 40 and 60?

A) 50% B) 68% C) 95% D) 99.7%

10. A study finds r = 0.95 between shoe size and reading ability in elementary students. We should conclude:

A) Bigger feet cause better reading B) Better reading causes bigger feet C) Age is likely a confounding variable D) The correlation is wrong

Answers

1. B) Median - robust to outliers (mansions)
2. B) 10 - Q1=15, Q3=25, IQR=25-15=10
3. C) Strong negative - tight cluster, downward slope
4. B) 81% of variation in y is explained by x
5. B) Correlation with confounding variable (summer/hot weather)
6. C) Skewed right - data bunched at low end, stretched toward high
7. A) 2 - Mean is about 8.4 and the sample standard deviation is about 2.2
8. B) Extrapolation - predicting outside the data range (risky!)
9. B) 68% - within one standard deviation of mean
10. C) Age is likely a confounding variable - older students have bigger feet AND read better

Check Your Understanding

1. A company wants to predict sales based on advertising spending. What type of analysis should they use, and what are the limitations?

They should use regression analysis to find the relationship between advertising (independent variable) and sales (dependent variable). Limitations include: (1) correlation doesn't prove causation - other factors affect sales, (2) the relationship may not be linear, (3) predictions are unreliable if extrapolating beyond the data range, (4) past relationships may not hold in the future, (5) other variables (confounders) like seasonality or competition should be considered.

2. When would you choose the median over the mean to describe a data set?

Use median when: (1) data has outliers that would skew the mean (like income or home prices), (2) data is skewed rather than symmetric, (3) you want to describe what a "typical" value is, (4) data has extreme values that don't represent the population, (5) dealing with ordinal data (rankings). The median is resistant to extreme values and better represents the center of skewed distributions.

3. Why is it dangerous to use correlation to imply causation? Give an example.

Correlation only shows that two variables move together, not that one causes the other. Dangers: (1) A third variable (confounder) might cause both, (2) the relationship could be coincidental, (3) the direction of causation might be reversed. Example: Countries with more cell phones have higher life expectancy - but cell phones don't cause longevity. Wealth is the confounding variable: wealthier countries have both more cell phones AND better healthcare.

4. How can data visualizations be misleading, and how should you critically evaluate them?

Common misleading tactics: (1) truncated y-axis makes small differences look large, (2) inappropriate scale or aspect ratio distorts trends, (3) cherry-picked time periods hide unfavorable data, (4) 3D effects distort proportions, (5) no labels or misleading labels. To evaluate critically: always check the axis scales and starting points, look for missing data or gaps, consider the source's potential bias, ask what data might have been excluded, and calculate actual percentages rather than relying on visual impressions.
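
The effect of a truncated y-axis can be quantified with simple arithmetic. In the sketch below, the sales figures and the axis starting point of 98 are hypothetical; the function compares how tall two bars are drawn relative to each other for a given axis start.

```python
def apparent_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of drawn bar heights when the y-axis starts at axis_start."""
    return (value_b - axis_start) / (value_a - axis_start)


# Hypothetical sales figures for two years
last_year, this_year = 100.0, 104.0

honest = apparent_ratio(last_year, this_year)                    # axis starts at 0
truncated = apparent_ratio(last_year, this_year, axis_start=98)  # axis starts at 98

print(f"actual increase: {(this_year / last_year - 1):.0%}")
print(f"bar height ratio, axis from 0: {honest:.2f}x")
print(f"bar height ratio, axis from 98: {truncated:.1f}x")
```

A 4% increase drawn on an axis starting at 98 makes the second bar three times as tall as the first, which is why checking the axis starting point matters.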
