Statistics Module
Welcome to the Statistics Module!
This is your interactive resource designed to help you review, practice, and master core statistics concepts.
We've translated a standard statistics curriculum into an interactive experience. This module is fully responsive and works on both desktop and mobile. You'll find:
- Interactive Tables: Click buttons to see definitions and formulas for mean, median, and mode.
- Dynamic Charts: Visualize how data distributions, like the bell curve, actually look.
- Practice Terminals: Test your knowledge in each section with hands-on questions and get instant feedback.
- Big Ideas: A special section for the "gotchas" and powerful concepts, like "Correlation is not Causation."
Use the menu to navigate between topics. Let's get started!
1. What is Statistics?
Statistics is the science of collecting, organizing, analyzing, and interpreting data. It helps us make sense of the world and make decisions in the face of uncertainty.
Descriptive Statistics
This is about *describing* and *summarizing* data you have. Think of it as painting a picture of your data.
- Example: Calculating the class average (the *mean*) on a test.
- Example: Creating a bar chart to show how many people prefer coffee, tea, or soda.
- Key goal: To summarize. You're not guessing about anything larger.
Inferential Statistics
This is about using data from a small group (a *sample*) to make an educated guess, or *inference*, about a much larger group (the *population*).
- Example: A political poll asks 1,000 people who they'll vote for to *predict* the entire country's election outcome.
- Example: A drug company tests a new medicine on 500 patients to see if it's effective for *everyone*.
- Key goal: To predict, forecast, or generalize.
Types of Data
How you analyze data depends on what *kind* of data it is.
Categorical (Qualitative)
Data that fits into categories. Think descriptions, labels, or groups.
- Examples: Eye color ("Blue", "Brown"), favorite brand ("Nike", "Adidas"), Yes/No answers.
Quantitative (Numerical)
Data that consists of numbers you can do math with (like finding an average).
- Examples: Height (175cm), temperature (72°F), number of students (30).
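The distinction matters in practice: categorical data gets counted, quantitative data gets computed on. A quick sketch using made-up example data:

```python
from collections import Counter

eye_colors = ["Blue", "Brown", "Brown", "Green", "Brown"]  # categorical
heights_cm = [175, 168, 182, 171, 160]                     # quantitative

# Categorical data: count how often each category appears (no averaging!).
color_counts = Counter(eye_colors)
print(color_counts)  # Counter({'Brown': 3, 'Blue': 1, 'Green': 1})

# Quantitative data: arithmetic makes sense, e.g. the mean height.
mean_height = sum(heights_cm) / len(heights_cm)
print(mean_height)  # 171.2
```

Note that "averaging" eye colors is meaningless, which is exactly why the two types are analyzed differently.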
Practice Time
Q1: You poll 500 voters to predict a major election. Is this 'descriptive' or 'inferential' statistics? (Type your answer)
> Terminal ready. Awaiting answer...
2. Central Tendency
"Central tendency" is a fancy term for describing the "center" or "typical" value of a dataset. These are the measures of center you'll use most often.
| Measure | How to Find It | Description |
|---|---|---|
| Mean (x̄) | Add all values, then divide by the count: x̄ = Σx / n | The arithmetic average. Sensitive to outliers. |
| Median | Sort the data and take the middle value (or the average of the two middle values). | The 50th percentile. Resistant to outliers. |
| Mode | Find the value that appears most often. | The most common value. Also works for categorical data. |
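Python's standard `statistics` module computes all three directly. A minimal sketch with illustrative data:

```python
import statistics

data = [2, 3, 3, 5, 7]

mean_val = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7) / 5 = 4
median_val = statistics.median(data)  # middle of the sorted data: 3
mode_val = statistics.mode(data)      # most frequent value: 3

print(mean_val, median_val, mode_val)  # 4 3 3
```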
Practice Time
Q1: What is the mean of this dataset: [10, 20, 30]?
> Terminal ready. Awaiting answer...
3. Variability (Spread)
Knowing the "center" isn't enough. We also need to know how "spread out" the data is. Two cities can have the same average temperature, but very different climates!
Range
The simplest measure of spread. Just subtract the smallest value from the largest value.
Data: [5, 1, 10, 3]
Range: 10 (Max) - 1 (Min) = 9
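The same calculation in Python, using the dataset above:

```python
data = [5, 1, 10, 3]

# Range = largest value minus smallest value.
data_range = max(data) - min(data)
print(data_range)  # 10 - 1 = 9
```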
Quartiles & IQR
Quartiles cut your *sorted* data into four equal parts.
- Q1: The 25th percentile (the median of the lower half).
- Q2: The 50th percentile (the median).
- Q3: The 75th percentile (the median of the upper half).
IQR (Interquartile Range): Q3 - Q1. It's the "range of the middle 50%".
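A sketch of the median-of-halves method described above, on an illustrative even-length dataset (so the halves split cleanly). Note that `statistics.quantiles` also exists but uses interpolation methods that can give slightly different quartile values.

```python
import statistics

data = sorted([1, 2, 3, 4, 5, 6, 7, 8])
mid = len(data) // 2

q1 = statistics.median(data[:mid])  # median of the lower half: 2.5
q2 = statistics.median(data)        # the overall median:       4.5
q3 = statistics.median(data[mid:])  # median of the upper half: 6.5

iqr = q3 - q1  # range of the middle 50%: 6.5 - 2.5 = 4.0
print(q1, q2, q3, iqr)
```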
Variance & Standard Deviation
Variance (s²)
The average of the *squared* distances from the Mean. It measures the *total* amount of spread, but its units are squared (e.g., "dollars-squared"), which is weird.
s² = Σ(x − x̄)² / (n − 1)
Standard Deviation (s)
The king of variability. It's the *square root* of the variance, which puts the units back to normal (e.g., "dollars").
It's the *typical* or *average* distance a data point is from the mean.
s = √s²
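Both formulas are built into the `statistics` module. A quick sketch with illustrative data:

```python
import math
import statistics

data = [1, 2, 3, 4, 5]  # mean is 3

# Sample variance: Σ(x - x̄)² / (n - 1) = (4+1+0+1+4) / 4 = 2.5
var = statistics.variance(data)
print(var)  # 2.5

# Standard deviation: the square root of the variance.
sd = statistics.stdev(data)
print(sd)  # √2.5 ≈ 1.58
```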
Practice Time
Q1: What is the range of this dataset: [100, 20, 5, 50]?
> Terminal ready. Awaiting answer...
4. Probability Basics
Probability is the language of uncertainty. It's a number between 0 and 1 that describes how likely an event is to occur.
Defining Probability
# Simple Probability
# P(Event) = (Ways Event Can Happen) / (Total Possible Outcomes)

# Example: Rolling a 6 on a fair 6-sided die
Ways to roll a 6 = 1
Total outcomes = 6
P(Roll a 6) = 1 / 6 ≈ 0.167 (or 16.7%)
Combining Events
# P(A or B): Union
# For *mutually exclusive* events (can't happen at the same time):
P(Roll a 1 or a 6) = P(1) + P(6) = (1/6) + (1/6) = 2/6 = 1/3

# P(A and B): Intersection
# For *independent* events (one doesn't affect the other):
P(Flip 'Heads' AND Roll a '6') = P(Heads) × P(6) = (1/2) × (1/6) = 1/12
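Python's `fractions` module keeps these calculations exact, which avoids rounding surprises. A sketch of both rules:

```python
from fractions import Fraction

p_one = Fraction(1, 6)
p_six = Fraction(1, 6)
p_heads = Fraction(1, 2)

# Mutually exclusive events: ADD the probabilities.
p_one_or_six = p_one + p_six       # 1/6 + 1/6 = 1/3
print(p_one_or_six)                # 1/3

# Independent events: MULTIPLY the probabilities.
p_heads_and_six = p_heads * p_six  # 1/2 * 1/6 = 1/12
print(p_heads_and_six)             # 1/12
```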
The Probability Scale
The Complement Rule
The "complement" of an event (written P(A') or P(not A)) is the chance it *doesn't* happen.
P(not A) = 1 - P(A)
Example: If the probability of rain is P(Rain) = 0.3 (30%), then the probability of *no rain* is:
P(No Rain) = 1 - 0.3 = 0.7 (70%)
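The rain example above, as one line of arithmetic:

```python
p_rain = 0.3

# Complement rule: P(not A) = 1 - P(A)
p_no_rain = 1 - p_rain
print(p_no_rain)  # 0.7
```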
Practice Time
Q1: You have a bag with 4 red marbles and 6 blue marbles (10 total). What is the probability of pulling out a red marble? (Enter as a decimal, e.g., 0.5)
> Terminal ready. Awaiting answer...
5. Data Visualization
A picture is worth a thousand statistics. Visualizing your data is the best way to understand its "distribution" — its shape, center, and spread all at once.
A histogram is a bar chart that shows the *frequency* of data in certain "bins" or ranges. Let's look at two common shapes.
Practice Time
Q1: What is the common nickname for the "Normal Distribution" shown in the chart? (two words)
> Terminal ready. Awaiting answer...
6. Big Ideas & Tricky Stuff
This section covers the most important concepts and common mistakes. Getting these right will make you a statistical power user.
The #1 Mistake: Correlation is NOT Causation
What is Correlation?
It just means two variables *move together*. When one goes up, the other also tends to go up (or down).
Example: Ice Cream & Crime
There is a very strong, positive correlation between ice cream sales and crime rates. When ice cream sales increase, crime rates also increase.
Does this mean ice cream *causes* crime?
No! A *third variable* (a "lurking" variable), hot weather, causes both. Hot weather makes people buy ice cream, and it also makes people go outside more, leading to more conflict and crime.
Just because two things are related, it does *not* mean one causes the other. Always ask: "Could something else be causing both?"
Power Move: The 68-95-99.7 Rule
For any data that follows a Normal Distribution (a "bell curve"), this rule is a powerful shortcut.
It tells you what percentage of data falls within a certain number of standard deviations ($s$) from the mean ($x̄$).
- About 68% of all data falls within 1 standard deviation of the mean.
- About 95% of all data falls within 2 standard deviations of the mean.
- About 99.7% (almost all) of data falls within 3 standard deviations of the mean.
Example: SAT Scores
SAT scores are normally distributed with a Mean = 1000 and a Standard Deviation = 200.
- 68% of students score between 800 and 1200 (1000 ± 200).
- 95% of students score between 600 and 1400 (1000 ± 400).
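The SAT intervals above (and the 3-standard-deviation interval) can be generated mechanically from the rule. A minimal sketch:

```python
mean, sd = 1000, 200  # SAT example from the text

# The 68-95-99.7 rule: percentage of data within k standard deviations.
intervals = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

for k, pct in zip((1, 2, 3), ("68%", "95%", "99.7%")):
    low, high = intervals[k]
    print(f"~{pct} of scores fall between {low} and {high}")
# ~68% of scores fall between 800 and 1200
# ~95% of scores fall between 600 and 1400
# ~99.7% of scores fall between 400 and 1600
```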
The P-Value (Simplified)
The p-value is one of the most important (and misunderstood) concepts in inferential statistics.
It's a "probability of surprise."
The Logic
- You start with a "default" assumption, called the Null Hypothesis (e.g., "This coin is fair," or "This new drug does nothing").
- You collect data (e.g., you flip the coin 100 times and get 70 Heads).
- You calculate the p-value.
The p-value is the probability of seeing data *at least as extreme* as what you got, *assuming the null hypothesis is true*.
"If the coin really *is* fair, what's the chance of getting 70 Heads just by random luck?"
- High p-value: (e.g., p = 0.4) "This isn't surprising. A fair coin could do this 40% of the time. The drug probably does nothing."
- Low p-value: (e.g., p = 0.01) "This is very surprising! There's only a 1% chance of this happening by luck. The coin is probably *not* fair. The drug probably *works*!"
We "reject the null hypothesis" when the p-value is very low (usually < 0.05).
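The coin example can be made concrete. Under the null hypothesis (a fair coin), the number of heads in 100 flips follows a binomial distribution, so the p-value can be computed exactly with nothing but `math.comb`. This is a sketch of a two-sided p-value (counting results at least as extreme in either direction):

```python
from math import comb

def binom_pmf(n: int, k: int, p: float = 0.5) -> float:
    """P(exactly k successes in n trials) under the binomial distribution."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, observed = 100, 70

# P(70 or more heads) under a fair coin, doubled for the two-sided
# test (by symmetry, 30 or fewer heads is equally extreme).
p_value = 2 * sum(binom_pmf(n, k) for k in range(observed, n + 1))
print(p_value)  # far below 0.05: reject "the coin is fair"
```

Because the p-value is tiny, 70 heads would be very surprising under the null hypothesis, so we reject it.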
Advanced: Correlation (r)
The correlation coefficient, $r$, is a single number between -1 and +1 that measures the *strength* and *direction* of a *linear* relationship.
- $r$ = +1: A perfect, positive linear relationship. As X goes up by 1, Y goes up by a set amount.
- $r$ ≈ +0.7: A strong, positive linear relationship. (e.g., Height and Weight).
- $r$ ≈ 0: No *linear* relationship. The dots on a scatterplot look like a random cloud.
- $r$ ≈ -0.5: A moderate, negative linear relationship. (e.g., Hours of TV per week and GPA).
- $r$ = -1: A perfect, negative linear relationship.
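The formula behind $r$ is just standardized covariance. A self-contained sketch (Python 3.10+ also ships `statistics.correlation` for the same job):

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient: covariance / (spread_x * spread_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect positive linear relationship gives r = +1 ...
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
# ... and a perfect negative one gives r = -1.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```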
Advanced: Linear Regression
If correlation ($r$) tells you *if* two variables are related, regression tells you *how*. It gives you a specific formula to make predictions.
It's the "line of best fit" that you draw through a scatterplot of data.
The Formula
ŷ = b₀ + b₁x
- $ŷ$ (y-hat): The *predicted value* of Y.
- $x$: The value of X you are using to predict.
- $b_0$: The y-intercept. The predicted value of Y when X is 0.
- $b_1$: The slope. For every 1-unit increase in X, we predict Y will change by this amount.
Example:
Predicted_Salary = 30,000 + 2,000 × Years_of_Experience
The slope ($b_1$) is 2,000. This means for every 1 extra year of experience, we predict salary will increase by $2,000.
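The least-squares line can be fit with the closed-form formulas for slope and intercept. A sketch using illustrative salary data chosen to match the formula above (Python 3.10+ also provides `statistics.linear_regression`):

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least-squares intercept b0 and slope b1 for ŷ = b0 + b1·x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b0 = my - b1 * mx  # the line passes through the point of means (x̄, ȳ)
    return b0, b1

# Illustrative (made-up) data consistent with the salary example above.
years = [0, 1, 2, 3, 4]
salary = [30_000, 32_000, 34_000, 36_000, 38_000]

b0, b1 = fit_line(years, salary)
print(b0, b1)  # 30000.0 2000.0

# Use the fitted line to predict: salary at 5 years of experience.
predicted = b0 + b1 * 5
print(predicted)  # 40000.0
```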