How do we draw conclusions about entire populations when we can only measure a small sample? This lecture answers that question — rigorously.
In practice, we cannot measure an entire population. Instead, we take a sample and compute a statistic (e.g., the sample mean \(\bar{X}\)) to estimate the parameter (e.g., population mean \(\mu\)).
A sampling distribution describes how a sample statistic varies across all possible samples of the same size.
The standard error measures how much the sample mean \(\bar{X}\) typically varies from sample to sample:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
where \(\sigma\) = population standard deviation, \(n\) = sample size.
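To see how the standard error shrinks as the sample grows, here is a minimal sketch. The function name `standard_error` and the value \(\sigma = 2\) are illustrative assumptions, not part of the lecture material.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


# Hypothetical population standard deviation, chosen only for illustration.
sigma = 2.0
for n in (10, 30, 50, 120):
    print(f"n = {n:3d}  SE = {standard_error(sigma, n):.4f}")
```

Because \(n\) sits under a square root, quadrupling the sample size only halves the standard error.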
Consider a tiny population — 4 components, with fault counts: A=5, B=3, C=6, D=2.
Population mean: \(\mu = \frac{5+3+6+2}{4} = 4\)
Take all samples of size \(n=2\). There are \(\binom{4}{2}=6\) possible samples:
| Sample | Values | \(\bar{x}\) |
|---|---|---|
| A, B | 5, 3 | 4.0 |
| A, C | 5, 6 | 5.5 |
| A, D | 5, 2 | 3.5 |
| B, C | 3, 6 | 4.5 |
| B, D | 3, 2 | 2.5 |
| C, D | 6, 2 | 4.0 |
| Average of all \(\bar{x}\) | | 4.0 |
Individual sample means scatter around \(\mu\), but they do not systematically over- or under-estimate it.
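The enumeration in the table can be reproduced in a few lines. This is a sketch using only the four fault counts given above; the variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

# Fault counts for the four components in the example above.
population = {"A": 5, "B": 3, "C": 6, "D": 2}
mu = mean(population.values())

# Every unordered sample of size 2, drawn without replacement.
sample_means = {
    pair: mean(population[name] for name in pair)
    for pair in combinations(population, 2)
}

for pair, x_bar in sample_means.items():
    print(", ".join(pair), x_bar)

mean_of_means = mean(sample_means.values())
print("population mean:     ", mu)
print("mean of sample means:", mean_of_means)
```

The last two printed values agree, which is what it means for \(\bar{X}\) to be an unbiased estimator of \(\mu\).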
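Rule of thumb for the Central Limit Theorem (CLT): \(n \geq 30\) is usually sufficient for \(\bar{X}\) to be approximately normal, whatever the shape of the population (more if the population is highly skewed).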
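The rule of thumb can be checked empirically. The following is a minimal simulation sketch, assuming an Exponential(1) population (a strongly right-skewed distribution with mean 1 and standard deviation 1). The choice of population, the seed, and the number of trials are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable


def sample_mean(n: int) -> float:
    """Mean of one sample of size n from an Exponential(1) population."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n


n, trials = 30, 5000
means = [sample_mean(n) for _ in range(trials)]

# Theory: the sample means centre on mu = 1 with spread sigma / sqrt(n).
se = 1.0 / sqrt(n)
print("mean of sample means:", round(mean(means), 3))
print("sd of sample means:  ", round(pstdev(means), 3), "theory:", round(se, 3))

# If the sample means are roughly normal, about 95% of them
# should lie within 1.96 standard errors of mu.
inside = sum(abs(m - 1.0) <= 1.96 * se for m in means) / trials
print("share within 1.96 SE:", round(inside, 3))
```

Rerunning with a smaller `n` (say 5) shows the share drifting further from 95%, because the skew of the population has not yet averaged out.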
Once we know \(\bar{X}\) is normally distributed, we standardise to use the Z-table:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
Note: If the population is already normal, this works for any sample size. The CLT is only needed for non-normal populations.
n = 20 is below the CLT threshold of 30. To proceed, we would need to assume the population is normal (not just approximately symmetric).
If the population is normal, the sample mean is still normally distributed for any n — CLT is not required in that case.
| Event | Z | Prob. |
|---|---|---|
| \(\bar{x} < 7.5\), n=30 | \(\frac{7.5-8}{2/\sqrt{30}}=-1.37\) | 0.0853 |
| \(\bar{x} < 7.5\), n=50 | \(\frac{7.5-8}{2/\sqrt{50}}=-1.77\) | 0.0384 |
| Individual \(X < 2\) | \(\frac{2-8}{2}=-3.0\) | 0.0013 |
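The three rows of the table can be recomputed with Python's standard library. This sketch takes \(\mu = 8\) and \(\sigma = 2\) from the Z column of the table; the helper name `z_for_mean` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 8.0, 2.0  # population mean and standard deviation used in the table
std_normal = NormalDist()  # standard normal, mean 0 and sd 1


def z_for_mean(x_bar: float, mu: float, sigma: float, n: int) -> float:
    """Standardise a sample mean using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / sqrt(n))


# Rows 1 and 2: probability that the sample mean falls below 7.5.
p_mean_below = {}
for n in (30, 50):
    z = z_for_mean(7.5, MU, SIGMA, n)
    p_mean_below[n] = std_normal.cdf(z)
    print(f"n = {n}: Z = {z:.2f}, P = {p_mean_below[n]:.4f}")

# Row 3: a single observation uses sigma itself, not the standard error.
z_single = (2 - MU) / SIGMA
p_single = std_normal.cdf(z_single)
print(f"individual: Z = {z_single:.2f}, P = {p_single:.4f}")
```

The printed probabilities can differ from the table in the fourth decimal place, because the table rounds Z to two decimals before the look-up while the code uses the unrounded value.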
Sometimes we care about a proportion rather than a mean — e.g., "What fraction of accounts are overdue?"
When \(n\pi \geq 5\) and \(n(1-\pi) \geq 5\), we can approximate with the normal distribution:
$$Z = \frac{\hat{p} - \pi}{\sqrt{\pi(1-\pi)/n}}$$
Check: \(100(0.4)=40\geq5\) ✓ and \(100(0.6)=60\geq5\) ✓
About 15% of samples of 100 will show over 45% usage.
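The 15% figure can be verified numerically. This sketch uses \(n = 100\) and \(\pi = 0.4\) from the check above and the 45% threshold from the conclusion; the function name is an illustrative assumption.

```python
from math import sqrt
from statistics import NormalDist


def prob_proportion_above(p_hat: float, pi: float, n: int) -> float:
    """P(sample proportion > p_hat) under the normal approximation."""
    if n * pi < 5 or n * (1 - pi) < 5:
        raise ValueError("normal approximation conditions are not met")
    se = sqrt(pi * (1 - pi) / n)  # standard error of the proportion
    z = (p_hat - pi) / se
    return 1 - NormalDist().cdf(z)


p_over_45 = prob_proportion_above(0.45, pi=0.40, n=100)
print(f"P(p_hat > 0.45) = {p_over_45:.4f}")
```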
95% of sample proportions fall within \(\pm1.96\) standard errors of \(\pi\):
$$\pi \pm 1.96\sqrt{\frac{\pi(1-\pi)}{n}}$$
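As a worked instance of the \(\pm1.96\) rule, the sketch below reuses \(\pi = 0.4\) and \(n = 100\) from the usage example. Applying the rule to those particular values is an assumption made for illustration.

```python
from math import sqrt


def central_95_interval(pi: float, n: int) -> tuple[float, float]:
    """Range that holds about 95% of sample proportions: pi +/- 1.96 SE."""
    se = sqrt(pi * (1 - pi) / n)
    return pi - 1.96 * se, pi + 1.96 * se


lower, upper = central_95_interval(0.40, 100)
print(f"95% of sample proportions fall between {lower:.3f} and {upper:.3f}")
```

A sample proportion outside this range would be unusual if the true proportion really were 0.4.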
The binomial distribution \(X \sim B(n, \pi)\) can be approximated by the normal when \(n\pi \geq 5\) and \(n(1-\pi) \geq 5\).
Then use: \(\mu = n\pi\) and \(\sigma = \sqrt{n\pi(1-\pi)}\)
And standardise as: \(Z = \dfrac{X_a - n\pi}{\sqrt{n\pi(1-\pi)}}\), where \(X_a\) is the value of \(X\) after the continuity correction.
When converting a discrete event to a continuous normal probability, shift by 0.5:
| Discrete (Binomial) | Continuous (Normal), using \(X_a\) | Reason |
|---|---|---|
| \(P(X \geq k)\) | \(P(X_a \geq k - 0.5)\) | Include the full bar at k |
| \(P(X > k)\) | \(P(X_a > k + 0.5)\) | Exclude the bar at k |
| \(P(X \leq k)\) | \(P(X_a \leq k + 0.5)\) | Include the full bar at k |
| \(P(X < k)\) | \(P(X_a < k - 0.5)\) | Exclude the bar at k |
| \(P(X = k)\) | \(P(k - 0.5 \leq X_a \leq k + 0.5)\) | Capture the whole bar width |
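The five rules in the table can be encoded directly and compared with the exact binomial probability. This is a sketch only: the function names are illustrative, and the values \(n = 20\), \(\pi = 1/3\), \(k = 8\) are hypothetical, chosen so that the approximation conditions hold.

```python
import operator
from math import comb, sqrt
from statistics import NormalDist

COMPARE = {">=": operator.ge, ">": operator.gt,
           "<=": operator.le, "<": operator.lt, "==": operator.eq}


def exact_binomial(n: int, pi: float, k: int, event: str) -> float:
    """Exact probability: sum the binomial bars that satisfy the event."""
    keep = COMPARE[event]
    return sum(comb(n, x) * pi**x * (1 - pi)**(n - x)
               for x in range(n + 1) if keep(x, k))


def normal_approx(n: int, pi: float, k: int, event: str) -> float:
    """Normal approximation with the 0.5 continuity correction."""
    dist = NormalDist(mu=n * pi, sigma=sqrt(n * pi * (1 - pi)))
    below = dist.cdf(k - 0.5)  # area left of the bar at k
    up_to = dist.cdf(k + 0.5)  # area up to and including the bar at k
    corrected = {">=": 1 - below, ">": 1 - up_to,
                 "<=": up_to, "<": below, "==": up_to - below}
    return corrected[event]


# Hypothetical values for illustration only.
n, pi, k = 20, 1 / 3, 8
for event in COMPARE:
    exact = exact_binomial(n, pi, k, event)
    approx = normal_approx(n, pi, k, event)
    print(f"P(X {event} {k}): exact = {exact:.4f}, normal = {approx:.4f}")
```

Dropping the 0.5 shift in `normal_approx` makes the two columns drift apart, which is a quick way to see why the correction matters for small \(n\).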
Check: \(n\pi = 6 \times \frac{1}{3} = 2 < 5\), so the normal approximation cannot be used.
Use binomial formula \(\binom{n}{x}\pi^x(1-\pi)^{n-x}\):
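$$\begin{aligned}
P(X=3) &= \binom{6}{3}\left(\tfrac{1}{3}\right)^3\left(\tfrac{2}{3}\right)^3 = 0.2195 \\
P(X=4) &= \binom{6}{4}\left(\tfrac{1}{3}\right)^4\left(\tfrac{2}{3}\right)^2 = 0.0823 \\
P(X=5) &= \binom{6}{5}\left(\tfrac{1}{3}\right)^5\left(\tfrac{2}{3}\right)^1 = 0.0165 \\
P(X=6) &= \binom{6}{6}\left(\tfrac{1}{3}\right)^6\left(\tfrac{2}{3}\right)^0 = 0.0014 \\
P(X \geq 3) &= 0.3196
\end{aligned}$$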
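The exact calculation is easy to check in code. This sketch uses \(n = 6\) and \(\pi = 1/3\) from the example; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(x: int, n: int, pi: float) -> float:
    """P(X = x) for X ~ B(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)


n, pi = 6, 1 / 3

# One term per value of x from 3 up to n, then add them up.
terms = {x: binomial_pmf(x, n, pi) for x in range(3, n + 1)}
for x, p in terms.items():
    print(f"P(X={x}) = {p:.4f}")

p_at_least_3 = sum(terms.values())
print(f"P(X>=3) = {p_at_least_3:.4f}")
```

Summing the unrounded terms gives 233/729, which rounds to the 0.3196 shown above.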
Check: \(n\pi = 20/3 = 6.67 \geq 5\) ✓
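For a Poisson distribution \(X \sim \text{Pois}(\lambda)\), when \(\lambda \geq 5\) we can approximate with a normal distribution with mean \(\lambda\) and standard deviation \(\sqrt{\lambda}\):
$$Z = \frac{X_a - \lambda}{\sqrt{\lambda}}$$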
Apply the same continuity correction rules as for the binomial.
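To illustrate the Poisson case, the sketch below compares the exact cumulative probability with the normal approximation, using mean \(\lambda\) and standard deviation \(\sqrt{\lambda}\). The values \(\lambda = 10\) and \(k = 12\) are hypothetical, and the function names are illustrative.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist


def poisson_cdf(k: int, lam: float) -> float:
    """Exact P(X <= k) for X ~ Pois(lam)."""
    return sum(exp(-lam) * lam**x / factorial(x) for x in range(k + 1))


def poisson_cdf_normal(k: int, lam: float) -> float:
    """Normal approximation to P(X <= k), with continuity correction."""
    if lam < 5:
        raise ValueError("normal approximation needs lambda >= 5")
    # P(X <= k) becomes P(X_a <= k + 0.5): include the full bar at k.
    return NormalDist(mu=lam, sigma=sqrt(lam)).cdf(k + 0.5)


# Hypothetical values for illustration only.
lam, k = 10.0, 12
exact = poisson_cdf(k, lam)
approx = poisson_cdf_normal(k, lam)
print(f"exact  P(X <= {k}) = {exact:.4f}")
print(f"normal P(X <= {k}) = {approx:.4f}")
```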
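So far we assumed the population is very large relative to our sample. In practice, if we sample a significant fraction of the population, our sample is more informative than the formula suggests. When sampling without replacement with \(n/N > 0.05\), where \(N\) is the population size, the standard error is multiplied by the finite population correction factor \(\sqrt{\frac{N-n}{N-1}}\).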
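The tiny four-component population from earlier makes a useful check here, since its sample of 2 is half the population. The sketch below applies the finite population formula from the summary table, \(\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}\), and compares it with the actual spread of the six sample means. The function name is an illustrative choice.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def standard_error_fpc(sigma: float, n: int, N: int) -> float:
    """Standard error of the mean with the finite population correction."""
    return (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))


population = [5, 3, 6, 2]  # fault counts for components A, B, C, D
N, n = len(population), 2
sigma = pstdev(population)  # population standard deviation

# Actual spread of the sample mean over all samples of size 2.
sample_means = [mean(s) for s in combinations(population, n)]
empirical_se = pstdev(sample_means)

uncorrected = sigma / sqrt(n)
corrected = standard_error_fpc(sigma, n, N)
print(f"uncorrected SE:       {uncorrected:.4f}")
print(f"corrected SE:         {corrected:.4f}")
print(f"sd of 6 sample means: {empirical_se:.4f}")
```

The corrected value matches the spread of the six sample means, while the uncorrected formula overstates it.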
| Situation | Formula | Condition to check |
|---|---|---|
| Sample mean, large / normal population | \(Z = \dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}\) | Population normal, or \(n \geq 30\) (CLT) |
| Sample mean, finite population | \(Z = \dfrac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}}\) | \(n/N > 0.05\) and sampling without replacement |
| Sample proportion | \(Z = \dfrac{\hat{p}-\pi}{\sqrt{\pi(1-\pi)/n}}\) | \(n\pi \geq 5\) and \(n(1-\pi) \geq 5\) |
| Binomial → Normal approx | \(Z = \dfrac{X_a - n\pi}{\sqrt{n\pi(1-\pi)}}\) | \(n\pi \geq 5\) and \(n(1-\pi) \geq 5\); apply continuity correction |
| Poisson → Normal approx | \(Z = \dfrac{X_a - \lambda}{\sqrt{\lambda}}\) | \(\lambda \geq 5\); apply continuity correction |