Week 8

Hypothesis Testing

Is the evidence strong enough to challenge what we believe?

Introductory Statistics for Accounting

Section 1

What Is Hypothesis Testing?

1.1 Starting with a Question

In business, we often need to answer yes-or-no questions using data:

Are bags of chips consistently lighter than the 50g printed on the packet?
Did students score at least 60% on their statistics exam on average?
Has a new training program actually improved employee performance?

Hypothesis testing is a structured, step-by-step method for using sample data to answer these kinds of questions about an entire population.

Last week, we used confidence intervals to estimate a range for a population parameter. This week, we use a different approach: we start with a specific claim and ask whether the data provides enough evidence to reject it.

1.2 Confidence Intervals vs. Hypothesis Tests

Both tools help us make decisions from sample data, but they ask different questions:

Feature	Confidence Interval	Hypothesis Test
Question asked	"What range of values is the population mean likely in?"	"Is the population mean different from a specific value?"
Output	A range (e.g., 47.2 to 52.8)	A decision: reject or do not reject
Expressed as	Confidence level (e.g., 95%)	Significance level (e.g., 0.05)
Relationship	A 95% confidence interval corresponds to a two-tailed test at $\alpha = 0.05$

They are two sides of the same coin: a 99% confidence interval corresponds to a two-tailed hypothesis test at the 0.01 significance level.

1.3 The Courtroom Analogy

Hypothesis testing works much like a courtroom trial:

In Court

The defendant is presumed innocent
The prosecution must present evidence
The jury decides: guilty or not guilty
"Not guilty" does not mean "innocent" — it means there wasn't enough evidence

In Statistics

We start by assuming the status quo (null hypothesis)
We collect sample data as evidence
We decide: reject or do not reject the null
"Do not reject" does not mean the null is true — it means we lacked sufficient evidence

We never say "accept the null hypothesis." We only say "do not reject." The absence of evidence is not evidence of absence.

1.4 The Null Hypothesis ($H_0$)

The null hypothesis ($H_0$) is the default position: the claim that nothing has changed, nothing is different, or the status quo holds. It always includes equality ($=$, $\leq$, or $\geq$).

Examples:

The mean weight of chip bags is 50 grams: $H_0: \mu = 50$
The mean exam score is at least 60%: $H_0: \mu \geq 60$
The proportion of defective items is no more than 3%: $H_0: p \leq 0.03$

Think of $H_0$ as the "nothing to see here" hypothesis. It represents the manufacturer's claim, the company's target, or the existing benchmark.

1.5 The Alternative Hypothesis ($H_1$)

The alternative hypothesis ($H_1$) is what we are trying to find evidence for. It represents a change, a difference, or an effect. It contradicts $H_0$ and never contains equality ($<$, $>$, or $\neq$).

Examples:

The mean weight is less than 50 grams: $H_1: \mu < 50$
The mean exam score is less than 60%: $H_1: \mu < 60$
The proportion of defective items is greater than 3%: $H_1: p > 0.03$

The researcher's question or suspicion is always placed in $H_1$. The burden of proof is on the alternative — just as the burden of proof is on the prosecution in court.

1.6 One-Tailed vs. Two-Tailed Tests

The direction of $H_1$ determines whether we use a one-tailed or two-tailed test:

Type	$H_1$ Form	When to Use	Example
Left-tailed	$\mu < \text{value}$	Suspecting a decrease	Are bags under 50g?
Right-tailed	$\mu > \text{value}$	Suspecting an increase	Is defect rate above 3%?
Two-tailed	$\mu \neq \text{value}$	Suspecting any difference	Is the mean different from 100?

Rule of thumb: If you only care about one direction (less than or greater than), use a one-tailed test. If you care about any difference in either direction, use a two-tailed test.

1.7 Visualising the Tails

The shaded red regions are the rejection regions. If our test statistic falls in a red region, we reject $H_0$.
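
As a rough illustration of where those rejection regions begin, the sketch below computes the critical values for the three test types at $\alpha = 0.05$ under a standard normal distribution. It relies on Python's standard-library `statistics.NormalDist`; the function name `critical_values` and the tail labels are illustrative choices, not part of the course material.

```python
from statistics import NormalDist

def critical_values(alpha, tail):
    """Return the z critical value(s) that bound the rejection region."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "left":
        return (z.inv_cdf(alpha),)        # reject when Z falls below this value
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)    # reject when Z falls above this value
    if tail == "two":
        upper = z.inv_cdf(1 - alpha / 2)  # alpha is split across both tails
        return (-upper, upper)
    raise ValueError("tail must be 'left', 'right' or 'two'")

for tail in ("left", "right", "two"):
    print(tail, [round(c, 3) for c in critical_values(0.05, tail)])
# left [-1.645]
# right [1.645]
# two [-1.96, 1.96]
```

A two-tailed test pushes each boundary further out because only $\alpha/2$ sits in each tail.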

1.8 The Significance Level ($\alpha$)

The significance level ($\alpha$) is the probability of rejecting $H_0$ when it is actually true. It is the risk of making a wrong decision that we are willing to accept.

Common choices:

$\alpha$	Confidence	Meaning	Typical Use
0.10	90%	10% chance of a wrong rejection	Exploratory research
0.05	95%	5% chance of a wrong rejection	Most business & social science
0.01	99%	1% chance of a wrong rejection	Medical, safety-critical

You choose $\alpha$ before looking at the data. Choosing it after is like moving the goalposts — and is an ethical violation we'll discuss later.
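
One way to see what $\alpha$ means is to simulate many samples from a population where $H_0$ really is true and count how often the test rejects anyway. The sketch below does this for a left-tailed Z-test. The population values (mean 50, standard deviation 3.5), the sample size, and the function name are assumptions made up for the illustration.

```python
import random
from statistics import NormalDist, mean

def false_rejection_rate(alpha, n=30, trials=4000, seed=1):
    """Estimate how often a left-tailed Z-test rejects a TRUE null hypothesis."""
    rng = random.Random(seed)
    mu0, sigma = 50.0, 3.5              # hypothetical population; H0 is true here
    crit = NormalDist().inv_cdf(alpha)  # left-tail critical value
    se = sigma / n ** 0.5               # standard error of the sample mean
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / se
        if z < crit:                    # a wrong rejection: H0 is actually true
            rejections += 1
    return rejections / trials

print(false_rejection_rate(0.05))  # should land close to 0.05
```

The long-run share of wrong rejections settles near the chosen $\alpha$, which is why it has to be fixed before the data are examined.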

Knowledge Check — Section 1

Q1: A manufacturer claims their light bulbs last at least 1,000 hours. You suspect they last less. What is $H_1$?

Your suspicion is that bulbs last less than claimed, so $H_1: \mu < 1000$. This is a left-tailed test.

Q2: If we "do not reject $H_0$," what does that mean?

Just like a "not guilty" verdict in court, it means the evidence was not strong enough — not that the defendant (null hypothesis) is innocent (true).

Section 2

The Five-Step Hypothesis Testing Process

2.1 The Five Steps — Overview

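Every hypothesis test follows the same five steps, regardless of what you are testing: (1) state the hypotheses, (2) choose the significance level, (3) compute the test statistic, (4) make the decision, and (5) state the conclusion. Let's walk through each one.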

2.2 Step 1 — State the Hypotheses

Step 1

Write down $H_0$ and $H_1$ based on the business question.

Example: Chip Bag Weights

Students suspect bags of chips weigh less than the 50g label. The manufacturer claims they weigh 50g.

$H_0: \mu = 50$ (the bags weigh what the label says)

$H_1: \mu < 50$ (the bags are underweight)

This is a left-tailed test because we suspect the mean is less than the claimed value.

Always identify your hypotheses before collecting data. The research question determines the direction of $H_1$.

2.3 Step 2 — Choose the Significance Level

Step 2

Select your significance level $\alpha$ based on how much risk of a wrong rejection you can tolerate.

Chip Bag Example (continued)

We choose $\alpha = 0.05$. This means we are willing to accept a 5% chance of wrongly concluding the bags are underweight when they actually are not.

Higher stakes require lower $\alpha$. Testing a new drug? Use 0.01. Checking if a marketing campaign made a difference? 0.05 is often fine.

2.4 Step 3 — Compute the Test Statistic

Step 3

The test statistic measures how far our sample result is from what $H_0$ claims, in standardised units.

Z-test statistic (when population $\sigma$ is known or $n \geq 30$):

$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$$

t-test statistic (when population $\sigma$ is unknown and $n < 30$):

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$$

Where $\bar{X}$ is the sample mean, $\mu_0$ is the claimed value in $H_0$, $\sigma$ (or $s$) is the standard deviation, and $n$ is the sample size.

The test statistic is simply: "How many standard errors is my sample mean from the hypothesised value?"
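
The "how many standard errors" idea can be sketched in a few lines of Python. The function name `compute_test_statistic` and the input numbers are hypothetical, picked only so the arithmetic is easy to follow. The same calculation serves for Z or t; only the source of the standard deviation ($\sigma$ or $s$) differs.

```python
import math

def compute_test_statistic(sample_mean, mu0, sd, n):
    """Distance between the sample mean and mu0, measured in standard errors."""
    standard_error = sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error

# Hypothetical numbers chosen for easy arithmetic (not from the course examples):
# sample mean 49, claimed mean 50, standard deviation 4, sample size 64.
print(compute_test_statistic(49, 50, 4, 64))  # -2.0
```

Here the standard error is $4/\sqrt{64} = 0.5$, so a sample mean one unit below the claim sits two standard errors to the left.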

2.5 Step 4 — Make the Decision (Critical Value Approach)

Step 4a

Compare the test statistic to a critical value from the Z or t table.

Left-tailed test: if Z_stat falls in the red rejection region (left of the critical value), we reject $H_0$.

Decision rule: If the test statistic falls in the rejection region, reject $H_0$. Otherwise, do not reject.
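
The decision rule can be written as a small function. This is a minimal sketch: the name `reject_h0` and the tail labels are illustrative, and treating a statistic exactly on the boundary as "do not reject" is a convention assumed here rather than something the slides specify.

```python
def reject_h0(stat, critical, tail):
    """Apply the critical-value decision rule; True means 'reject H0'."""
    if tail == "left":
        return stat < critical            # rejection region lies to the left
    if tail == "right":
        return stat > critical            # rejection region lies to the right
    if tail == "two":
        return abs(stat) > abs(critical)  # rejection regions in both tails
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(reject_h0(-2.0, -1.645, "left"))   # True
print(reject_h0(1.2, 1.645, "right"))    # False
print(reject_h0(-2.3, 1.96, "two"))      # True
```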

2.6 Step 4 — The p-Value Approach

Step 4b

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming $H_0$ is true.

Think of it as: "If the null hypothesis were true, how surprising is our sample result?"

If...	Then...	Because...
$p\text{-value} \leq \alpha$	Reject $H_0$	The result is too surprising to be due to chance alone
$p\text{-value} > \alpha$	Do not reject $H_0$	The result is not surprising enough to overturn $H_0$

Small p-value = strong evidence against $H_0$. A p-value of 0.02 means there is only a 2% chance of seeing data at least this extreme if $H_0$ were true.
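
For a Z statistic, the p-value is simply a tail area under the standard normal curve. The sketch below computes it for each test type with the standard-library `statistics.NormalDist`; the function name `z_p_value` and the example statistic of $-2.0$ are assumptions made for illustration.

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """Tail probability of a Z statistic, assuming H0 is true."""
    phi = NormalDist().cdf                # standard normal cumulative probability
    if tail == "left":
        return phi(z)                     # area to the left of z
    if tail == "right":
        return 1 - phi(z)                 # area to the right of z
    if tail == "two":
        return 2 * (1 - phi(abs(z)))      # both tails beyond |z|
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(round(z_p_value(-2.0, "left"), 4))  # 0.0228
print(round(z_p_value(-2.0, "two"), 4))   # 0.0455
```

The two-tailed p-value is double the one-tailed value, so the same statistic counts as weaker evidence when either direction would have been of interest.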

2.7 Understanding the p-Value Visually

When the p-value is smaller than $\alpha$, the observed test statistic lies in the rejection region.

2.8 Step 5 — State the Conclusion

Step 5

Translate the statistical decision back into plain English, in the context of the original business question.

Good conclusion:

"At the 5% significance level, there is sufficient evidence to conclude that the mean weight of chip bags is less than 50 grams."

Common mistakes to avoid:

Saying "the null hypothesis is true" (we can never prove it true)
Saying "we accept the null hypothesis" (we only fail to reject it)
Omitting the significance level from your conclusion
Stating the conclusion without linking it to the real-world context

Knowledge Check — Section 2

Q1: You compute a p-value of 0.03 and your significance level is 0.05. What do you do?

Since $p\text{-value} = 0.03 < \alpha = 0.05$, we reject $H_0$. The result is statistically significant.

Q2: Which step must come BEFORE looking at the data?

The significance level ($\alpha$) must be set before collecting or analysing data. Choosing it after seeing the results is a form of "p-hacking" and is unethical.

Section 3

Decision Errors: Type I and Type II

3.1 Two Ways to Be Wrong

No matter how careful we are, there is always a chance of making the wrong decision:

	$H_0$ is actually TRUE	$H_0$ is actually FALSE
Reject $H_0$	Type I Error (False alarm)	Correct decision (Power)
Do Not Reject $H_0$	Correct decision	Type II Error (Missed finding)

Type I Error (α): Rejecting a true null hypothesis — a "false positive."
Type II Error (β): Failing to reject a false null hypothesis — a "false negative."

3.2 Errors in the Courtroom

Type I Error

Convicting an innocent person.

In business: Concluding that chip bags are underweight when they actually are not. The manufacturer unfairly gets blamed.

Probability = $\alpha$ (the significance level you chose)

Type II Error

Letting a guilty person go free.

In business: Failing to detect that chip bags are underweight when they really are. Consumers keep getting short-changed.

Probability = $\beta$ (harder to control)

There is a trade-off: making $\alpha$ smaller (harder to convict) makes $\beta$ larger (more guilty people go free), and vice versa. The only way to reduce both is to increase the sample size.
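
The trade-off can be made concrete by computing $\beta$ for a left-tailed Z-test under an assumed true mean. Everything numeric here is hypothetical: a claimed mean of 50g, a true mean of 49g, and a known $\sigma$ of 3.5g. The function name `type_ii_error` is likewise only illustrative.

```python
from statistics import NormalDist

def type_ii_error(alpha, mu0, true_mu, sigma, n):
    """Beta for a left-tailed Z-test when the true mean is true_mu (below mu0)."""
    z = NormalDist()
    se = sigma / n ** 0.5
    cutoff = mu0 + z.inv_cdf(alpha) * se  # reject H0 when the sample mean is below this
    # Beta = chance the sample mean stays above the cutoff even though H0 is false.
    return 1 - z.cdf((cutoff - true_mu) / se)

base = type_ii_error(0.05, 50, 49, 3.5, 36)
stricter_alpha = type_ii_error(0.01, 50, 49, 3.5, 36)
larger_sample = type_ii_error(0.05, 50, 49, 3.5, 100)

print(f"alpha=0.05, n=36:  beta = {base:.3f}")
print(f"alpha=0.01, n=36:  beta = {stricter_alpha:.3f}")   # larger than the base case
print(f"alpha=0.05, n=100: beta = {larger_sample:.3f}")    # smaller than the base case
```

Under these assumptions, tightening $\alpha$ raises $\beta$, while a larger sample lowers $\beta$ without loosening $\alpha$.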

3.3 Which Error Is Worse?

It depends on the context. Consider these scenarios:

Scenario A: Drug Safety Testing

Type I: Approving a harmful drug (patients get hurt). Very bad. This assumes $H_0$ is "the drug is not safe," so rejecting it means approval.

Type II: Failing to approve a safe, effective drug (patients miss out).

We set $\alpha$ very low (e.g., 0.01) to protect patients.

Scenario B: Marketing Campaign

Type I: Concluding a campaign worked when it didn't (wasted future budget).

Type II: Missing a campaign that actually worked (missed opportunity).

We might tolerate $\alpha = 0.10$ because the stakes are lower.

As an accountant or auditor, consider: what is the cost of each type of error for your client? This determines how strict your test should be.

Knowledge Check — Section 3

Q1: An auditor concludes that a company's accounts contain material misstatement when in fact they are correct. What type of error is this?

The auditor rejected a true "null" (accounts are correct), which is a Type I error — a false alarm.

Q2: How can we reduce BOTH Type I and Type II errors simultaneously?

A larger sample gives us more information, making our test more precise and reducing both types of errors. Setting α = 0 would mean never rejecting, which doesn't solve the problem.

Section 4

Hypothesis Test for the Mean

4.1 Z-Test vs. t-Test — When to Use Which

Both test whether a sample mean differs from a hypothesised value, but the choice depends on what you know:

Condition	Use	Distribution
Population $\sigma$ is known	Z-test	Standard normal (Z)
Population $\sigma$ is unknown and $n < 30$	t-test	Student's t with $df = n - 1$
Population $\sigma$ is unknown but $n \geq 30$	Either (Z or t)	t is safer; Z is acceptable by CLT

In practice, the population standard deviation is almost never known, so the t-test is far more common. The Z-test is mainly used in textbook examples and when working with very large samples.

4.2 Worked Example: Chip Bag Weights (Setup)

Scenario

A group of students buys 36 bags of chips and weighs each one. The label claims 50 grams. They suspect the bags are underweight.

Sample results: $\bar{X} = 48.5\text{g}$, $s = 3.5\text{g}$, $n = 36$

Step 1: $H_0: \mu = 50$ versus $H_1: \mu < 50$ (left-tailed)

Step 2: $\alpha = 0.05$

Step 3: Since $n = 36 \geq 30$ and $\sigma$ is unknown, we use the t-test (or Z approximation):

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} = \frac{48.5 - 50}{3.5 / \sqrt{36}} = \frac{-1.5}{0.5833} = -2.571$$

4.3 Worked Example: Chip Bag Weights (Decision)

Step 4 (Critical Value): For a left-tailed test at $\alpha = 0.05$ with $df = 35$:

The critical value from the t-table is approximately $t_{0.05, 35} = -1.690$.

Our test statistic $t = -2.571$ is more extreme (further left) than $-1.690$, so it falls in the rejection region and we reject $H_0$.

Step 5 (Conclusion):

"At the 5% significance level, there is sufficient evidence to conclude that the mean weight of chip bags is less than 50 grams. The bags appear to be underweight."
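
The chip bag calculation can be reproduced in a short script. This is a sketch only: the critical value is typed in from the t-table figure quoted above rather than computed, and the variable names are illustrative.

```python
import math

x_bar, mu0, s, n = 48.5, 50.0, 3.5, 36  # sample results from the scenario
t_critical = -1.690                      # t-table value for alpha = 0.05, df = 35

standard_error = s / math.sqrt(n)
t_stat = (x_bar - mu0) / standard_error
reject = t_stat < t_critical             # left-tailed decision rule

print(f"SE = {standard_error:.4f}, t = {t_stat:.3f}, reject H0: {reject}")
# SE = 0.5833, t = -2.571, reject H0: True
```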

4.4 Worked Example: Student Exam Marks (Setup)

Scenario

A university department claims the average mark on the statistics exam is at least 60%. A lecturer suspects it may be lower and samples 25 students.

Sample results: $\bar{X} = 57.2\%$, $s = 8.4\%$, $n = 25$

Step 1: $H_0: \mu \geq 60$ versus $H_1: \mu < 60$ (left-tailed)

Step 2: $\alpha = 0.05$

Step 3: Since $n = 25 < 30$ and $\sigma$ unknown, we must use the t-test:

$$t = \frac{57.2 - 60}{8.4 / \sqrt{25}} = \frac{-2.8}{1.68} = -1.667$$

4.5 Worked Example: Student Exam Marks (Decision)

Step 4: Critical value for left-tailed test, $\alpha = 0.05$, $df = 24$:

$t_{0.05, 24} = -1.711$

Our test statistic $t = -1.667$ is not more extreme than $-1.711$. It does not fall in the rejection region, so we do not reject $H_0$.

Step 5 (Conclusion):

"At the 5% significance level, there is insufficient evidence to conclude that the mean exam mark is less than 60%. We cannot confirm the lecturer's suspicion."

Note: This does NOT prove the mean is 60% — only that we lack enough evidence to say it's lower.
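
The p-value approach leads to the same decision for this example. Python's standard library has no Student's t distribution, so the sketch below approximates the left-tail area by numerically integrating the t density with Simpson's rule. That numerical method and the function names are assumptions made for this illustration, not the method used in the slides; statistical software would normally supply this value directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def left_tail_p_value(t_stat, df, steps=2000):
    """Approximate P(T <= t_stat) using Simpson's rule on [0, |t_stat|]."""
    b = abs(t_stat)
    h = b / steps                          # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3
    # By symmetry, exactly half of the total area lies to the left of zero.
    return 0.5 - area_0_to_b if t_stat <= 0 else 0.5 + area_0_to_b

t_stat = (57.2 - 60) / (8.4 / math.sqrt(25))
p_value = left_tail_p_value(t_stat, df=24)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}, reject H0: {p_value <= 0.05}")
```

The p-value comes out slightly above 0.05, matching the critical-value decision not to reject $H_0$.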

4.6 Two-Tailed Test Example (Setup)

Scenario

An accounting firm's quality manual states that the average time to complete a standard audit is 40 hours. The manager wants to check if the actual mean has changed (either direction). A random sample of 50 recent audits is taken.

Sample results: $\bar{X} = 42.3\text{ hours}$, $s = 6.1\text{ hours}$, $n = 50$

Step 1: $H_0: \mu = 40$ versus $H_1: \mu \neq 40$ (two-tailed)

Step 2: $\alpha = 0.05$

Step 3:

$$t = \frac{42.3 - 40}{6.1 / \sqrt{50}} = \frac{2.3}{0.8627} = 2.666$$

4.7 Two-Tailed Test Example (Decision)

Step 4: For a two-tailed test at $\alpha = 0.05$, we split the significance between both tails: $\alpha/2 = 0.025$ each.

Critical values with $df = 49$: $\pm t_{0.025, 49} \approx \pm 2.010$

Our $t = 2.666$ exceeds the upper critical value of $+2.010$, so it falls in the rejection region and we reject $H_0$.

Step 5:

"At the 5% significance level, there is sufficient evidence to conclude that the mean audit completion time has changed from the standard 40 hours. It appears to have increased."
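
Section 1.2 noted that a 95% confidence interval corresponds to a two-tailed test at $\alpha = 0.05$. The sketch below checks this on the audit-time data: the test rejects $H_0$ exactly when 40 lies outside the interval. The critical value is taken from the t-table figure quoted above, and the variable names are illustrative.

```python
import math

x_bar, mu0, s, n = 42.3, 40.0, 6.1, 50  # audit-time sample from the scenario
t_critical = 2.010                       # t-table value for alpha/2 = 0.025, df = 49

standard_error = s / math.sqrt(n)
margin = t_critical * standard_error
lower, upper = x_bar - margin, x_bar + margin

t_stat = (x_bar - mu0) / standard_error
reject_by_test = abs(t_stat) > t_critical
reject_by_interval = not (lower <= mu0 <= upper)

print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # 95% CI: (40.57, 44.03)
print(reject_by_test, reject_by_interval)     # True True
```

Because the whole interval sits above 40, it also shows the direction of the change, which the test decision alone does not.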

Knowledge Check — Section 4

Q1: You are testing $H_0: \mu = 100$ vs $H_1: \mu \neq 100$ at $\alpha = 0.05$. Your test statistic is $t = 1.85$ and the critical values are $\pm 2.010$. What do you conclude?

Since $1.85$ lies between $-2.010$ and $+2.010$, it does not fall in either rejection region. We do not reject $H_0$.

Q2: When should you use a t-test instead of a Z-test?

The t-test is used when $\sigma$ is unknown and we estimate it with the sample standard deviation $s$. The choice between Z and t has nothing to do with the number of tails.

Section 5

Hypothesis Test for a Proportion

5.1 Testing a Proportion

Sometimes we are not testing a mean but a proportion — a percentage or fraction.

The Z-test statistic for a proportion is: $$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$ where $\hat{p}$ is the sample proportion, $p_0$ is the hypothesised proportion, and $n$ is the sample size.

Conditions for the proportion test: the sample must be large enough so that both $np_0 \geq 5$ and $n(1-p_0) \geq 5$. This ensures the normal approximation is valid.

5.2 Worked Example: Invoice Error Rate (Setup)

Scenario

An accounting firm claims that no more than 2% of invoices processed contain errors. An internal auditor suspects the error rate has increased and reviews a random sample of 500 invoices, finding 18 with errors.

Step 1: $H_0: p \leq 0.02$ versus $H_1: p > 0.02$ (right-tailed)

Step 2: $\alpha = 0.05$

Step 3: Sample proportion: $\hat{p} = 18/500 = 0.036$

Check conditions: $np_0 = 500 \times 0.02 = 10 \geq 5$ and $n(1-p_0) = 490 \geq 5$. Conditions met.

$$Z = \frac{0.036 - 0.02}{\sqrt{\frac{0.02 \times 0.98}{500}}} = \frac{0.016}{0.00626} = 2.555$$

5.3 Worked Example: Invoice Error Rate (Decision)

Step 4: For a right-tailed test at $\alpha = 0.05$, the critical value is $Z_{0.05} = 1.645$.

Our $Z = 2.555 > 1.645$, so the test statistic falls in the rejection region.

p-Value approach: $P(Z > 2.555) \approx 0.0053$. Since $0.0053 < 0.05$, we reject $H_0$.

Step 5 (Conclusion):

"At the 5% significance level, there is sufficient evidence to conclude that the invoice error rate exceeds 2%. The auditor's concern is supported by the data — management should investigate the invoicing process."

Notice how the conclusion connects the statistical result to a business action. Always do this.
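
The invoice example can be checked with a short function that bundles the condition check, the test statistic, and the p-value. This is a minimal sketch for a right-tailed test only; the function name `proportion_z_test` is hypothetical, and the normal tail area comes from the standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Right-tailed one-sample Z-test for a proportion."""
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("sample too small for the normal approximation")
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # built from p0, not p_hat
    z = (p_hat - p0) / standard_error
    p_value = 1 - NormalDist().cdf(z)              # area to the right of z
    return p_hat, z, p_value

p_hat, z, p_value = proportion_z_test(18, 500, 0.02)
print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
# p_hat = 0.036, z = 2.56, p-value = 0.0053
```

At full precision the statistic is about 2.5555, which agrees with the worked figure of 2.555 up to rounding of the intermediate standard error.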

Knowledge Check — Section 5

Q1: An e-commerce company claims that at least 85% of orders arrive on time. You survey 200 orders and find 160 arrived on time. What is the sample proportion?

$\hat{p} = 160 / 200 = 0.80$. Remember, the sample proportion is always a fraction between 0 and 1, not a count.

Q2: For a proportion test, why do we use the hypothesised proportion $p_0$ (not $\hat{p}$) in the denominator of the Z formula?

The entire test asks: "If $H_0$ were true, how likely is our sample?" So we calculate the standard error under $H_0$'s assumed value $p_0$.

Section 6

Ethics in Hypothesis Testing

6.1 Ethical Pitfalls

Hypothesis testing is a powerful tool, but it can be misused — intentionally or accidentally:

p-Hacking (Data Dredging): Running many tests on the same data and only reporting the ones that give "significant" results. If you test 20 hypotheses at $\alpha = 0.05$ and every null hypothesis is actually true, you should still expect about one "significant" result by pure chance.

Choosing $\alpha$ after seeing results: If your p-value is 0.04 and you then set $\alpha = 0.05$, you've rigged the game. The significance level must be chosen before the analysis.

Confusing statistical significance with practical significance: A result can be "statistically significant" but meaningless in practice. If chip bags are 0.1g underweight on average, that's statistically detectable with a large enough sample but practically irrelevant.
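
The arithmetic behind the p-hacking warning is short enough to check directly. The sketch below assumes all 20 null hypotheses are true and that the tests are independent of one another; both are simplifying assumptions added for the illustration.

```python
alpha, tests = 0.05, 20

# Average number of "significant" results when every H0 is true.
expected_false_positives = alpha * tests

# Chance that at least one test rejects by luck (assumes independent tests).
prob_at_least_one = 1 - (1 - alpha) ** tests

print(expected_false_positives)     # 1.0
print(round(prob_at_least_one, 3))  # 0.642
```

Under these assumptions there is roughly a 64% chance of at least one false alarm, which is why reporting only the "significant" test is misleading.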

6.2 Responsible Use of Hypothesis Tests

As future accountants and business professionals, keep these principles in mind:

Pre-register your hypothesis and $\alpha$ before collecting data
Report all tests, not just the significant ones
Consider the effect size — is the difference large enough to matter in practice?
Use appropriate sample sizes — too small risks Type II errors; too large detects trivial differences
Be transparent about assumptions (normality, independence, sample selection)

A single hypothesis test is one piece of evidence, not proof. Good decision-making combines statistical results with professional judgment, domain knowledge, and ethical responsibility.

7.1 Week 8 Summary

Concept	Key Takeaway
$H_0$ and $H_1$	Start with the status quo; the research question goes in $H_1$
One-tailed vs Two-tailed	Use one-tailed when direction matters; two-tailed when any difference matters
Significance level ($\alpha$)	Set before testing; represents your tolerance for Type I error
Test statistic	Measures how far the sample result is from $H_0$ in standard units
p-Value	Small p-value = strong evidence against $H_0$
Type I / Type II errors	Trade-off between false alarms and missed findings
Testing means	Use Z when $\sigma$ known; t when unknown
Testing proportions	Use Z with $p_0$ in the standard error formula
Ethics	Don't p-hack, pre-register, report honestly
