Week 11 Excel Activity — Simple Linear Regression

Progress:

Click a box when you finish that part.

Background

Apex Retail Solutions is an Australian retail analytics firm. The marketing team wants to know whether advertising expenditure can reliably predict monthly sales revenue. You are the analyst. Using data from 5 store locations, you will build, interpret, and evaluate a simple linear regression model — entirely in Excel.

Dataset — Enter this into Excel first

Open a blank Excel workbook. Enter the data below starting from cell A1.

Cell	A — Store	B — Ad Spend ($000)	C — Monthly Sales ($000)
Row 1	Store	Ad_Spend	Monthly_Sales
Row 2	1	1	10
Row 3	2	2	14
Row 4	3	3	16
Row 5	4	4	22
Row 6	5	5	23

Tip: Label column headers clearly. Ad_Spend goes in column B, Monthly_Sales in column C. Your data runs from rows 1–6 (row 1 = header, rows 2–6 = data).

Create a Scatter Plot — Visualise the Relationship

Select the range B1:C6 (Ad Spend and Monthly Sales, including headers).

Go to Insert → Charts → Scatter and choose "Scatter (X,Y)" — the plain dots with no lines.

Add axis titles: right-click the chart → Add Chart Element → Axis Titles.

Horizontal axis: Advertising Spend ($000)
Vertical axis: Monthly Sales ($000)

Add a linear trendline: right-click any data point → Add Trendline → select Linear → tick "Display Equation on chart" and "Display R-squared value on chart".

Question 1.1

Looking at the scatter plot, does the relationship between advertising spend and monthly sales appear linear? Describe what you see.

Yes — the points follow an upward-sloping straight-line pattern. As advertising spend increases, monthly sales also increase, and the relationship appears approximately linear with no obvious curve.

Question 1.2

What equation does Excel display on the trendline? Write it here.

Excel should display y = 3.4x + 6.8, or very close to this. The R² value shown should be approximately 0.963.

Calculate Coefficients Using Excel Functions

Excel has built-in functions to directly calculate the regression slope and intercept. Use these to verify your trendline equation.

In cell E1, type the label: Slope (b1)

In cell F1, enter the formula:

Formula in F1

=SLOPE(C2:C6, B2:B6)

Syntax: =SLOPE(known_ys, known_xs)

In cell E2, type the label: Intercept (b0)

In cell F2, enter:

Formula in F2

=INTERCEPT(C2:C6, B2:B6)

Tip: In both functions, the y values (Sales) always come first, then the x values (Ad Spend). Getting this backwards is the most common mistake.

Question 2.1

What values does Excel return for the slope and intercept?

Slope (b₁) = 3.4 | Intercept (b₀) = 6.8. These match the trendline equation from Part 1.

Question 2.2

Write the full regression equation and interpret the slope in one sentence using the actual variable names (advertising spend, monthly sales).

Equation: ŷ = 6.8 + 3.4x. Interpretation: For every additional $1,000 spent on advertising, monthly sales are predicted to increase by $3,400.

Calculate Predicted Values and Residuals

Residuals tell us how far each actual observation is from the regression line.

In cell D1, type the header: Predicted_Sales

In cell D2, enter a formula using your slope (F1) and intercept (F2):

Formula in D2 (use absolute references for coefficients)

=$F$2 + $F$1 * B2

Then copy this formula down to D3:D6 by dragging the fill handle.

In cell E3, type the header: Residual (in cell E1 if free, else use column F or G).

Calculate the residual for each store as: Actual − Predicted

Formula in residual column, row 2

=C2 - D2

Copy down to all 5 rows.

Question 3.1

Complete the table. What are the predicted sales and residuals for each store?

Store 1: Predicted = 10.2, Residual = −0.2
Store 2: Predicted = 13.6, Residual = +0.4
Store 3: Predicted = 17.0, Residual = −1.0
Store 4: Predicted = 20.4, Residual = +1.6
Store 5: Predicted = 23.8, Residual = −0.8

Question 3.2

What is the sum of all residuals? Use =SUM(...) to check. What do you notice, and why does this make mathematical sense?

Sum ≈ 0 (may show a tiny rounding error like 0.0000001). The OLS method minimises the sum of squared residuals, and a mathematical property of this is that the positive and negative residuals always cancel out exactly — the line passes through the centroid (x̄, ȳ).

Calculate and Interpret R²

In a new area (e.g., cells E5:F8), build the following summary table:

Cell	Label	Formula
E5	SST	=DEVSQ(C2:C6)
E6	SSE	=SUMXMY2(C2:C6,D2:D6)
E7	SSR	=E5-E6
E8	R²	=E7/E5

DEVSQ calculates Σ(y − ȳ)² directly. SUMXMY2 calculates Σ(x − y)² for two ranges — here it gives us Σ(actual − predicted)² = SSE.

Alternatively, use Excel's built-in R² function to double-check:

=RSQ(C2:C6, B2:B6)

Syntax: =RSQ(known_ys, known_xs)

Question 4.1

What are the values of SST, SSE, SSR, and R²? Confirm that SST = SSR + SSE.

SST = 120 | SSE = 4.40 | SSR = 115.60 | R² = 0.9633. Confirm: 115.60 + 4.40 = 120 ✓

Question 4.2 — Business Interpretation

You are presenting results to the CFO. Write a one-sentence business interpretation of R² = 0.963.

Approximately 96.3% of the variation in monthly sales is explained by advertising expenditure, indicating that the linear regression model provides a very strong fit to the data.

Residual Analysis — Create a Residual Plot

A residual plot checks whether our linear model is appropriate.

Select the Ad Spend values (B2:B6) and the Residuals column together (hold Ctrl to select non-adjacent columns). Insert a Scatter chart.

Add a horizontal reference line at zero: right-click the chart → Add Chart Element → Lines → No lines. Alternatively, draw a manual horizontal line using Insert → Shapes.

The zero line represents "perfect prediction." Points above it have positive residuals; points below have negative residuals.

Label axes: x-axis = "Advertising Spend ($000)", y-axis = "Residual ($000)"

Question 5.1

Describe what you see in the residual plot. Is there a pattern, or do the residuals appear randomly scattered around zero?

The residuals appear roughly randomly scattered around zero with no clear curved or fan-shaped pattern. This supports the assumption of linearity and equal variance. With only 5 data points it is difficult to be definitive, but there are no major red flags.

Question 5.2

Which store has the largest residual (in absolute value)? What does this tell us about that store's actual sales compared to the model's prediction?

Store 4 has the largest residual in absolute value: e₄ = +1.6. This means its actual sales ($22,000) were $1,600 higher than what the model predicted ($20,400). Store 4 is outperforming what its advertising spend alone would suggest — there may be other factors (e.g., location, staff quality, promotions) contributing to its stronger sales.

Use the Model for Prediction & Ethical Reflection

In cell H1, type: New Ad Spend. In I1: Predicted Sales.

In H2, enter a value (e.g., 3.5 for $3,500 spend).

In I2, enter:

=$F$2 + $F$1 * H2

Question 6.1

What does the model predict for monthly sales if a store spends $3,500 on advertising?

ŷ = 6.8 + 3.4(3.5) = 6.8 + 11.9 = $18,700. Since $3,500 falls within our observed data range (1–5), this interpolation is reasonable.

Question 6.2 — Extrapolation Warning

Now try entering x = 20 ($20,000 ad spend). What does the model predict? Is it safe to use this prediction? Explain.

The model predicts ŷ = 6.8 + 3.4(20) = $74,800. This is not safe to use — it is far outside the range of observed data (1–5). The linear relationship may not hold at such extreme values. Extrapolation can produce wildly misleading forecasts. A larger dataset covering higher spend values would be needed before making this claim.

Question 6.3 — Ethical Reflection

Management is considering using this model to set advertising budgets for all 50 stores across Australia. List two ethical concerns a prudent analyst should raise before this decision is made.

Possible concerns include (any two):
1. Small sample size: The model is based on only 5 stores. Applying it to 50 stores assumes the relationship holds universally, which may not be true.
2. Omitted variables: Store location, demographics, competitor presence, and seasonality all affect sales. Relying solely on ad spend ignores these factors.
3. Causation vs correlation: High R² does not prove that advertising causes higher sales — correlation only.
4. Extrapolation risk: Stores with very different spending levels than the training data will get unreliable predictions.

Key Excel Functions Used This Week

Function	Purpose	Syntax
=SLOPE()	Calculates regression slope b₁	=SLOPE(known_ys, known_xs)
=INTERCEPT()	Calculates regression intercept b₀	=INTERCEPT(known_ys, known_xs)
=RSQ()	Calculates R²	=RSQ(known_ys, known_xs)
=DEVSQ()	Calculates SST = Σ(y − ȳ)²	=DEVSQ(range)
=SUMXMY2()	Calculates SSE = Σ(actual − predicted)²	=SUMXMY2(range1, range2)
=FORECAST.LINEAR()	Predicts ŷ for a new x value	=FORECAST.LINEAR(x, known_ys, known_xs)

Simple Linear Regression in Excel

Background

Dataset — Enter this into Excel first

Create a Scatter Plot — Visualise the Relationship

Calculate Coefficients Using Excel Functions

Calculate Predicted Values and Residuals

Calculate and Interpret R²

Residual Analysis — Create a Residual Plot

Use the Model for Prediction & Ethical Reflection

Key Excel Functions Used This Week