DATA4400

Data-Driven Forecasting

Lesson 9: Regression Models

Understanding relationships between variables to predict future outcomes

Where We Are in the Course

Weeks 1-4: Foundations (moving averages, stationarity, correlation)

Week 6: Prophet (automated business forecasting)

Weeks 7-8: ARIMA & VAR (time series patterns)

Week 9: Regression Models ← We are here

Using known relationships to forecast

Weeks 10-11: Integration and model selection

Learning Outcomes

By the end of this lesson, you will be able to:

1. Understand Simple Linear Regression

Predict outcomes using one predictor variable (e.g., sales from temperature)

2. Analyse Multiple Regression Models

Use multiple factors simultaneously (e.g., revenue from marketing, users, satisfaction)

3. Evaluate Model Quality Using Residuals

Determine if your model is making accurate predictions

4. Apply Dummy Variables for Categorical Effects

Account for events like weekends, holidays, or promotions

5. Incorporate Lag Terms for Time Dependencies

Use yesterday's values to predict today's outcomes

What is Regression Analysis?

Definition: A statistical method that examines relationships between variables to predict outcomes and understand dependencies in data.

Key Goals

Predict future outcomes
Understand how factors impact your target variable
Quantify relationships (e.g., "$1 marketing spend = $3.50 revenue")
Test hypotheses about business drivers

Business Applications

Sales forecasting
Demand planning
Price optimization
Marketing effectiveness
Customer behavior prediction

Why Regression for Forecasting?

When you have known predictor variables (temperature, marketing spend, day of week), regression leverages these relationships to make more informed forecasts than methods relying solely on historical patterns.

Understanding Data Types

Data Type	Description	Example
Time Series	Observations of one subject over time at regular intervals	Daily stock prices for Tesla (2020-2025)
Cross-Sectional	Observations of different subjects at the same point in time	Salaries of 100 employees on 1 Jan 2024
Panel Data	Observations of multiple subjects over time	Monthly sales for 50 retail stores (2020-2025)
Multivariate Time Series	Multiple variables tracked simultaneously over time	Daily stock price AND trading volume for one company

For This Lesson: We focus on time series regression – using predictors to forecast outcomes over time.

Cross-Sectional vs. Longitudinal Data

Cross-Sectional Research

Data from different sources collected at the same time

📸

A snapshot in time

Example: Survey 1,000 customers on their satisfaction levels today

Longitudinal Research (Time Series)

Data collected from the same sources over a period of time

🎬

A movie over time

Example: Track the same 1,000 customers' satisfaction monthly for 2 years

Simple Linear Regression

Core Concept: Model the relationship between one predictor variable (X) and an outcome variable (Y)

General Formula

Y_t = β₀ + β₁X_t + ε_t

Y_t = Variable you want to forecast (dependent variable)
X_t = Predictor variable (independent variable)
β₀ = Intercept (baseline value when X = 0)
β₁ = Slope (change in Y for each unit change in X)
ε_t = Error term (random variation)

Simplest Version - Time Trend Model:

Y_t = β₀ + β₁t + ε_t

Where t is simply the time period (1, 2, 3, ...)
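
The coefficients β₀ and β₁ are usually estimated by ordinary least squares. The sketch below assumes the standard closed-form formulas; the function name fit_line and the five data points are made up for illustration and are not part of the lesson data.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = co-variation of x and y divided by the variation of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Illustrative time-trend data (hypothetical): periods 1..5
t = [1, 2, 3, 4, 5]
y = [12, 15, 19, 20, 24]

b0, b1 = fit_line(t, y)
print(f"Y_t = {b0:.2f} + {b1:.2f}t")
```

Passing the time index t as the predictor gives the time trend model; passing any other predictor (e.g., temperature) gives the general simple regression.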

Quick Check: Understanding Regression

A coffee shop wants to forecast daily sales. They notice sales increase on colder days. Which is the appropriate setup?

A) Y = Temperature, X = Sales

B) Y = Sales, X = Temperature

C) Y = Time, X = Sales

D) Y = Sales, X = Time and Temperature (multiple regression)

Simple Regression: Visual Intuition

Each point represents one observation. The line represents our regression model.

The Line Shows:

The general trend
Predicted values
The relationship strength

The Scatter Shows:

Actual observations
How much variation exists
Outliers or unusual patterns

Worked Example: Cricket Ground Attendance

Scenario: A large cricket ground tracks annual attendance (thousands) from 2003-2010.

Can we forecast future attendance using a time trend?

Year After 2000	Attendance (1000s/year)
3 (2003)	4,050
4 (2004)	3,650
5 (2005)	4,380
6 (2006)	4,320
7 (2007)	5,820
8 (2008)	6,150
9 (2009)	5,550
10 (2010)	6,580

Regression Result: Ŷ_t = 2,430.0 + 405.0t

This means attendance increases by approximately 405,000 people per year
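
A sketch of refitting the trend line to the eight tabulated points and extending it one year ahead. The variable names and the 2011 extrapolation are additions for illustration, not part of the original example.

```python
# Data from the table: t = years after 2000, attendance in thousands
t = [3, 4, 5, 6, 7, 8, 9, 10]
attendance = [4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580]

n = len(t)
t_mean = sum(t) / n
a_mean = sum(attendance) / n

# Ordinary least squares for a single predictor
s_xy = sum((ti - t_mean) * (ai - a_mean) for ti, ai in zip(t, attendance))
s_xx = sum((ti - t_mean) ** 2 for ti in t)
slope = s_xy / s_xx
intercept = a_mean - slope * t_mean

# Extrapolate one year beyond the data (t = 11 corresponds to 2011)
forecast_2011 = intercept + slope * 11

print(f"Trend line: attendance = {intercept:.1f} + {slope:.1f}t")
print(f"Forecast for 2011: {forecast_2011:.0f} thousand")
```

Because 2011 lies outside the observed years, the forecast depends on the assumption that the trend continues unchanged.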

Running Regression in Excel

Step 1: Enable Analysis ToolPak

File → Options → Add-ins → Manage: Excel Add-ins → Go → tick Analysis ToolPak → OK

Step 2: Organize Your Data

Column A: Independent variable (X) - e.g., Year, Temperature
Column B: Dependent variable (Y) - e.g., Sales, Attendance
Include headers in Row 1

Step 3: Run Regression

Data → Data Analysis → Regression

Input Y Range: Select your dependent variable (including header)
Input X Range: Select your independent variable (including header)
Check "Labels" box
Select output location
Click OK

Alternative Method: Add a trendline to a scatter plot and check "Display Equation on chart" to see the regression formula visually.

What Are Residuals?

Residual Formula

e_t = y_t − ŷ_t

Residual = Actual Value − Predicted Value

Residuals tell us how far off our predictions are from reality. They are the "leftovers" after the model makes its best guess.

Good Residuals Indicate:

Model captures the pattern well
Predictions are reliable
Small, random errors

Bad Residuals Indicate:

Model missing key patterns
Systematic prediction errors
Need for model improvement

What to Look For:

✓ Mean close to zero
✓ Randomly scattered (no pattern)
✓ Constant variance across fitted values
✓ Preferably normally distributed
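
Two of these checks (mean close to zero, most points within ±2 standard deviations) can be computed directly. The sketch below uses made-up actual and predicted values; it does not replace a residual plot, which is still needed to spot patterns and changing variance.

```python
import statistics

# Hypothetical actual and predicted values for six periods
actual = [102, 98, 110, 95, 105, 100]
predicted = [100, 100, 107, 97, 104, 102]

# Residual = actual value minus predicted value
residuals = [a - p for a, p in zip(actual, predicted)]

mean_resid = statistics.mean(residuals)
sd_resid = statistics.stdev(residuals)

# Share of residuals lying within two standard deviations of the mean
within = sum(abs(e - mean_resid) <= 2 * sd_resid for e in residuals)
share_within_2sd = within / len(residuals)

print("Residuals:", residuals)
print(f"Mean residual: {mean_resid:.2f}")
print(f"Share within ±2 SD: {share_within_2sd:.0%}")
```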

Good Residual Patterns

These residual plots indicate a well-fitting model:

✓ What Makes This Good:

Randomly scattered around zero
No clear pattern or trend
Consistent spread (variance) across all predicted values
Most points within ±2 standard deviations

Problematic Residual Patterns

These patterns suggest model issues:

Pattern 1: Curved Residuals

Problem: Non-linear relationship not captured

Solution: Try polynomial regression or transformation

Pattern 2: Increasing Spread

Problem: Heteroscedasticity (non-constant variance)

Solution: Transform Y variable (e.g., log)

⚠ Warning Signs: Outliers, systematic patterns, or unequal variance suggest your model needs refinement before using for forecasts.

Quick Check: Residuals

Your regression model predicts sales of 350 units, but actual sales were 320 units. What is the residual?

A) 30

B) -30

C) 670

D) Cannot be determined

Evaluating Model Quality: Key Metrics

Metric	What It Measures	Good Values
R² (R-squared)	Proportion of variance explained by the model	Close to 1 (or 100%); R² > 0.7 typically indicates good fit
F-statistic	Overall model significance	Large value; p-value < 0.05
Coefficient p-values	Statistical significance of each predictor	p < 0.05 (predictor is significant)
Standard Error	Average size of residuals (prediction error)	Smaller is better; depends on your data scale
Residual Plots	Visual check for model assumptions	Random scatter; no patterns

R² Interpretation

R² = Σ(ŷ_t − ȳ)² / Σ(y_t − ȳ)²

Example: R² = 0.85 means "85% of variation in Y is explained by our model"
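
A sketch of the R² formula applied to the cricket attendance data from earlier in this lesson. The variable names are illustrative. For a least-squares line with an intercept, the ratio above equals 1 minus the residual share of total variation, so the code computes both as a cross-check.

```python
# Cricket ground data from this lesson (attendance in thousands)
t = [3, 4, 5, 6, 7, 8, 9, 10]
y = [4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580]

n = len(y)
t_mean, y_mean = sum(t) / n, sum(y) / n

# Fit the time trend by ordinary least squares
slope = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
         / sum((ti - t_mean) ** 2 for ti in t))
intercept = y_mean - slope * t_mean
fitted = [intercept + slope * ti for ti in t]

ss_explained = sum((fi - y_mean) ** 2 for fi in fitted)          # Σ(ŷ_t − ȳ)²
ss_total = sum((yi - y_mean) ** 2 for yi in y)                   # Σ(y_t − ȳ)²
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # Σ(y_t − ŷ_t)²

r_squared = ss_explained / ss_total
r_squared_alt = 1 - ss_residual / ss_total

print(f"R² = {r_squared:.3f}")
```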

Business Example: Ice Cream Sales

Scenario: An ice cream shop wants to forecast daily sales based on temperature.

The Model

Sales_t = β₀ + β₁(Temp_t) + ε_t

Regression Results:

Ŷ = 145.23 + 5.31 × Temperature

R² = 0.89

p-value < 0.001

Business Interpretation

β₀ = 145.23: Expected sales when temperature = 0°C (baseline demand)
β₁ = 5.31: For each 1°C increase, sales increase by $5.31
R² = 0.89: Temperature explains 89% of sales variation

Forecast for 30°C:

Sales = 145.23 + 5.31(30) = $304.53

Limits of Prediction: Uncertainty

Every prediction has uncertainty. We quantify this using prediction intervals.

95% Prediction Interval

Forecast ± 2 × σ

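Where σ = standard deviation of the residuals (reported as "Standard Error" in the regression output). The multiplier 2 is a rounded version of 1.96, so the interval is approximate.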

Example: Ice Cream Shop

Point forecast for 30°C: $304.53

Standard Error: σ = $25

95% Prediction Interval: $304.53 ± 2(25) = $254.53 to $354.53

We are 95% confident actual sales will fall within this range.

68% Interval:

Forecast ± 1σ

Narrower, less confident

95% Interval:

Forecast ± 2σ

Wider, more confident
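
The interval arithmetic can be sketched as a small function. It uses the ice cream coefficients and σ = $25 quoted above; the function name and the z parameter (number of standard deviations) are illustrative.

```python
# Values quoted in the ice cream example
INTERCEPT = 145.23
SLOPE = 5.31
SIGMA = 25.0  # standard deviation of residuals, in dollars


def forecast_with_interval(temp_c, z=2.0):
    """Return (point forecast, lower bound, upper bound) for a temperature."""
    point = INTERCEPT + SLOPE * temp_c
    return point, point - z * SIGMA, point + z * SIGMA


point, low95, high95 = forecast_with_interval(30)          # roughly 95%
_, low68, high68 = forecast_with_interval(30, z=1.0)       # roughly 68%

print(f"Point forecast: ${point:.2f}")
print(f"95% interval: ${low95:.2f} to ${high95:.2f}")
print(f"68% interval: ${low68:.2f} to ${high68:.2f}")
```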

Quick Check: Model Evaluation

You build a regression model with R² = 0.45. What does this mean?

A) The model is 45% accurate

B) 45% of predictions are correct

C) The model explains 45% of variation in the outcome variable

D) There is a 45% correlation between variables

Types of Regression Curves

Different business scenarios require different curve shapes:

Linear

Y_t = a + bt

Constant absolute growth (the same amount added each period)

Example: Subscription revenue with steady monthly growth

Quadratic

Y_t = a + bt + ct²

Accelerating or decelerating growth

Example: Early-stage product adoption, where growth accelerates over time

Exponential

Y_t = ae^(bt)

Percentage-based growth

Example: Viral social media growth

Logistic

Y_t = U/(1+ae^(-kt))

Growth toward an upper limit

Example: Market saturation (smartphone adoption)

Caution: All forecasting involves extrapolation – assuming current trends continue. Always consider if external changes might alter the pattern.
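
To compare the four shapes side by side, the sketch below evaluates each formula over a few time periods. All parameter values are hypothetical and chosen only to make the differences visible.

```python
import math


def linear(t, a, b):
    return a + b * t


def quadratic(t, a, b, c):
    return a + b * t + c * t ** 2


def exponential(t, a, b):
    return a * math.exp(b * t)


def logistic(t, U, a, k):
    # U is the upper limit the curve approaches as t grows
    return U / (1 + a * math.exp(-k * t))


print(" t   linear  quadratic  exponential  logistic")
for t in range(0, 21, 4):
    print(f"{t:2d}  {linear(t, 10, 3):7.1f}  {quadratic(t, 10, 3, 0.5):9.1f}"
          f"  {exponential(t, 10, 0.15):11.1f}  {logistic(t, 100, 9, 0.5):8.1f}")
```

In the output, the linear column rises by the same amount each step, the exponential column rises by the same percentage, and the logistic column flattens as it nears its upper limit of 100.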

Multiple Linear Regression

Key Advancement: Use multiple predictors simultaneously to improve forecast accuracy

General Formula

Y_t = β₀ + β₁X_(1,t) + β₂X_(2,t) + ... + β_nX_(n,t) + ε_t

Each predictor (X₁, X₂, ..., X_n) has its own coefficient (β₁, β₂, ..., β_n)

Benefits

Captures complex relationships
Controls for confounding factors
Typically higher R² than simple regression
More realistic business models

Challenges

Risk of multicollinearity (correlated predictors)
Requires more data
More complex interpretation
Overfitting with too many predictors

Example: SaaS Subscription Revenue

Business Question: What drives monthly subscription revenue for a SaaS company?

The Model

Revenue_t = β₀ + β₁(Marketing_t) + β₂(Active_Users_t) + β₃(Satisfaction_t) + ε_t

Variable	Coefficient	p-value	Business Meaning
Intercept (β₀)	52.3	0.001	Baseline revenue when all predictors equal zero
Marketing Spend (β₁)	2.84	0.003	Each $1,000 marketing → +$2,840 revenue
Active Users (β₂)	20.15	<0.001	Each 1,000 users → +$20,150 revenue
Satisfaction Score (β₃)	8.67	0.012	Each point (1-10) → +$8,670 revenue

Model Performance: R² = 0.92, F-statistic p < 0.001

This model explains 92% of revenue variation, a very strong fit to the historical data.
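
A sketch of turning the coefficient table into a forecast. It assumes the units implied by the "Business Meaning" column (revenue and marketing in $1,000s, active users in thousands, satisfaction on a 1-10 scale); the input values are hypothetical.

```python
# Coefficients from the table above
B0 = 52.3            # intercept
B_MARKETING = 2.84   # per $1,000 of marketing spend
B_USERS = 20.15      # per 1,000 active users
B_SATISFACTION = 8.67  # per satisfaction point


def predict_revenue(marketing_k, users_k, satisfaction):
    """Predicted monthly revenue in $1,000s (units assumed as stated above)."""
    return (B0
            + B_MARKETING * marketing_k
            + B_USERS * users_k
            + B_SATISFACTION * satisfaction)


# Hypothetical month: $10,000 marketing, 5,000 active users, satisfaction 8
revenue_k = predict_revenue(marketing_k=10, users_k=5, satisfaction=8)
print(f"Predicted revenue: ${revenue_k * 1000:,.0f}")
```

Each term is one predictor's contribution, so the breakdown also shows which factor drives most of the forecast.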

Quick Check: Multiple Regression

In the SaaS example, Active Users had coefficient β₂ = 20.15 with p < 0.001. What does p < 0.001 tell us?

A) Active Users causes 99.9% of revenue

B) The coefficient might be wrong

C) Active Users is a statistically significant predictor

D) We can only be 0.1% confident in this predictor

Dummy Variables for Categorical Effects

Purpose: Include categorical (yes/no) factors like weekends, holidays, promotions, or seasons in regression models

How It Works

Create a binary variable:

1 = Event occurs (e.g., weekend)
0 = Event doesn't occur (weekday)

Day	Weekend?
Monday	0
Tuesday	0
Saturday	1
Sunday	1

Business Examples

Promotional Periods: Measure impact on sales
Holiday Seasons: Seasonal effects on demand
Product Launches: Step-change in customer interest
Day of Week: Weekend vs. weekday patterns
Store Type: Flagship vs. regional locations

Example Formula

Sales_t = β₀ + β₁(Temperature_t) + β₂(Weekend_t) + ε_t

β₂ captures the additional sales on weekends, controlling for temperature

How Dummy Variables Shift Predictions

Weekday Equation:

Sales = β₀ + β₁(Temp) + β₂(0)

= β₀ + β₁(Temp)

Weekend Equation:

Sales = β₀ + β₁(Temp) + β₂(1)

= (β₀ + β₂) + β₁(Temp)

Higher intercept!

The dummy variable creates a parallel shift – same slope, different baseline.
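
A sketch of building the weekend dummy from calendar dates and checking the parallel shift. The coefficient values are hypothetical placeholders, not estimates from the lesson.

```python
from datetime import date


def weekend_dummy(d):
    """Return 1 for Saturday/Sunday, 0 otherwise."""
    # date.weekday(): Monday = 0, ..., Saturday = 5, Sunday = 6
    return 1 if d.weekday() >= 5 else 0


# Hypothetical coefficients: baseline, temperature slope, weekend shift
B0, B1, B2 = 100.0, 5.0, 40.0


def predict_sales(temp_c, d):
    return B0 + B1 * temp_c + B2 * weekend_dummy(d)


friday = date(2024, 1, 5)
saturday = date(2024, 1, 6)

for temp in (15, 25):
    weekday_sales = predict_sales(temp, friday)
    weekend_sales = predict_sales(temp, saturday)
    print(f"{temp}°C: weekday {weekday_sales:.0f}, weekend {weekend_sales:.0f}, "
          f"gap {weekend_sales - weekday_sales:.0f}")
```

The gap between the weekend and weekday predictions is the same at every temperature, which is the parallel shift: same slope, different baseline.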

Worked Example: Hobby Store Revenue

Scenario: A hobby store wants to understand how quarterly revenue is affected by time trend AND whether it's a holiday quarter (Q4).

Revenue_t = β₀ + β₁(Quarter_t) + β₂(Holiday_Quarter_t) + ε_t

Quarter	Revenue ($1000s)	Holiday Quarter
1	245	0
2	268	0
3	285	0
4	412	1
5	301	0
6	318	0

Regression Results:

Ŷ = 220.5 + 15.3(Quarter) + 95.7(Holiday_Quarter)

β₁ = 15.3: Revenue increases $15,300/quarter on average
β₂ = 95.7: Q4 holidays add an extra $95,700 in revenue
R² = 0.88: Model explains 88% of variation
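
A sketch of using the fitted equation above to forecast the next two quarters. It assumes quarter numbering starts at a Q1, so every fourth quarter is a holiday quarter (consistent with quarter 4 in the table); the function names are illustrative.

```python
# Coefficients from the regression results above (revenue in $1,000s)
B0, B_TREND, B_HOLIDAY = 220.5, 15.3, 95.7


def is_holiday_quarter(quarter):
    # Assumption: quarter 1 is a Q1, so quarters 4, 8, 12, ... are Q4
    return 1 if quarter % 4 == 0 else 0


def predict_revenue(quarter):
    return B0 + B_TREND * quarter + B_HOLIDAY * is_holiday_quarter(quarter)


for q in (7, 8):
    print(f"Quarter {q}: ${predict_revenue(q):.1f}k "
          f"(holiday quarter: {is_holiday_quarter(q)})")
```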

Lag Terms: Using Past Values

Core Idea: Today's value often depends on yesterday's value (autocorrelation)

Lag-1 Model

Y_t = β₀ + β₁Y_(t-1) + ε_t

Use yesterday's value, Y_(t-1), to predict today's value, Y_t

When to Use Lags

Stock prices (momentum)
Sales (persistent demand)
Website traffic (retention)
Inventory levels
Any slowly-changing process

Can Combine With Other Predictors

Sales_t = β₀ + β₁Sales_(t-1) + β₂(Weekend_t) + ε_t

Captures both persistence AND weekend effects

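Important: Lag models are best suited to short-horizon forecasting. To forecast multiple days ahead, you need either the actual Y value at each intermediate time step or recursive forecasting, where each forecast is fed back in as the next lagged input; forecast errors then compound as the horizon grows.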
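
A sketch of recursive forecasting with a lag-1 model: each forecast becomes the lagged input for the next step. The coefficients and starting value are hypothetical.

```python
# Hypothetical lag-1 model: Y_t = 50 + 0.8 * Y_(t-1)
B0, B1 = 50.0, 0.8


def recursive_forecast(last_value, steps):
    """Forecast several steps ahead by feeding each forecast back in."""
    forecasts = []
    previous = last_value
    for _ in range(steps):
        next_value = B0 + B1 * previous
        forecasts.append(next_value)
        previous = next_value  # the forecast replaces the unknown actual value
    return forecasts


path = recursive_forecast(last_value=200.0, steps=3)
for step, value in enumerate(path, start=1):
    print(f"{step} step(s) ahead: {value:.1f}")
```

Only the first step uses an observed value. Later steps are built on earlier forecasts, which is why uncertainty widens with the horizon.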

Quick Check: Lag Terms

A coffee shop builds this model: Sales_t = 150 + 0.6 × Sales_(t-1)

If yesterday's sales were $300, what is today's forecast?

A) $150

B) $180

C) $330

D) $450

Putting It All Together

Real business forecasting often combines multiple components: trend, external predictors, categorical effects, and past values

Comprehensive Model Example

Revenue_t = β₀ + β₁(Time_t) + β₂(Marketing_t) + β₃(Competitor_Price_t) + β₄(Weekend_t) + β₅(Holiday_t) + β₆(Revenue_(t-1)) + ε_t

Component	What It Captures	Example Business Insight
Time (t)	Overall growth trend	"We're growing 2% per month"
Marketing_t	Impact of advertising spend	"Each $1,000 marketing → $3,200 revenue"
Competitor_Price_t	Market competition effects	"When competitors raise prices, we gain sales"
Weekend_t	Day-of-week pattern	"Weekends generate 15% more sales"
Holiday_t	Special event effects	"Black Friday adds $50,000"
Revenue_(t-1)	Persistence/momentum	"High sales yesterday → likely high sales today"

Comparing Forecasting Methods

Method	Best For	Strengths	Limitations
Moving Average	Stable, short-term data	Simple, smooths noise	Can't forecast far ahead, lags trends
Exponential Smoothing	Data with trends	Adaptive, handles trends	Limited for complex patterns
ARIMA	Complex time patterns	Captures autocorrelation, trends, seasonality	Requires stationarity, complex setup
Prophet	Business data with events	Handles holidays automatically, robust	Black box, less control
Regression	Known predictor variables	Interpretable, flexible, explanatory	Needs predictor forecasts, assumes linearity

Decision Rule: Use regression when you have reliable predictor variables and want to understand why the forecast is what it is, not just what it will be.

Regression Forecasting Workflow

Step 1: Define Business Question

What are you trying to predict? What factors might influence it?

Step 2: Collect and Prepare Data

Ensure equal time intervals
Check for missing values
Identify outliers (are they plausible?)
Create dummy variables for categorical factors

Step 3: Build Initial Model

Start simple (one or two predictors), then add complexity

Step 4: Evaluate Model Quality

Check R² (> 0.7 preferred)
Verify p-values (< 0.05 for significance)
Examine residual plots (random scatter?)
Calculate prediction intervals

Step 5: Refine and Forecast

Remove non-significant predictors
Add relevant variables if R² is low
Consider transformations if residuals show patterns
Generate forecasts with prediction intervals

Step 6: Monitor and Update

Regularly check actual vs. predicted; retrain model with new data

Common Mistakes to Avoid

⚠ Correlation ≠ Causation

A strong regression relationship doesn't prove that one variable causes the other. Ice cream sales and drowning deaths are correlated because both rise in summer, but ice cream doesn't cause drowning.

⚠ Overfitting

Adding too many predictors can create a model that fits historical data perfectly but forecasts poorly. Aim for parsimony – use the fewest predictors that explain the most variation.

⚠ Extrapolation Beyond Data Range

If your temperature data ranges from 10°C to 35°C, don't trust predictions for -5°C or 45°C. The relationship may not hold outside observed ranges.

⚠ Ignoring Multicollinearity

If two predictors are highly correlated with each other (e.g., daily temperature and hours of sunshine), their coefficients become unstable and hard to interpret. Check the correlation matrix of your predictors before building models.
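
A sketch of a pairwise correlation check between predictors before fitting. The predictor names and values are made up, and the 0.8 cut-off is a common rule of thumb rather than a fixed standard.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    s_xx = sum((a - x_mean) ** 2 for a in x)
    s_yy = sum((b - y_mean) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5


# Hypothetical candidate predictors for five days
predictors = {
    "temperature": [18, 22, 25, 30, 33],
    "sunshine_hours": [4, 6, 7, 9, 11],
    "promo_spend": [5, 1, 4, 2, 3],
}

THRESHOLD = 0.8  # rule-of-thumb cut-off (assumption)
flagged = []
for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
    r = pearson(a, b)
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
    if abs(r) > THRESHOLD:
        flagged.append((name_a, name_b))

print("Highly correlated pairs:", flagged)
```

A flagged pair suggests keeping only one of the two predictors or combining them into a single variable.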

⚠ Assuming Linearity

Not all relationships are linear. Check scatter plots – if you see curves, consider polynomial or exponential models.

Key Takeaways

1. Regression Leverages Relationships

Unlike pure time series methods, regression uses known predictors to explain and forecast outcomes.

2. Multiple Components Improve Accuracy

Combine trends, external variables, dummy variables, and lags for comprehensive models.

3. Always Validate Your Model

Check R², p-values, and residual plots. A high R² alone doesn't guarantee good forecasts.

4. Interpretability is Valuable

Regression coefficients tell you how much each factor matters – crucial for business decision-making.

5. Choose the Right Tool

Regression excels when you have predictor data. For pure time patterns without external variables, ARIMA or Prophet may be better.

Remember: All models are approximations. The goal isn't perfection – it's making informed decisions under uncertainty.

Next Steps in Your Learning Journey

Week 10: Building End-to-End Forecasting Solutions

Integrate multiple methods, create model pipelines, and measure business impact

Week 11: Adaptive Forecasting

Match methods to specific business problems, handle data quality issues, and communicate results

Week 12: Final Assessment

Demonstrate your mastery of forecasting techniques in a comprehensive business scenario

Practice Exercises Available on Moodle:

Excel regression with real business datasets
Model diagnostics interpretation
Dummy variable creation and application
Comparative forecasting challenges

Questions?

Consult with your instructor or use the discussion forums