DATA4400

Data-Driven Forecasting

Lesson 9: Regression Models

Understanding relationships between variables to predict future outcomes

Where We Are in the Course

Weeks 1-4: Foundations (moving averages, stationarity, correlation)
Week 6: Prophet (automated business forecasting)
Weeks 7-8: ARIMA & VAR (time series patterns)
Week 9: Regression Models ← We are here

Using known relationships to forecast

Weeks 10-11: Integration and model selection

Learning Outcomes

By the end of this lesson, you will be able to:

1. Understand Simple Linear Regression

Predict outcomes using one predictor variable (e.g., sales from temperature)

2. Analyse Multiple Regression Models

Use multiple factors simultaneously (e.g., revenue from marketing, users, satisfaction)

3. Evaluate Model Quality Using Residuals

Determine if your model is making accurate predictions

4. Apply Dummy Variables for Categorical Effects

Account for events like weekends, holidays, or promotions

5. Incorporate Lag Terms for Time Dependencies

Use yesterday's values to predict today's outcomes

What is Regression Analysis?

Definition: A statistical method that examines relationships between variables to predict outcomes and understand dependencies in data.

Key Goals

  • Predict future outcomes
  • Understand how factors impact your target variable
  • Quantify relationships (e.g., "$1 marketing spend = $3.50 revenue")
  • Test hypotheses about business drivers

Business Applications

  • Sales forecasting
  • Demand planning
  • Price optimization
  • Marketing effectiveness
  • Customer behavior prediction
Why Regression for Forecasting?

When you have known predictor variables (temperature, marketing spend, day of week), regression leverages these relationships to make more informed forecasts than methods relying solely on historical patterns.

Understanding Data Types

Data Type | Description | Example
Time Series | Observations of one subject over time at regular intervals | Daily stock prices for Tesla (2020-2025)
Cross-Sectional | Observations of different subjects at the same point in time | Salaries of 100 employees on 1 Jan 2024
Panel Data | Observations of multiple subjects over time | Monthly sales for 50 retail stores (2020-2025)
Multivariate Time Series | Multiple variables tracked simultaneously over time | Daily stock price AND trading volume for one company

For This Lesson: We focus on time series regression – using predictors to forecast outcomes over time.

Cross-Sectional vs. Longitudinal Data

Cross-Sectional Research

Data from different sources collected at the same time

📸

A snapshot in time

Example: Survey 1,000 customers on their satisfaction levels today

Longitudinal Research (Time Series)

Data collected from the same sources over a period of time

🎬

A movie over time

Example: Track the same 1,000 customers' satisfaction monthly for 2 years

Simple Linear Regression

Core Concept: Model the relationship between one predictor variable (X) and an outcome variable (Y)

General Formula

Yₜ = β₀ + β₁Xₜ + εₜ

Yₜ = Variable you want to forecast (dependent variable)
Xₜ = Predictor variable (independent variable)
β₀ = Intercept (baseline value when X = 0)
β₁ = Slope (change in Y for each unit change in X)
εₜ = Error term (random variation)

Simplest Version - Time Trend Model:
Yₜ = β₀ + β₁t + εₜ

Where t is simply the time period (1, 2, 3, ...)

Quick Check: Understanding Regression

A coffee shop wants to forecast daily sales. They notice sales increase on colder days. Which is the appropriate setup?
A) Y = Temperature, X = Sales
B) Y = Sales, X = Temperature
C) Y = Time, X = Sales
D) Y = Sales, X = Time and Temperature (multiple regression)

Simple Regression: Visual Intuition

Each point represents one observation. The line represents our regression model.

The Line Shows:
  • The general trend
  • Predicted values
  • The relationship strength
The Scatter Shows:
  • Actual observations
  • How much variation exists
  • Outliers or unusual patterns

Worked Example: Cricket Ground Attendance

Scenario: A large cricket ground tracks annual attendance (thousands) from 2003-2010.

Can we forecast future attendance using a time trend?

Year After 2000 (t) | Attendance (1000s/year)
3 (2003) | 4,050
4 (2004) | 3,650
5 (2005) | 4,380
6 (2006) | 4,320
7 (2007) | 5,820
8 (2008) | 6,150
9 (2009) | 5,550
10 (2010) | 6,580

Regression Result: Ŷₜ = 2,430.0 + 405.0t (least squares fit to the eight rows above, with t = years after 2000)

This means attendance increases by approximately 405,000 people per year.
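
The trend fit can be checked by hand. The following is a minimal sketch, assuming only the eight tabulated observations and the textbook least squares formulas; the helper name `fit_line` is illustrative and not part of the course materials.

```python
# Attendance data from the table above:
# t = years after 2000, attendance in thousands per year.
t = [3, 4, 5, 6, 7, 8, 9, 10]
attendance = [4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580]


def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(t, attendance)
forecast_2011 = b0 + b1 * 11  # t = 11 corresponds to 2011

print(f"Intercept: {b0:.1f}")
print(f"Slope: {b1:.1f} thousand per year")
print(f"Forecast for 2011: {forecast_2011:.1f} thousand")
```

The 2011 value is an extrapolation one year beyond the data, so it relies on the trend continuing unchanged.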

Running Regression in Excel

Step 1: Enable Analysis ToolPak

File → Options → Add-ins → Manage: Excel Add-ins → Go → tick Analysis ToolPak → OK

Step 2: Organize Your Data
  • Column A: Independent variable (X) - e.g., Year, Temperature
  • Column B: Dependent variable (Y) - e.g., Sales, Attendance
  • Include headers in Row 1
Step 3: Run Regression

Data → Data Analysis → Regression

  • Input Y Range: Select your dependent variable (including header)
  • Input X Range: Select your independent variable (including header)
  • Check "Labels" box
  • Select output location
  • Click OK
Alternative Method: Add a trendline to a scatter plot and check "Display Equation on chart" to see the regression formula visually.

What Are Residuals?

Residual Formula

eₜ = yₜ − ŷₜ

Residual = Actual Value − Predicted Value

Residuals tell us how far off our predictions are from reality. They are the "leftovers" after the model makes its best guess.

Good Residuals Indicate:

  • Model captures the pattern well
  • Predictions are reliable
  • Small, random errors

Bad Residuals Indicate:

  • Model missing key patterns
  • Systematic prediction errors
  • Need for model improvement
What to Look For:
  • ✓ Mean close to zero
  • ✓ Randomly scattered (no pattern)
  • ✓ Constant variance across fitted values
  • ✓ Preferably normally distributed
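
As a rough illustration of these checks, the sketch below computes residuals for a simple time-trend fit to the cricket attendance data from the earlier worked example. The variable names are illustrative, and the "within 2 standard deviations" count is only a quick screen, not a formal test.

```python
# Cricket attendance data (t = years after 2000, attendance in thousands).
t = [3, 4, 5, 6, 7, 8, 9, 10]
actual = [4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580]

# Least squares fit of a straight line: actual = intercept + slope * t.
n = len(t)
t_bar = sum(t) / n
y_bar = sum(actual) / n
slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, actual))
         / sum((ti - t_bar) ** 2 for ti in t))
intercept = y_bar - slope * t_bar

# Residual = actual value - predicted value.
predicted = [intercept + slope * ti for ti in t]
residuals = [yi - pi for yi, pi in zip(actual, predicted)]

mean_residual = sum(residuals) / n
# Residual standard deviation, using n - 2 because two coefficients were estimated.
residual_sd = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5
within_2sd = sum(1 for e in residuals if abs(e) <= 2 * residual_sd)

for ti, e in zip(t, residuals):
    print(f"t = {ti:2d}  residual = {e:8.1f}")
print(f"Mean residual: {mean_residual:.6f}")
print(f"Residual standard deviation: {residual_sd:.1f}")
print(f"Points within 2 standard deviations: {within_2sd} of {n}")
```

With an intercept in the model, least squares forces the residuals to average zero, so the more informative checks are the pattern and spread of the residuals rather than their mean.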

Good Residual Patterns

These residual plots indicate a well-fitting model:

✓ What Makes This Good:
  • Randomly scattered around zero
  • No clear pattern or trend
  • Consistent spread (variance) across all predicted values
  • Most points within ±2 standard deviations

Problematic Residual Patterns

These patterns suggest model issues:

Pattern 1: Curved Residuals

Problem: Non-linear relationship not captured

Solution: Try polynomial regression or transformation

Pattern 2: Increasing Spread

Problem: Heteroscedasticity (non-constant variance)

Solution: Transform Y variable (e.g., log)

⚠ Warning Signs: Outliers, systematic patterns, or unequal variance suggest your model needs refinement before using for forecasts.

Quick Check: Residuals

Your regression model predicts sales of 350 units, but actual sales were 320 units. What is the residual?
A) 30
B) -30
C) 670
D) Cannot be determined

Evaluating Model Quality: Key Metrics

Metric | What It Measures | Good Values
R² (R-squared) | Proportion of variance explained by the model | Close to 1 (or 100%); R² > 0.7 typically indicates good fit
F-statistic | Overall model significance | Large value; p-value < 0.05
Coefficient p-values | Statistical significance of each predictor | p < 0.05 (predictor is significant)
Standard Error | Average size of residuals (prediction error) | Smaller is better; depends on your data scale
Residual Plots | Visual check for model assumptions | Random scatter; no patterns

R² Interpretation

R² = Σ(ŷₜ − ȳ)² / Σ(yₜ − ȳ)²

Example: R² = 0.85 means "85% of variation in Y is explained by our model"
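
To make the formula concrete, here is a small sketch that computes R² for a straight-line trend fitted to the cricket attendance data from the earlier worked example. The function name `r_squared` is an illustrative choice.

```python
def r_squared(actual, fitted):
    """Explained variation divided by total variation."""
    y_bar = sum(actual) / len(actual)
    explained = sum((f - y_bar) ** 2 for f in fitted)
    total = sum((a - y_bar) ** 2 for a in actual)
    return explained / total


# Cricket attendance data (t = years after 2000, attendance in thousands).
t = [3, 4, 5, 6, 7, 8, 9, 10]
attendance = [4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580]

# Least squares straight-line fit.
n = len(t)
t_bar = sum(t) / n
y_bar = sum(attendance) / n
slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, attendance))
         / sum((ti - t_bar) ** 2 for ti in t))
intercept = y_bar - slope * t_bar
fitted = [intercept + slope * ti for ti in t]

r2 = r_squared(attendance, fitted)
print(f"R-squared: {r2:.3f}")
```

This ratio form matches 1 − (residual variation / total variation) only for a least squares fit that includes an intercept, which is the case here.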

Business Example: Ice Cream Sales

Scenario: An ice cream shop wants to forecast daily sales based on temperature.

The Model

Salesₜ = β₀ + β₁(Tempₜ) + εₜ

Regression Results:

Ŷ = 145.23 + 5.31 × Temperature

R² = 0.89

p-value < 0.001

Business Interpretation

  • β₀ = 145.23: Expected sales when temperature = 0°C (baseline demand)
  • β₁ = 5.31: For each 1°C increase, sales increase by $5.31
  • R² = 0.89: Temperature explains 89% of sales variation
Forecast for 30°C:

Sales = 145.23 + 5.31(30) = $304.53

Limits of Prediction: Uncertainty

Every prediction has uncertainty. We quantify this using prediction intervals.

95% Prediction Interval

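Forecast ± 2 × σ

Where σ = standard deviation of residuals (Standard Error). The multiplier 2 is a rounded version of 1.96 and assumes the residuals are roughly normally distributed.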

Example: Ice Cream Shop

Point forecast for 30°C: $304.53

Standard Error: σ = $25

95% Prediction Interval: $304.53 ± 2(25) = $254.53 to $354.53

We are 95% confident actual sales will fall within this range.

68% Interval:

Forecast ± 1σ

Narrower, less confident

95% Interval:

Forecast ± 2σ

Wider, more confident
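
The ice cream example can be reproduced in a few lines. This is a minimal sketch using the coefficients and standard error quoted above; the function names and the simple "forecast ± multiplier × σ" rule are the only assumptions, and the rule ignores uncertainty in the estimated coefficients.

```python
def point_forecast(temp_c, intercept=145.23, slope=5.31):
    """Forecast daily sales ($) from temperature using the fitted equation above."""
    return intercept + slope * temp_c


def prediction_interval(forecast, sigma, multiplier=2.0):
    """Approximate interval: forecast plus or minus multiplier * sigma."""
    half_width = multiplier * sigma
    return forecast - half_width, forecast + half_width


forecast = point_forecast(30)
low_95, high_95 = prediction_interval(forecast, sigma=25.0, multiplier=2.0)
low_68, high_68 = prediction_interval(forecast, sigma=25.0, multiplier=1.0)

print(f"Point forecast: ${forecast:.2f}")
print(f"Approx. 95% interval: ${low_95:.2f} to ${high_95:.2f}")
print(f"Approx. 68% interval: ${low_68:.2f} to ${high_68:.2f}")
```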

Quick Check: Model Evaluation

You build a regression model with R² = 0.45. What does this mean?
A) The model is 45% accurate
B) 45% of predictions are correct
C) The model explains 45% of variation in the outcome variable
D) There is a 45% correlation between variables

Types of Regression Curves

Different business scenarios require different curve shapes:

Linear

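Yₜ = a + bt

Constant absolute growth (the same amount added each period)

Example: Subscription revenue that grows by a similar dollar amount each month

Quadratic

Yₜ = a + bt + ct²

Accelerating or decelerating growth

Example: Early-stage product adoption that speeds up each period (a full S-shaped adoption curve is better described by the logistic form)

Exponential

Yₜ = ae^(bt)

Percentage-based growth

Example: Viral social media growth

Logistic

Yₜ = U/(1+ae^(-kt))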

Growth toward an upper limit

Example: Market saturation (smartphone adoption)
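
One common way to fit the exponential form is to take logs, since ln(Y) = ln(a) + bt is a straight line in t. The sketch below illustrates this under stated assumptions: the data are synthetic, generated from hypothetical values a = 100 and b = 0.1 with no noise, so the fit recovers them almost exactly. Real data would not behave this cleanly.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Synthetic, noise-free series from hypothetical parameters (illustration only).
a_true, b_true = 100.0, 0.1
t = list(range(1, 11))
y = [a_true * math.exp(b_true * ti) for ti in t]

# Regress ln(Y) on t, then convert the intercept back to the original scale.
log_y = [math.log(yi) for yi in y]
ln_a_hat, b_hat = fit_line(t, log_y)
a_hat = math.exp(ln_a_hat)

# Percentage growth per period implied by b.
growth_pct = (math.exp(b_hat) - 1) * 100

print(f"a = {a_hat:.2f}, b = {b_hat:.3f}, growth per period = {growth_pct:.2f}%")
```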

Caution: All forecasting involves extrapolation – assuming current trends continue. Always consider if external changes might alter the pattern.

Multiple Linear Regression

Key Advancement: Use multiple predictors simultaneously to improve forecast accuracy

General Formula

Yₜ = β₀ + β₁X₁,ₜ + β₂X₂,ₜ + ... + βₙXₙ,ₜ + εₜ

Each predictor (X₁, X₂, ..., Xₙ) has its own coefficient (β₁, β₂, ..., βₙ)

Benefits

  • Captures complex relationships
  • Controls for confounding factors
  • Typically higher R² than simple regression
  • More realistic business models

Challenges

  • Risk of multicollinearity (correlated predictors)
  • Requires more data
  • More complex interpretation
  • Overfitting with too many predictors

Example: SaaS Subscription Revenue

Business Question: What drives monthly subscription revenue for a SaaS company?

The Model

Revenueₜ = β₀ + β₁(Marketingₜ) + β₂(Active_Usersₜ) + β₃(Satisfactionₜ) + εₜ

Variable | Coefficient | p-value | Business Meaning
Intercept (β₀) | 52.3 | 0.001 | Base revenue with zero predictors
Marketing Spend (β₁) | 2.84 | 0.003 | Each $1,000 marketing → +$2,840 revenue
Active Users (β₂) | 20.15 | <0.001 | Each 1,000 users → +$20,150 revenue
Satisfaction Score (β₃) | 8.67 | 0.012 | Each point (1-10) → +$8,670 revenue

Model Performance: R² = 0.92, F-statistic p < 0.001

This model explains 92% of revenue variation, which is a strong in-sample fit. Forecast accuracy on new data should still be checked separately.
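
The coefficient table can be turned into a prediction function. The sketch below assumes, based on the "Business Meaning" column, that revenue and marketing are measured in $1,000s and active users in 1,000s; the input values (marketing of $50,000, 12,000 users, satisfaction of 8) are hypothetical.

```python
# Coefficients from the SaaS table above.
# Assumed units: revenue and marketing in $1,000s, active users in 1,000s.
INTERCEPT = 52.3
B_MARKETING = 2.84
B_ACTIVE_USERS = 20.15
B_SATISFACTION = 8.67


def predict_revenue(marketing_k, active_users_k, satisfaction):
    """Predicted monthly revenue in $1,000s."""
    return (INTERCEPT
            + B_MARKETING * marketing_k
            + B_ACTIVE_USERS * active_users_k
            + B_SATISFACTION * satisfaction)


# Hypothetical month: $50,000 marketing, 12,000 active users, satisfaction 8.
base = predict_revenue(50, 12, 8)
# Same month with $1,000 more marketing, all else held fixed.
more_marketing = predict_revenue(51, 12, 8)

print(f"Predicted revenue: ${base:.2f}k")
print(f"Effect of +$1,000 marketing: ${more_marketing - base:.2f}k")
```

Changing one input while holding the others fixed shows what each coefficient means: the difference between the two predictions equals the marketing coefficient.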

Quick Check: Multiple Regression

In the SaaS example, Active Users had coefficient β₂ = 20.15 with p < 0.001. What does p < 0.001 tell us?
A) Active Users causes 99.9% of revenue
B) The coefficient might be wrong
C) Active Users is a statistically significant predictor
D) We can only be 0.1% confident in this predictor

Dummy Variables for Categorical Effects

Purpose: Include categorical (yes/no) factors like weekends, holidays, promotions, or seasons in regression models

How It Works

Create a binary variable:

  • 1 = Event occurs (e.g., weekend)
  • 0 = Event doesn't occur (weekday)
Day | Weekend?
Monday | 0
Tuesday | 0
Saturday | 1
Sunday | 1

Business Examples

  • Promotional Periods: Measure impact on sales
  • Holiday Seasons: Seasonal effects on demand
  • Product Launches: Step-change in customer interest
  • Day of Week: Weekend vs. weekday patterns
  • Store Type: Flagship vs. regional locations

Example Formula

Salesₜ = β₀ + β₁(Temperatureₜ) + β₂(Weekendₜ) + εₜ

β₂ captures the additional sales on weekends, controlling for temperature

How Dummy Variables Shift Predictions

Weekday Equation:

Sales = β₀ + β₁(Temp) + β₂(0)

= β₀ + β₁(Temp)

Weekend Equation:

Sales = β₀ + β₁(Temp) + β₂(1)

= (β₀ + β₂) + β₁(Temp)

Higher intercept!

The dummy variable creates a parallel shift – same slope, different baseline.
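
The parallel shift can be seen numerically. In the sketch below the coefficient values (100, 5 and 40) are hypothetical and chosen only for illustration; the point is that the weekend-minus-weekday gap equals β₂ at every temperature.

```python
WEEKEND_DAYS = {"Saturday", "Sunday"}


def weekend_dummy(day_name):
    """Return 1 for Saturday or Sunday, otherwise 0."""
    return 1 if day_name in WEEKEND_DAYS else 0


# Hypothetical coefficients, for illustration only.
B0, B1, B2 = 100.0, 5.0, 40.0


def predict_sales(temp_c, day_name):
    """Sales = B0 + B1 * temperature + B2 * weekend dummy."""
    return B0 + B1 * temp_c + B2 * weekend_dummy(day_name)


gaps = []
for temp in (15, 25, 35):
    weekday = predict_sales(temp, "Monday")
    weekend = predict_sales(temp, "Saturday")
    gaps.append(weekend - weekday)
    print(f"{temp} C: weekday {weekday:.1f}, weekend {weekend:.1f}, gap {weekend - weekday:.1f}")
```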

Worked Example: Hobby Store Revenue

Scenario: A hobby store wants to understand how quarterly revenue is affected by time trend AND whether it's a holiday quarter (Q4).
Revenueₜ = β₀ + β₁(Quarterₜ) + β₂(Holiday_Quarterₜ) + εₜ

Quarter | Revenue ($1000s) | Holiday Quarter
1 | 245 | 0
2 | 268 | 0
3 | 285 | 0
4 | 412 | 1
5 | 301 | 0
6 | 318 | 0

Regression Results:

Ŷ = 220.5 + 15.3(Quarter) + 95.7(Holiday_Quarter)

  • β₁ = 15.3: Revenue increases $15,300/quarter on average
  • β₂ = 95.7: Q4 holidays add an extra $95,700 in revenue
  • R² = 0.88: Model explains 88% of variation
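
The fitted equation can be used to forecast quarters beyond the table. This sketch takes the coefficients as quoted above and assumes that quarter numbering simply continues, so every fourth quarter (4, 8, 12, ...) is a holiday quarter; the helper names are illustrative.

```python
def is_holiday_quarter(quarter):
    """Assumption: quarters are numbered consecutively, so every 4th one is Q4."""
    return 1 if quarter % 4 == 0 else 0


def predict_revenue(quarter):
    """Predicted revenue in $1,000s from the fitted equation above."""
    return 220.5 + 15.3 * quarter + 95.7 * is_holiday_quarter(quarter)


forecasts = {q: predict_revenue(q) for q in (7, 8)}
for q, value in forecasts.items():
    label = "holiday" if is_holiday_quarter(q) else "regular"
    print(f"Quarter {q} ({label}): ${value:.1f}k")
```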

Lag Terms: Using Past Values

Core Idea: Today's value often depends on yesterday's value (autocorrelation)

Lag-1 Model

Yₜ = β₀ + β₁Yₜ₋₁ + εₜ

Use yesterday's value (Yₜ₋₁) to predict today's value (Yₜ)

When to Use Lags

  • Stock prices (momentum)
  • Sales (persistent demand)
  • Website traffic (retention)
  • Inventory levels
  • Any slowly-changing process

Can Combine With Other Predictors

Salesₜ = β₀ + β₁Salesₜ₋₁ + β₂(Weekendₜ) + εₜ

Captures both persistence AND weekend effects

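Important: A lag model forecasts one step ahead directly, because it needs the previous period's actual value. To forecast several days ahead, either wait for each intermediate actual value or use recursive forecasting, where each forecast is fed back in as the next lagged input. Forecast errors then accumulate as the horizon grows.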
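
Recursive forecasting can be sketched as a short loop. The coefficients (intercept 50, lag coefficient 0.8) and the starting value of 300 are hypothetical, chosen only to show the mechanics.

```python
def recursive_forecast(last_actual, steps, intercept, lag_coef):
    """Forecast several steps ahead by feeding each forecast back in as the lag."""
    forecasts = []
    previous = last_actual
    for _ in range(steps):
        previous = intercept + lag_coef * previous
        forecasts.append(previous)
    return forecasts


# Hypothetical lag-1 model: Y_t = 50 + 0.8 * Y_(t-1), last observed value 300.
path = recursive_forecast(last_actual=300.0, steps=3, intercept=50.0, lag_coef=0.8)
for step, value in enumerate(path, start=1):
    print(f"{step} step(s) ahead: {value:.1f}")

# With a lag coefficient below 1, forecasts drift toward intercept / (1 - lag_coef).
long_run_level = 50.0 / (1 - 0.8)
print(f"Long-run level: {long_run_level:.1f}")
```

Only the first step uses a real observation; every later step builds on an earlier forecast, which is why uncertainty widens with the horizon.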

Quick Check: Lag Terms

A coffee shop builds this model: Salesₜ = 150 + 0.6 × Salesₜ₋₁

If yesterday's sales were $300, what is today's forecast?
A) $150
B) $180
C) $330
D) $450

Putting It All Together

Real business forecasting often combines multiple components: trend, external predictors, categorical effects, and past values

Comprehensive Model Example

Revenueₜ = β₀ + β₁(Timeₜ) + β₂(Marketingₜ) + β₃(Competitor_Priceₜ) + β₄(Weekendₜ) + β₅(Holidayₜ) + β₆(Revenueₜ₋₁) + εₜ

Component | What It Captures | Example Business Insight
Time (t) | Overall growth trend | "We're growing 2% per month"
Marketingₜ | Impact of advertising spend | "Each $1,000 marketing → $3,200 revenue"
Competitor_Priceₜ | Market competition effects | "When competitors raise prices, we gain sales"
Weekendₜ | Day-of-week pattern | "Weekends generate 15% more sales"
Holidayₜ | Special event effects | "Black Friday adds $50,000"
Revenueₜ₋₁ | Persistence/momentum | "High sales yesterday → likely high sales today"
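
Before such a model can be fitted, each observation needs one row containing all six predictors. The sketch below shows one possible way to assemble those rows; the four daily records are invented for illustration, and the field names are assumptions rather than a prescribed layout.

```python
# Hypothetical daily records, invented for illustration only.
records = [
    {"revenue": 1000.0, "marketing": 2.0, "competitor_price": 9.5, "day": "Friday", "holiday": 0},
    {"revenue": 1300.0, "marketing": 2.5, "competitor_price": 9.5, "day": "Saturday", "holiday": 0},
    {"revenue": 1250.0, "marketing": 1.5, "competitor_price": 9.9, "day": "Sunday", "holiday": 0},
    {"revenue": 1100.0, "marketing": 2.0, "competitor_price": 9.9, "day": "Monday", "holiday": 1},
]


def build_rows(records):
    """Return (features, targets); the first record is dropped because it has no lag."""
    features, targets = [], []
    for i in range(1, len(records)):
        current, previous = records[i], records[i - 1]
        features.append({
            "time": i + 1,  # time index 1, 2, 3, ...; the second record is t = 2
            "marketing": current["marketing"],
            "competitor_price": current["competitor_price"],
            "weekend": 1 if current["day"] in ("Saturday", "Sunday") else 0,
            "holiday": current["holiday"],
            "revenue_lag1": previous["revenue"],  # yesterday's revenue
        })
        targets.append(current["revenue"])
    return features, targets


features, targets = build_rows(records)
for row, target in zip(features, targets):
    print(row, "->", target)
```

Adding a lag always costs one observation at the start of the series, which matters when the dataset is short.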

Comparing Forecasting Methods

Method | Best For | Strengths | Limitations
Moving Average | Stable, short-term data | Simple, smooth noise | Can't forecast far ahead, lags trends
Exponential Smoothing | Data with trends | Adaptive, handles trends | Limited for complex patterns
ARIMA | Complex time patterns | Captures autocorrelation, trends, seasonality | Requires stationarity, complex setup
Prophet | Business data with events | Handles holidays automatically, robust | Black box, less control
Regression | Known predictor variables | Interpretable, flexible, explanatory | Needs predictor forecasts, assumes linearity

Decision Rule: Use regression when you have reliable predictor variables and want to understand why the forecast is what it is, not just what it will be.

Regression Forecasting Workflow

Step 1: Define Business Question

What are you trying to predict? What factors might influence it?

Step 2: Collect and Prepare Data
  • Ensure equal time intervals
  • Check for missing values
  • Identify outliers (are they plausible?)
  • Create dummy variables for categorical factors
Step 3: Build Initial Model

Start simple (one or two predictors), then add complexity

Step 4: Evaluate Model Quality
  • Check R² (> 0.7 preferred)
  • Verify p-values (< 0.05 for significance)
  • Examine residual plots (random scatter?)
  • Calculate prediction intervals
Step 5: Refine and Forecast
  • Remove non-significant predictors
  • Add relevant variables if R² is low
  • Consider transformations if residuals show patterns
  • Generate forecasts with prediction intervals
Step 6: Monitor and Update

Regularly check actual vs. predicted; retrain model with new data

Common Mistakes to Avoid

⚠ Correlation ≠ Causation

A strong regression relationship doesn't prove one variable causes the other. Ice cream sales and drowning deaths are correlated (both occur in summer), but ice cream doesn't cause drowning.

⚠ Overfitting

Adding too many predictors can create a model that fits historical data perfectly but forecasts poorly. Aim for parsimony – use the fewest predictors that explain the most variation.

⚠ Extrapolation Beyond Data Range

If your temperature data ranges from 10°C to 35°C, don't trust predictions for -5°C or 45°C. The relationship may not hold outside observed ranges.

⚠ Ignoring Multicollinearity

If predictors are highly correlated with each other (e.g., temperature and a summer-season dummy), coefficients become unstable. Check the correlation matrix of the predictors before building models.
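
A quick pairwise check can be sketched as follows. The three predictor series are invented for illustration, and the 0.8 cut-off is a common rule of thumb rather than a fixed standard.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical candidate predictors (illustration only).
predictors = {
    "temperature": [10, 15, 20, 25, 30, 35],
    "sunshine_hours": [3, 4, 6, 7, 9, 10],
    "marketing": [5, 9, 4, 8, 5, 7],
}

THRESHOLD = 0.8  # rule-of-thumb cut-off, an assumption
names = list(predictors)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(predictors[first], predictors[second])
        print(f"{first} vs {second}: r = {r:.2f}")
        if abs(r) > THRESHOLD:
            flagged.append((first, second))

print("Highly correlated pairs:", flagged)
```

When a pair is flagged, one option is to keep only the predictor that is easier to forecast or to interpret.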

⚠ Assuming Linearity

Not all relationships are linear. Check scatter plots – if you see curves, consider polynomial or exponential models.

Key Takeaways

1. Regression Leverages Relationships

Unlike pure time series methods, regression uses known predictors to explain and forecast outcomes.

2. Multiple Components Improve Accuracy

Combine trends, external variables, dummy variables, and lags for comprehensive models.

3. Always Validate Your Model

Check R², p-values, and residual plots. A high R² alone doesn't guarantee good forecasts.

4. Interpretability is Valuable

Regression coefficients tell you how much each factor matters – crucial for business decision-making.

5. Choose the Right Tool

Regression excels when you have predictor data. For pure time patterns without external variables, ARIMA or Prophet may be better.

Remember: All models are approximations. The goal isn't perfection – it's making informed decisions under uncertainty.

Next Steps in Your Learning Journey

Week 10: Building End-to-End Forecasting Solutions

Integrate multiple methods, create model pipelines, and measure business impact

Week 11: Adaptive Forecasting

Match methods to specific business problems, handle data quality issues, and communicate results

Week 12: Final Assessment

Demonstrate your mastery of forecasting techniques in a comprehensive business scenario

Practice Exercises Available on Moodle:

  • Excel regression with real business datasets
  • Model diagnostics interpretation
  • Dummy variable creation and application
  • Comparative forecasting challenges
Questions?

Consult with your instructor or use the discussion forums
