Using known relationships to forecast
Understanding relationships between variables to predict future outcomes
By the end of this lesson, you will be able to:
Predict outcomes using one predictor variable (e.g., sales from temperature)
Use multiple factors simultaneously (e.g., revenue from marketing, users, satisfaction)
Determine if your model is making accurate predictions
Account for events like weekends, holidays, or promotions
Use yesterday's values to predict today's outcomes
When you have known predictor variables (temperature, marketing spend, day of week), regression leverages these relationships to make more informed forecasts than methods relying solely on historical patterns.
| Data Type | Description | Example |
|---|---|---|
| Time Series | Observations of one subject over time at regular intervals | Daily stock prices for Tesla (2020-2025) |
| Cross-Sectional | Observations of different subjects at the same point in time | Salaries of 100 employees on 1 Jan 2024 |
| Panel Data | Observations of multiple subjects over time | Monthly sales for 50 retail stores (2020-2025) |
| Multivariate Time Series | Multiple variables tracked simultaneously over time | Daily stock price AND trading volume for one company |
Data from different sources collected at the same time
A snapshot in time
Example: Survey 1,000 customers on their satisfaction levels today
Data collected from the same sources over a period of time
A movie over time
Example: Track the same 1,000 customers' satisfaction monthly for 2 years
Where t is simply the time period (1, 2, 3, ...)
Each point represents one observation. The line represents our regression model.
Can we forecast future attendance using a time trend?
| Year After 2000 | Attendance (1000s/year) |
|---|---|
| 3 (2003) | 4,050 |
| 4 (2004) | 3,650 |
| 5 (2005) | 4,380 |
| 6 (2006) | 4,320 |
| 7 (2007) | 5,820 |
| 8 (2008) | 6,150 |
| 9 (2009) | 5,550 |
| 10 (2010) | 6,580 |
This means attendance increases by approximately 408,000 people per year
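As a rough check, the trend line can be fitted to the attendance table above with NumPy. This is a minimal sketch, not the lesson's own workflow; `forecast_attendance` is a hypothetical helper name. On the eight rows shown, the least-squares slope comes out near 405 (thousand per year), close to the roughly 408,000 quoted above; the small gap may reflect rounding or rows not shown in the table.

```python
import numpy as np

# Attendance table from the text: years after 2000, attendance in 1000s/year.
t = np.array([3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
attendance = np.array([4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580], dtype=float)

# Least-squares line: attendance ≈ intercept + slope * t.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(t, attendance, deg=1)


def forecast_attendance(year_after_2000):
    """Extrapolate the fitted trend line (hypothetical helper)."""
    return intercept + slope * year_after_2000


print(f"slope = {slope:.1f} thousand per year, intercept = {intercept:.1f}")
print(f"2011 forecast = {forecast_attendance(11):.0f} thousand")
```

Forecasting 2011 is only one step beyond the observed years; the further the extrapolation, the less the straight-line assumption can be trusted.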
File → Options → Add-ins → Manage: Excel Add-ins → Go → check Analysis ToolPak → OK
Data → Data Analysis → Regression
Residual = Actual Value − Predicted Value
These residual plots indicate a well-fitting model:
These patterns suggest model issues:
Problem: Non-linear relationship not captured
Solution: Try polynomial regression or transformation
Problem: Heteroscedasticity (non-constant variance)
Solution: Transform Y variable (e.g., log)
The negative value indicates our model over-predicted by 30 units.
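The sign convention can be checked on the attendance table from earlier in the lesson. The sketch below refits the trend line and prints each residual; it uses the attendance data rather than the 30-unit example above. A negative residual means the model over-predicted, a positive one means it under-predicted.

```python
import numpy as np

# Attendance table from the text (years after 2000, attendance in 1000s/year).
t = np.array([3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
actual = np.array([4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580], dtype=float)

slope, intercept = np.polyfit(t, actual, deg=1)
predicted = intercept + slope * t

# Residual = Actual Value - Predicted Value
residuals = actual - predicted

for year, res in zip(t, residuals):
    direction = "over-predicted" if res < 0 else "under-predicted"
    print(f"t={int(year):2d}: residual = {res:7.1f} ({direction})")

# With an intercept in the model, least-squares residuals sum to about zero.
print(f"sum of residuals = {residuals.sum():.6f}")
```

Plotting `residuals` against `t` or against `predicted` gives the residual plot described above.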
| Metric | What It Measures | Good Values |
|---|---|---|
| R² (R-squared) | Proportion of variance explained by the model | Close to 1 (or 100%); R² > 0.7 typically indicates good fit |
| F-statistic | Overall model significance | Large value; p-value < 0.05 |
| Coefficient p-values | Statistical significance of each predictor | p < 0.05 (predictor is significant) |
| Standard Error | Average size of residuals (prediction error) | Smaller is better; depends on your data scale |
| Residual Plots | Visual check for model assumptions | Random scatter; no patterns |
Example: R² = 0.85 means "85% of variation in Y is explained by our model"
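To make two of these metrics concrete, the sketch below computes R² and the standard error by hand for a trend line fitted to the attendance table from earlier. The formulas are the standard ones (R² = 1 − SSE/SST, standard error = √(SSE/(n − 2)) for one predictor); variable names are illustrative.

```python
import numpy as np

# Attendance table from the text (years after 2000, attendance in 1000s/year).
t = np.array([3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([4050, 3650, 4380, 4320, 5820, 6150, 5550, 6580], dtype=float)

slope, intercept = np.polyfit(t, y, deg=1)
residuals = y - (intercept + slope * t)

sse = np.sum(residuals ** 2)          # unexplained variation
sst = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r_squared = 1.0 - sse / sst

# Standard error of the regression: n - 2 degrees of freedom
# because two parameters (intercept and slope) were estimated.
std_error = np.sqrt(sse / (len(y) - 2))

print(f"R² = {r_squared:.3f}")
print(f"standard error = {std_error:.1f} (in 1000s of attendees)")
```

On these eight rows R² comes out near 0.83, so the time trend alone accounts for most, but not all, of the year-to-year variation.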
Ŷ = 145.23 + 5.31 × Temperature
R² = 0.89
p-value < 0.001
Sales = 145.23 + 5.31(30) = $304.53
Where σ = standard deviation of residuals (Standard Error)
Point forecast for 30°C: $304.53
Standard Error: σ = $25
95% Prediction Interval: $304.53 ± 2(25) = $254.53 to $354.53
We are 95% confident actual sales will fall within this range.
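The point forecast and interval above can be reproduced in a few lines. This is a sketch using the coefficients and σ quoted in the text; the function names are hypothetical, and the ± 2σ rule is an approximation that ignores uncertainty in the estimated coefficients.

```python
# Coefficients and residual standard error quoted in the text.
INTERCEPT = 145.23
SLOPE = 5.31       # dollars of sales per degree Celsius
SIGMA = 25.0       # standard error of the regression, in dollars


def predict_sales(temp_c):
    """Point forecast from the fitted line."""
    return INTERCEPT + SLOPE * temp_c


def prediction_interval(temp_c, k=2.0):
    """Approximate interval: forecast ± k * sigma (k=2 is roughly 95%)."""
    point = predict_sales(temp_c)
    return point - k * SIGMA, point + k * SIGMA


point = predict_sales(30)
low, high = prediction_interval(30)
print(f"forecast: ${point:.2f}, approx. 95% interval: ${low:.2f} to ${high:.2f}")
```

Passing `k=1.0` gives the narrower ± 1σ band.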
Forecast ± 1σ
Narrower, less confident (roughly 68% coverage if residuals are approximately normal)
Forecast ± 2σ
Wider, more confident (roughly 95% coverage)
Note: This is a moderate fit. Typically, R² > 0.7 is considered strong.
Different business scenarios require different curve shapes:
Constant growth rate
Example: Subscription revenue with steady monthly growth
Accelerating or decelerating growth
Example: Product adoption curve (S-shaped)
Percentage-based growth
Example: Viral social media growth
Growth toward an upper limit
Example: Market saturation (smartphone adoption)
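One common way to fit the percentage-based growth shape is to regress the logarithm of the outcome on time, since a constant percentage growth rate becomes a straight line on a log scale. The sketch below uses synthetic, noise-free data (an assumed 10% monthly growth from a starting level of 100) purely to show the mechanics; real data would not be recovered this exactly.

```python
import numpy as np

# Synthetic data (assumption): revenue grows exactly 10% per month from 100.
months = np.arange(12, dtype=float)
revenue = 100.0 * 1.10 ** months

# Fit log(revenue) = log_intercept + log_slope * month.
log_slope, log_intercept = np.polyfit(months, np.log(revenue), deg=1)

# Convert back from the log scale.
growth_rate = np.exp(log_slope) - 1.0    # per-month percentage growth
start_level = np.exp(log_intercept)      # fitted level at month 0

print(f"estimated monthly growth: {growth_rate:.1%}")
print(f"estimated starting level: {start_level:.1f}")

# Forecast for month 12 on the original scale.
forecast_12 = start_level * (1.0 + growth_rate) ** 12
print(f"month 12 forecast: {forecast_12:.1f}")
```

A scatter plot of the raw data and of its logarithm is a quick way to judge whether a straight line or this log-linear form is the better match.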
Each predictor (X₁, X₂, ..., Xn) has its own coefficient (β₁, β₂, ..., βn)
| Variable | Coefficient | p-value | Business Meaning |
|---|---|---|---|
| Intercept (β₀) | 52.3 | 0.001 | Base revenue when all predictors are zero |
| Marketing Spend (β₁) | 2.84 | 0.003 | Each $1,000 marketing → +$2,840 revenue |
| Active Users (β₂) | 20.15 | <0.001 | Each 1,000 users → +$20,150 revenue |
| Satisfaction Score (β₃) | 8.67 | 0.012 | Each point (1-10) → +$8,670 revenue |
This model explains 92% of revenue variation, which is a strong in-sample fit.
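The coefficient table translates directly into a forecasting function. The sketch below assumes the units implied by the "Business Meaning" column: revenue and marketing in $1,000s, active users in thousands, satisfaction on a 1-10 scale. The input values are hypothetical, chosen only to illustrate the arithmetic.

```python
# Coefficients from the table in the text.
B0_INTERCEPT = 52.3
B1_MARKETING = 2.84       # per $1,000 of marketing spend
B2_USERS = 20.15          # per 1,000 active users
B3_SATISFACTION = 8.67    # per satisfaction point (1-10)


def predict_revenue(marketing_k, users_k, satisfaction):
    """Predicted revenue in $1,000s (units are an assumption, see lead-in)."""
    return (B0_INTERCEPT
            + B1_MARKETING * marketing_k
            + B2_USERS * users_k
            + B3_SATISFACTION * satisfaction)


# Hypothetical scenario: $10k marketing, 5,000 users, satisfaction of 8.
base = predict_revenue(10, 5, 8)
print(f"predicted revenue: ${base:.2f}k")

# Each coefficient is the change from one extra unit, other inputs held fixed.
extra_marketing = predict_revenue(11, 5, 8) - base
print(f"effect of +$1k marketing: ${extra_marketing:.2f}k")
```

Comparing scenarios this way is where the interpretability of regression pays off: the difference between two forecasts is fully explained by the coefficients.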
We typically use p < 0.05 as the threshold for "statistical significance."
Create a binary variable:
| Day | Weekend? |
|---|---|
| Monday | 0 |
| Tuesday | 0 |
| Saturday | 1 |
| Sunday | 1 |
β₂ captures the additional sales on weekends, controlling for temperature
Weekday (Weekend = 0):
Sales = β₀ + β₁(Temp) + β₂(0)
= β₀ + β₁(Temp)
Weekend (Weekend = 1):
Sales = β₀ + β₁(Temp) + β₂(1)
= (β₀ + β₂) + β₁(Temp)
Higher intercept (when β₂ > 0), same slope.
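A minimal sketch of estimating such a model with NumPy follows. The data are synthetic and noise-free, generated from assumed coefficients (β₀ = 100, β₁ = 5, β₂ = 40), so the fit should recover them; the point is only to show how the 0/1 column enters the design matrix.

```python
import numpy as np

# Synthetic inputs (assumption): daily temperature and a weekend flag.
temp = np.array([18, 22, 25, 30, 20, 27, 24, 29], dtype=float)
weekend = np.array([0, 0, 0, 0, 1, 1, 0, 1], dtype=float)

# Sales generated from known coefficients so the result can be checked.
TRUE_B0, TRUE_B1, TRUE_B2 = 100.0, 5.0, 40.0
sales = TRUE_B0 + TRUE_B1 * temp + TRUE_B2 * weekend

# Design matrix: a column of ones (intercept), temperature, weekend dummy.
X = np.column_stack([np.ones_like(temp), temp, weekend])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = coef

print(f"weekday line: Sales = {b0:.1f} + {b1:.1f} * Temp")
print(f"weekend line: Sales = {b0 + b2:.1f} + {b1:.1f} * Temp")
```

The two printed lines share a slope and differ only in the intercept, which is exactly the shift that β₂ represents.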
| Quarter | Revenue ($1000s) | Holiday Quarter |
|---|---|---|
| 1 | 245 | 0 |
| 2 | 268 | 0 |
| 3 | 285 | 0 |
| 4 | 412 | 1 |
| 5 | 301 | 0 |
| 6 | 318 | 0 |
Ŷ = 220.5 + 15.3(Quarter) + 95.7(Holiday_Quarter)
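The fitted equation can then be used to forecast upcoming quarters. The sketch below takes the coefficients as quoted and assumes, as an illustration only, that the holiday quarter recurs every fourth quarter, so quarter 8 is the next holiday quarter.

```python
# Coefficients quoted in the text (revenue in $1000s).
INTERCEPT = 220.5
TREND_PER_QUARTER = 15.3
HOLIDAY_LIFT = 95.7


def forecast_revenue(quarter, holiday_quarter):
    """Revenue forecast in $1000s; holiday_quarter is 0 or 1."""
    return INTERCEPT + TREND_PER_QUARTER * quarter + HOLIDAY_LIFT * holiday_quarter


def is_holiday_quarter(quarter):
    """Assumption: every fourth quarter (4, 8, 12, ...) is a holiday quarter."""
    return 1 if quarter % 4 == 0 else 0


for q in (7, 8):
    forecast = forecast_revenue(q, is_holiday_quarter(q))
    print(f"quarter {q}: forecast = ${forecast:.1f}k")
```

Because the dummy's future values are known from the calendar, this predictor does not need a forecast of its own.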
Use yesterday's value (Yₜ₋₁) to predict today's value (Yₜ)
Captures both persistence AND weekend effects
Salesₜ = 150 + 0.6 × 300 = 150 + 180 = $330
The model says each $1 of yesterday's sales carries over as $0.60 in today's forecast, on top of a baseline of $150.
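A lag model can be rolled forward by feeding each forecast back in as the next day's "yesterday". The sketch below uses the baseline of 150 and lag coefficient of 0.6 from the example; `forecast_path` is a hypothetical helper name.

```python
BASELINE = 150.0
LAG_COEF = 0.6


def forecast_path(last_sales, steps):
    """Iterate Sales_t = BASELINE + LAG_COEF * Sales_(t-1) for several days."""
    path = []
    prev = last_sales
    for _ in range(steps):
        prev = BASELINE + LAG_COEF * prev
        path.append(prev)
    return path


path = forecast_path(300.0, 3)
print([round(value, 2) for value in path])

# With a lag coefficient below 1, forecasts settle toward a long-run level
# where Sales = BASELINE + LAG_COEF * Sales.
long_run = BASELINE / (1.0 - LAG_COEF)
print(f"long-run level: ${long_run:.2f}")
```

The first step reproduces the $330 from the worked example; later steps move progressively closer to the long-run level, so multi-day forecasts from a pure lag model carry less and less information about yesterday.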
| Component | What It Captures | Example Business Insight |
|---|---|---|
| Time (t) | Overall growth trend | "We're growing 2% per month" |
| Marketingₜ | Impact of advertising spend | "Each $1,000 marketing → $3,200 revenue" |
| Competitor_Priceₜ | Market competition effects | "When competitors raise prices, we gain sales" |
| Weekendₜ | Day-of-week pattern | "Weekends generate 15% more sales" |
| Holidayₜ | Special event effects | "Black Friday adds $50,000" |
| Revenueₜ₋₁ | Persistence/momentum | "High sales yesterday → likely high sales today" |
| Method | Best For | Strengths | Limitations |
|---|---|---|---|
| Moving Average | Stable, short-term data | Simple, smooths noise | Can't forecast far ahead, lags trends |
| Exponential Smoothing | Data with trends | Adaptive, handles trends | Limited for complex patterns |
| ARIMA | Complex time patterns | Captures autocorrelation, trends, seasonality | Requires stationarity, complex setup |
| Prophet | Business data with events | Handles holidays automatically, robust | Black box, less control |
| Regression | Known predictor variables | Interpretable, flexible, explanatory | Needs predictor forecasts, assumes linearity |
What are you trying to predict? What factors might influence it?
Start simple (one or two predictors), then add complexity
Regularly check actual vs. predicted; retrain model with new data
A strong regression relationship doesn't prove one variable causes the other. Ice cream sales and drowning deaths are correlated (both occur in summer), but ice cream doesn't cause drowning.
Adding too many predictors can create a model that fits historical data perfectly but forecasts poorly. Aim for parsimony – use the fewest predictors that explain the most variation.
If your temperature data ranges from 10°C to 35°C, don't trust predictions for -5°C or 45°C. The relationship may not hold outside observed ranges.
If predictors are highly correlated with each other (e.g., temperature and ice cream sales used together as predictors), coefficients become unstable. Check the correlation matrix before building models.
Not all relationships are linear. Check scatter plots – if you see curves, consider polynomial or exponential models.
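The correlated-predictors check mentioned above can be sketched with `np.corrcoef`. The data here are synthetic (a seeded random generator), and the 0.8 cutoff is an illustrative rule of thumb rather than a fixed standard.

```python
import numpy as np

# Synthetic predictors (assumption): ice cream sales driven by temperature,
# marketing spend unrelated to either.
rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, size=200)
ice_cream_sales = 20 + 3 * temperature + rng.normal(0, 2, size=200)
marketing = rng.uniform(0, 10, size=200)

names = ["temperature", "ice_cream_sales", "marketing"]
X = np.column_stack([temperature, ice_cream_sales, marketing])

# rowvar=False treats each column as one variable.
corr = np.corrcoef(X, rowvar=False)

THRESHOLD = 0.8  # illustrative cutoff
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > THRESHOLD
]

for first, second, r in flagged:
    print(f"{first} and {second} are highly correlated (r = {r:.2f})")
```

When a pair is flagged, dropping one of the two predictors or combining them is usually simpler than trying to interpret both coefficients.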
Unlike pure time series methods, regression uses known predictors to explain and forecast outcomes.
Combine trends, external variables, dummy variables, and lags for comprehensive models.
Check R², p-values, and residual plots. A high R² alone doesn't guarantee good forecasts.
Regression coefficients tell you how much each factor matters – crucial for business decision-making.
Regression excels when you have predictor data. For pure time patterns without external variables, ARIMA or Prophet may be better.
Integrate multiple methods, create model pipelines, and measure business impact
Match methods to specific business problems, handle data quality issues, and communicate results
Demonstrate your mastery of forecasting techniques in a comprehensive business scenario