DATA4400 · Lesson 3
Moving Averages, Stationarity & Correlation Analysis
Data-Driven Forecasting
Kaplan Business School · Master of Business Analytics
Section 1 Stationarity
Section 2 Discrete White Noise & Random Walk
Section 3 Differencing
Section 4 Moving Averages
Learning Outcomes
By the end of this lesson you will be able to:
1. Evaluate the concept of stationarity
Identify whether a time series is stationary or non-stationary using visual inspection and statistical tests.
2. Understand and apply differencing
Use first-order and second-order differencing to remove trends and achieve stationarity.
3. Smooth data with moving averages
Calculate and interpret simple, weighted, and centred moving averages to reveal underlying trends.
01 · Stationarity
Before we can build a forecasting model, we need to understand whether the data is stable over time. This property is called stationarity, and it is one of the most important properties of data in time series analysis.
What it is
Why it matters
How to identify it
Stationary vs Non-stationary
Section 1 · Stationarity
1.1 What is Stationarity?
Simple definition
A time series is stationary if its mean (average level) and variance (spread/volatility) remain constant over time — the series does not drift up, drift down, or become more erratic as time passes.
Stationary = stable behaviour
- Constant mean (no trend)
- Constant variance (no widening spread)
- No seasonal component
Non-stationary = changing behaviour
- Mean drifts up or down (trend)
- Variance grows over time
- Recurring seasonal spikes
💡 Think of it this way: if you picked two random windows of your data and they look similar (same average, same spread), the series is stationary. If one window looks very different from another, it is non-stationary.
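The two-window comparison can be sketched in a few lines of Python. This is only an informal check, not a statistical test; the two series and the helper name `window_stats` are made up for illustration.

```python
import statistics

def window_stats(series, start, length):
    """Return (mean, standard deviation) of one window of the series."""
    window = series[start:start + length]
    return statistics.mean(window), statistics.stdev(window)

# Hypothetical data: one series hovering around 50, one trending upward.
stable = [50, 52, 49, 51, 50, 48, 51, 50, 49, 52, 50, 51]
trending = [50, 53, 55, 58, 61, 63, 66, 69, 71, 74, 77, 80]

for name, series in [("stable", stable), ("trending", trending)]:
    early_mean, early_sd = window_stats(series, 0, 6)
    late_mean, late_sd = window_stats(series, 6, 6)
    # Similar means and spreads in both windows suggest stationarity.
    print(f"{name}: early mean={early_mean:.2f}, late mean={late_mean:.2f}, "
          f"early sd={early_sd:.2f}, late sd={late_sd:.2f}")
```

For the stable series the two window means are almost identical; for the trending series the later window sits far above the earlier one.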
Three types of non-stationarity
Trend
Mean increases or decreases steadily. Example: total annual revenue for a growing company.
Step change
Mean jumps suddenly at a point in time. Example: website traffic after a viral marketing campaign.
Variance shift
Volatility increases over time. Example: stock prices becoming more volatile in a market crisis.
1.2 Visualising Stationarity
Compare the two series below. Ask yourself: does the average level and spread stay roughly the same throughout?
Series A — Stationary
The series fluctuates around a constant mean (dashed line). The spread does not change. This is stationary.
Series B — Non-stationary (trend)
The mean keeps rising over time. The series is non-stationary. Most forecasting models will not work reliably on this data without transformation.
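Many forecasting models assume stationarity. ARMA requires a stationary series, and ARIMA achieves one by differencing the data as part of the model. If your data is non-stationary, it must be transformed first.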
1.3 Why Stationarity Matters for Business Forecasting
Statistical stability
Forecasting models calculate a mean and variance from the training data, then assume those values will hold in the future. If the mean keeps shifting (trend), any forecast built from past averages is systematically biased.
Reliable prediction intervals
For a stationary series, long-run forecasts converge to the mean. Assuming approximately normal fluctuations, the 68% prediction interval is roughly mean ± 1 standard deviation and the 95% interval is roughly mean ± 2 standard deviations. These are interpretable and useful for business planning.
Modelling the remainder
After decomposing a time series into Trend + Seasonal + Remainder, the Remainder component should behave like a stationary series. Stationarity tests help verify that decomposition has worked correctly.
Business impact example
A retail chain builds a demand forecast on raw monthly sales data (which has a growth trend). Because the model is trained on non-stationary data:
- Early-period averages underestimate current demand
- The model consistently orders too little stock
- Lost sales accumulate — e.g. $200k missed revenue per quarter
Making the series stationary before modelling (for example, by differencing out the trend) removes this bias.
Two conditions for full stationarity
- Stationary in mean — average does not change over time
- Stationary in variance — spread (standard deviation) does not change over time
Both conditions must hold. A series can be stationary in mean but not in variance (e.g. a random walk without drift).
02 · Discrete White Noise & Random Walk
Two fundamental stochastic process models that appear frequently in business data: discrete white noise, which is stationary, and the random walk, which is not. Understanding these builds intuition for what "pure randomness" and "unpredictable drift" look like.
Discrete White Noise
Random Walk
Random Walk with Drift
Naïve Method
Section 2 · Models
2.1 Discrete White Noise (DWN)
What
Pure random fluctuations with no pattern, no trend, no seasonality. Each observation is independent of all others.
Why it matters
DWN is the ideal residual — if your model's errors look like DWN, there is nothing left to predict. The model has captured everything useful.
When you see it
Random operational noise: day-to-day call centre volume variation, minor cash register fluctuations, sensor readings in a stable process.
Y_t = ε_t,  where ε_t ~ iid(0, σ²)
Key properties: mean = 0, variance = σ² (constant), each value is independent. DWN is stationary.
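A minimal simulation sketch of DWN, using only the Python standard library. The definition above only requires iid shocks with mean 0 and variance σ²; drawing them from a normal distribution is an extra assumption made here for convenience, and `simulate_dwn` is a hypothetical helper name.

```python
import random
import statistics

def simulate_dwn(n, sigma=1.0, seed=None):
    """Draw n independent shocks with mean 0 and standard deviation sigma."""
    rng = random.Random(seed)
    # Gaussian shocks are an assumption; any iid(0, sigma^2) draw would qualify.
    return [rng.gauss(0.0, sigma) for _ in range(n)]

noise = simulate_dwn(5000, sigma=1.0, seed=42)

# The sample mean should be close to 0 and the sample variance close to sigma^2.
print("mean:", round(statistics.mean(noise), 3))
print("variance:", round(statistics.pvariance(noise), 3))
```

With a long enough sample, the mean and variance settle near their theoretical values of 0 and σ².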
Business use case — performance monitoring
A website's daily traffic fluctuates randomly between 900 and 1,100 visitors, around a stable mean of about 1,000. If the deviations from that mean match DWN properties, the company knows: there is no underlying problem, no trend, no seasonal effect. Any day-to-day difference is just noise, so do not react to it. If a trend emerges, that is a meaningful signal worth investigating.
DWN simulation — four realisations
Each coloured line is a different DWN series (σ=1). Notice: they all look different but share the same statistical properties.
2.2 Random Walk & Random Walk with Drift
Random Walk (no drift)
Y_t = Y_{t−1} + ε_t
Today's value = yesterday's value + a random shock. Stationary in mean but not stationary in variance: from a fixed starting value, Var(Y_t) = tσ², so the spread keeps growing over time.
Example: a drunk person walking — each step is random, but over time they wander further from the start.
Random Walk with Drift (δ)
Y_t = δ + Y_{t−1} + ε_t
A consistent upward (or downward) drift δ is added to each step. Not stationary — both mean and variance change over time.
- Positive δ → series trends upward (e.g. economic growth)
- Negative δ → series trends downward
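The growing spread of a random walk, and the shifting mean under drift, can be seen by simulating many paths and comparing them at an early and a late time point. This is a sketch under assumptions: Gaussian shocks, a starting value of 0, and illustrative values for the drift and the number of paths.

```python
import random
import statistics

def random_walk(n, drift=0.0, sigma=1.0, rng=None):
    """Simulate Y_t = drift + Y_{t-1} + e_t for n steps, starting from 0."""
    rng = rng or random.Random()
    level, path = 0.0, []
    for _ in range(n):
        level = drift + level + rng.gauss(0.0, sigma)
        path.append(level)
    return path

rng = random.Random(7)
no_drift = [random_walk(100, rng=rng) for _ in range(500)]
with_drift = [random_walk(100, drift=0.5, rng=rng) for _ in range(500)]

# Variance across paths after 10 steps versus after 100 steps.
var_t10 = statistics.pvariance([p[9] for p in no_drift])
var_t100 = statistics.pvariance([p[99] for p in no_drift])
mean_drift_t100 = statistics.mean([p[99] for p in with_drift])

print("variance at t=10:", round(var_t10, 1))
print("variance at t=100:", round(var_t100, 1))
print("mean with drift at t=100:", round(mean_drift_t100, 1))
```

The variance at t=100 comes out roughly ten times the variance at t=10, and the drifting paths average close to 0.5 × 100 = 50.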
Business applications
- Stock prices — RW with drift (drift = expected return)
- GDP, CPI, exchange rates
- Long-run demand forecasting
Random Walk — four realisations
Why this is dangerous for forecasting
Because variance grows without bound, long-run prediction intervals become extremely wide. For a random walk without drift, the Naïve method (predict next = last observation) is the optimal forecast.
Naïve forecast: Ŷ_{T+1} = Y_T
The best forecast for a random walk is simply the last observed value.
Knowledge Checkpoint
✓ Checkpoint 1 — Stationarity & White Noise
Question 1 of 5
A company's monthly revenue has grown steadily from $1M to $5M over 5 years. What does this tell you about the series?
A. It is stationary because the variance looks constant.
B. It is non-stationary because the mean is increasing over time.
C. It is stationary because the series has no seasonal component.
D. Stationarity cannot be determined without running ARIMA first.
Question 2 of 5
Your model's residuals (errors) look exactly like Discrete White Noise. What does this mean?
A. The model is under-fitted and needs more variables.
B. The residuals contain a hidden trend you should remove.
C. The model has captured all predictable patterns, so nothing useful remains.
D. You should apply differencing to the residuals before forecasting.
03 · Differencing
Differencing is the primary tool for converting a non-stationary series into a stationary one. It is the pre-processing step that unlocks models like ARIMA and SARIMA. Understanding it builds the intuition for the "I" (Integrated) component in ARIMA.
What it is
Why we use it
First-order differencing
Second-order differencing
Seasonal differencing
ADF & KPSS tests
Section 3 · Differencing
3.1 What is Differencing — and Why?
What
Instead of looking at the raw value at each time point, you compute how much it changed from one period to the next. You subtract yesterday from today.
Why
A trending series has a moving target — the model does not know if a value is "high" or "low" because the whole scale is shifting. Differencing removes that shift and leaves stable fluctuations.
When
- Visual plot shows a clear upward/downward trend
- ADF test p-value ≥ 0.05 (series is non-stationary)
- ACF plot decays very slowly (does not drop to zero quickly)
First-order: ∇Y_t = Y_t − Y_{t−1}
Four uses of differencing:
① Remove a linear trend
② Remove a stochastic (random) trend
③ Stabilise the mean
④ Prepare data for ARIMA/SARIMA modelling
Step-by-step example
| Month | Sales ($) | Difference ∇Y_t |
|---|---|---|
| Jan | 100 | — |
| Feb | 120 | 120 − 100 = +20 |
| Mar | 140 | 140 − 120 = +20 |
| Apr | 160 | 160 − 140 = +20 |
| May | 180 | 180 − 160 = +20 |
What happened?
The raw series has a clear upward trend. After differencing, each value is +20 — perfectly stable. The trend has been removed. The differenced series is stationary.
In practice, differences will not be perfectly equal — they will fluctuate around a stable mean, which is what we want.
Seasonal differencing: ∇_p Y_t = Y_t − Y_{t−p}
p = seasonal period (e.g. p=12 for monthly data with annual seasonality)
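Both kinds of differencing come down to one subtraction with a chosen lag. The sketch below reuses the monthly sales figures from the step-by-step example; the quarterly series and the helper name `difference` are hypothetical.

```python
def difference(series, lag=1):
    """Return Y_t - Y_{t-lag} for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# First-order differencing of the monthly sales example (Jan to May).
sales = [100, 120, 140, 160, 180]
print(difference(sales))  # [20, 20, 20, 20]

# Seasonal differencing with p = 4 on made-up quarterly data:
# each quarter is compared with the same quarter one year earlier.
quarterly = [10, 20, 30, 40, 12, 22, 32, 42]
print(difference(quarterly, lag=4))  # [2, 2, 2, 2]
```

Note that the differenced series is shorter than the original by `lag` observations, which matches the blank first row in the table above.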
3.2 Before & After Differencing
The chart below shows a non-stationary series (upward trend) and its first-difference. Notice how differencing removes the trend completely.
Original series — non-stationary (trending up)
Mean is not constant — series drifts upward. Cannot be modelled directly.
After first-order differencing — stationary
Fluctuates around a stable mean. Trend has been removed. Ready to model.
First-order differencing (d=1)
Subtracts consecutive observations. Removes linear trends. Sufficient for most business time series.
Second-order differencing (d=2)
Differences the already-differenced series. Use when the series has a quadratic (accelerating) trend and first-order is not enough.
∇²Y_t = Y_t − 2Y_{t−1} + Y_{t−2}
Caution: over-differencing
Applying too many differences can introduce artificial structure, such as spurious negative autocorrelation. Use the minimum number of differences needed to achieve stationarity. You rarely need d > 2.
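As a quick check of the second-order formula, the sketch below differences a made-up quadratic series twice and compares the result with the direct formula Y_t − 2Y_{t−1} + Y_{t−2}. The helper name `difference` is hypothetical.

```python
def difference(series, lag=1):
    """Return Y_t - Y_{t-lag} for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# A purely quadratic (accelerating) trend: 0, 1, 4, 9, ...
quadratic = [t * t for t in range(8)]

first = difference(quadratic)   # still trending: 1, 3, 5, ...
second = difference(first)      # constant once differenced twice

# The same result from the one-step formula.
direct = [quadratic[t] - 2 * quadratic[t - 1] + quadratic[t - 2]
          for t in range(2, len(quadratic))]

print(first)
print(second)
print(direct == second)
```

One difference leaves a linear trend behind; the second difference removes it, and the two ways of computing ∇²Y_t agree.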
3.3 When to Difference — Formal Tests
1. Visual inspection (always start here)
Plot your time series. If you see a clear upward or downward trend, or a widening spread, differencing is likely needed. This is the fastest and most intuitive check.
2. Augmented Dickey-Fuller (ADF) Test
The most common formal test for non-stationarity.
H₀: Unit root present → series is non-stationary
H₁: No unit root → series is stationary
- p-value < 0.05 → reject H₀ → series is stationary (no need to difference)
- p-value ≥ 0.05 → fail to reject H₀ → series is non-stationary (difference it)
3. KPSS Test (reverse hypotheses)
H₀: Series is stationary
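H₁: Series has a unit root → series is non-stationary
- p-value < 0.05 → reject H₀ → series is non-stationary (difference it)
- p-value ≥ 0.05 → fail to reject H₀ → no evidence against stationarity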
Use ADF and KPSS together to confirm results — they complement each other because their null hypotheses are opposite.
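One way to read the two tests together is sketched below. The function only interprets p-values that have already been computed elsewhere (for example by a statistics package); the function name, the wording of the four outcomes, and the example p-values are illustrative assumptions rather than a standard API.

```python
def interpret_unit_root_tests(adf_p, kpss_p, alpha=0.05):
    """Combine ADF and KPSS p-values into a plain-language reading."""
    adf_rejects = adf_p < alpha    # evidence against a unit root
    kpss_rejects = kpss_p < alpha  # evidence against stationarity
    if adf_rejects and not kpss_rejects:
        return "stationary"
    if not adf_rejects and kpss_rejects:
        return "non-stationary: difference and re-test"
    if adf_rejects and kpss_rejects:
        return "tests conflict: inspect the series further"
    return "inconclusive: neither test rejects its null"

# Hypothetical p-values for illustration.
print(interpret_unit_root_tests(adf_p=0.01, kpss_p=0.10))
print(interpret_unit_root_tests(adf_p=0.42, kpss_p=0.01))
```

The two tests agree in the first two cases. When they disagree, or when neither rejects, go back to the plot and the ACF before deciding.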
4. Autocorrelation Function (ACF)
If the ACF plot shows autocorrelations that decay slowly (remain high even at large lags), this suggests non-stationarity. After differencing, the ACF should drop to near zero quickly.
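The ACF behaviour described here can be reproduced with a small sample-autocorrelation function. This sketch simulates a random walk (Gaussian shocks and the seed are assumptions) and compares its lag-1 autocorrelation with that of its first difference; `acf` is a hand-written helper, not a library call.

```python
import random

def acf(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((y - mean) ** 2 for y in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

# Simulate a random walk: a classic non-stationary series.
rng = random.Random(1)
walk, level = [], 0.0
for _ in range(1000):
    level += rng.gauss(0.0, 1.0)
    walk.append(level)

# Its first difference is just the sequence of random shocks.
diffed = [walk[t] - walk[t - 1] for t in range(1, len(walk))]

print("lag-1 ACF, original:", round(acf(walk, 1), 3))
print("lag-1 ACF, differenced:", round(acf(diffed, 1), 3))
```

The original series shows a lag-1 autocorrelation close to 1, while the differenced series shows one close to 0.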
Decision process
1. Plot the series: does it trend?
2. Run the ADF test: is the p-value ≥ 0.05?
3. If yes, apply first-order differencing
4. Re-test the differenced series and repeat if needed
5. Stop when the ADF p-value < 0.05 (series is stationary)
Knowledge Checkpoint
✓ Checkpoint 2 — Differencing & Unit Root Tests
Question 3 of 5
You run an ADF test on monthly sales data and get a p-value of 0.42. What should you do next?
A. The series is stationary, so proceed to model it directly.
B. The series is non-stationary, so apply first-order differencing, then re-test.
C. Apply second-order differencing immediately, since p > 0.05.
D. The test is inconclusive, so use KPSS only from now on.
Question 4 of 5
After applying first-order differencing to a time series, the ADF test p-value drops to 0.01. What does this tell you?
A. First-order differencing failed, so you need to apply second-order differencing.
B. The original series was already stationary before differencing.
C. First-order differencing worked, and the series is now stationary (d=1).
D. A p-value of 0.01 means the test was not significant, so no conclusion can be drawn.
04 · Moving Averages & Smoothing Techniques
Moving averages filter out short-term noise to reveal the underlying trend. They are one of the most widely used tools in business analytics, appearing in stock trading dashboards, sales performance reports, and operational monitoring.
Simple Moving Average
Weighted Moving Average
Centred Moving Average
Window size trade-off
Section 4 · Moving Averages
4.1 What is Smoothing and Why Do We Need It?
The problem: noisy data
Raw business data is almost always noisy. A retailer's weekly sales jump up and down due to promotions, weather, public holidays, or simple randomness. This noise masks the underlying trend — the long-run direction that actually matters for planning.
What
Replace each data point with the average of nearby points. Short-term spikes cancel out, revealing the smooth trend.
Why
Distinguish genuine trends from random noise. Avoid over-reacting to a single unusual week.
How
Choose a window size k (number of periods to average). Larger k = smoother, but more lag behind recent changes.
Business rule: do not make a strategic decision based on one data point. Use moving averages to confirm the trend before acting.
Noise vs signal — the core challenge
Grey = raw noisy sales data. Red = 4-week moving average. The trend is only clear after smoothing.
Key trade-off: a larger window removes more noise but reacts more slowly to genuine changes in the trend. A smaller window is more responsive but lets more noise through.
4.2 Simple Moving Average (SMA)
How it works
The SMA at time t uses the k most recent observations, each given equal weight. The "window" slides forward one period at a time.
Ŷ_{t+1} = (Y_t + Y_{t−1} + Y_{t−2} + … + Y_{t−k+1}) / k
where k = number of periods in the moving average (window size)
Worked example — 3-period SMA
| Quarter | Demand ($M) | 3-period SMA |
|---|---|---|
| Q1 | 4.71 | — |
| Q2 | 4.75 | — |
| Q3 | 4.63 | (4.71+4.75+4.63)/3 = 4.70 |
| Q4 | 4.74 | (4.75+4.63+4.74)/3 = 4.71 |
| Q5 | 4.19 | (4.63+4.74+4.19)/3 = 4.52 |
| Q6 (forecast) | — | 4.52 |
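The worked example can be reproduced with a short function. This is a minimal sketch of a trailing SMA; the function name `simple_moving_average` is chosen for illustration.

```python
def simple_moving_average(series, k):
    """Trailing k-period averages; the first value uses observations 1..k."""
    return [sum(series[t - k + 1:t + 1]) / k
            for t in range(k - 1, len(series))]

# Quarterly demand in $M from the table above (Q1 to Q5).
demand = [4.71, 4.75, 4.63, 4.74, 4.19]

sma = simple_moving_average(demand, 3)
print([round(value, 2) for value in sma])   # values for Q3, Q4, Q5

# The most recent SMA value serves as the one-step-ahead forecast for Q6.
print("Q6 forecast:", round(sma[-1], 2))
```

The result has k − 1 fewer values than the input, which is why Q1 and Q2 have no SMA in the table.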
When to use SMA: no apparent trend; seasonal data (set k = seasonal period). Use as a benchmark before trying more complex models.
Important notes
- The first (k−1) periods have no SMA value — not enough previous observations yet
- The SMA is centred at time t − (k−1)/2, not at time t — it lags behind the present
- If data has seasonality, set k to the seasonal period (e.g. k=4 for quarterly, k=12 for monthly)
- SMA is better for smoothing and exploration than for multi-step forecasting
Weighted Moving Average (WMA)
A variant that gives more weight to recent observations and less weight to older ones. More responsive to recent changes than plain SMA.
Example: for k=3, you might assign weights of 0.5, 0.3, 0.2 (most recent gets 0.5). The weights must sum to 1.
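A sketch of the weighted version, applied to the last three quarters of the demand series from the SMA worked example with the weights 0.5, 0.3, 0.2. The function name and the convention that the first weight belongs to the most recent observation are assumptions made for this example.

```python
def weighted_moving_average(series, weights):
    """Weighted average of the latest observations; weights[0] is the most recent."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if len(series) < len(weights):
        raise ValueError("not enough observations for these weights")
    k = len(weights)
    recent_first = series[-k:][::-1]  # newest observation first
    return sum(w * y for w, y in zip(weights, recent_first))

demand = [4.71, 4.75, 4.63, 4.74, 4.19]
wma = weighted_moving_average(demand, [0.5, 0.3, 0.2])

# 0.5 * 4.19 + 0.3 * 4.74 + 0.2 * 4.63
print(round(wma, 3))
```

Because the most recent quarter (4.19) carries half the weight, the WMA of about 4.44 sits below the equal-weight SMA of 4.52 and reacts faster to the recent drop.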
Limitation of SMA for forecasting
SMA lags behind genuine trend changes. If sales are rising, the SMA will consistently underestimate the current level. For long-range forecasting, exponential smoothing (Week 4) handles trend much better.
4.3 Centred Moving Average (CMA)
The problem with even-period smoothing
For seasonal data with an even period (e.g. k=4 quarters, k=12 months), the plain moving average falls between two time points — not at any actual observation. This creates a misalignment that makes it impossible to estimate seasonal effects accurately.
The solution: centring
Take the average of the k-period MA going back from t and the k-period MA going back from t−1. This centres the average exactly at time t − k/2.
(0.5·Y_t + Y_{t−1} + … + Y_{t−k+1} + 0.5·Y_{t−k}) / k
Note: the first and last observations in the window get half weight (0.5). All others get full weight.
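The centring formula can be sketched as follows for an even window k. The quarterly series is made up: a flat level with a repeating seasonal pattern, so the centred average should return the same level everywhere. The function name and the choice to return (centre index, value) pairs are illustrative.

```python
def centred_moving_average(series, k):
    """Centred MA for even k; returns (centre_index, value) pairs."""
    result = []
    for t in range(k, len(series)):
        window = series[t - k:t + 1]  # k + 1 points: Y_{t-k} ... Y_t
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        result.append((t - k // 2, total / k))  # centred at t - k/2
    return result

# Hypothetical quarterly data: same seasonal pattern every year, no trend.
quarterly = [10, 20, 30, 40] * 3

for centre, value in centred_moving_average(quarterly, 4):
    print(centre, value)
```

Every centred value equals 25, the average of one full year: the half weights at the two ends ensure each quarter contributes exactly once, so the seasonal pattern cancels out.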
Why CMA matters
CMA is the correct smoothing method used inside classical time series decomposition (Trend + Seasonal + Remainder). It ensures the trend estimate is aligned with the data, allowing seasonal factors to be estimated accurately.
SMA vs WMA vs CMA — quick comparison
| Method | Equal weights? | Best for |
|---|---|---|
| SMA | Yes | Quick trend smoothing, no seasonality |
| WMA | No (recent = more) | Series where recent data matters more |
| CMA | Approx. equal | Seasonal decomposition (even periods) |
Limitations of all moving averages
- No values at the start of the series for a trailing average (the first k−1 periods), and none at either end for a centred average
- Always lags behind the actual current level
- Poor for multi-step ahead forecasting — use exponential smoothing or ARIMA instead
- Good for data exploration, imputation, and trend extraction
Moving averages are exploratory tools — they help you understand the data before building a formal forecasting model.
4.4 Window Size — Noise vs Lag Trade-off
Adjust the window size and observe how the smoothed line changes. A larger window removes more noise but creates a longer lag behind the actual data.
Small window (k=3)
Closely follows the raw data. Removes some noise, but still shows many short-term fluctuations. Reacts quickly to genuine changes. Use for short-term operational monitoring.
Medium window (k=7)
Balanced smoothing. Removes most week-to-week noise while still tracking medium-term trends. A good default starting point for most business series.
Large window (k=14)
Very smooth — longer-run trend is very clear. However, significant lag — the smoothed line does not yet reflect very recent changes. Use for strategic trend analysis.
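The lag side of the trade-off can be made concrete with a made-up series that jumps from 0 to 10 at a known point. The sketch counts how many periods pass after the jump before a trailing SMA fully reflects the new level; the series and the helper names are hypothetical.

```python
def simple_moving_average(series, k):
    """Trailing k-period averages; element i ends at time index i + k - 1."""
    return [sum(series[t - k + 1:t + 1]) / k
            for t in range(k - 1, len(series))]

def periods_to_catch_up(k, step_at=20, length=40, new_level=10.0):
    """Periods after a step change until the trailing SMA reaches the new level."""
    series = [0.0] * step_at + [new_level] * (length - step_at)
    sma = simple_moving_average(series, k)
    for i, value in enumerate(sma):
        if value >= new_level:
            return (i + k - 1) - step_at
    return None

for k in (3, 7, 14):
    print(f"k={k}: catches up after {periods_to_catch_up(k)} periods")
```

For this step change the delay is k − 1 periods: 2 for k=3, 6 for k=7, and 13 for k=14.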
4.5 The Naïve Forecasting Method
What it is
The simplest possible forecasting method: the forecast for next period equals the most recent observed value. Nothing else is considered.
Ŷ_{T+1} = Y_T
T = current period. The forecast for T+1 is simply whatever happened at T.
Why it works (when it does)
When data follows a random walk, the naïve method is mathematically optimal. There is no exploitable pattern — the best guess for tomorrow is what happened today.
When to use it
- As a benchmark baseline — any more complex model should outperform naïve, or it is not worth using
- When data shows random walk behaviour (e.g. stock prices)
- Very short-term operational decisions (next hour, next day)
Naïve as benchmark — business rule
Before presenting a forecasting model to stakeholders, always compare it to the naïve method. If your ARIMA or Prophet model can't beat "just repeat last period's value", question whether the added complexity is justified.
Example: A logistics company forecasts weekly parcel volumes using a naïve model (RMSE = 850). Their new ML model gives RMSE = 420. The ML model roughly halves the error (about 51% lower RMSE), which is likely worth the investment. If ML RMSE = 840, the improvement is only about 1% and the complexity is not justified.
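The benchmark comparison comes down to two small calculations: RMSE itself, and the relative reduction in RMSE against the naïve baseline. The sketch below uses the RMSE figures from the example; the function names and the toy actual/forecast values are illustrative.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def relative_improvement(baseline_rmse, model_rmse):
    """Share of the baseline error removed by the model (0.5 = error halved)."""
    return (baseline_rmse - model_rmse) / baseline_rmse

# Toy check of the RMSE function: errors of 3 and 4.
print(round(rmse([0, 0], [3, 4]), 3))

# Figures from the logistics example above.
print(f"{relative_improvement(850, 420):.1%}")  # large gain
print(f"{relative_improvement(850, 840):.1%}")  # negligible gain
```

A value near zero means the model barely beats "repeat last period's value".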
Seasonal naïve variant
Instead of using the last period, use the observation from the same period last year (or last season). Useful for highly seasonal data like retail (Christmas week = last Christmas week).
Ŷ_{T+h} = Y_{T+h−m}
m = seasonal period. h = forecast horizon, with h ≤ m; for longer horizons, the most recent full season is reused.
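Both benchmark methods fit in a few lines. This is a minimal sketch on made-up quarterly data (m = 4); the function names are hypothetical, and for horizons beyond one season the seasonal version simply cycles through the most recent season again.

```python
def naive_forecast(history, horizon=1):
    """Repeat the last observed value for every future period."""
    return [history[-1]] * horizon

def seasonal_naive_forecast(history, m, horizon):
    """Forecast T+h with the latest observation from the same season."""
    if len(history) < m:
        raise ValueError("need at least one full season of history")
    last_season_start = len(history) - m
    return [history[last_season_start + (h - 1) % m]
            for h in range(1, horizon + 1)]

# Hypothetical quarterly sales for two years.
history = [10, 20, 30, 40, 12, 22, 32, 42]

print(naive_forecast(history, horizon=2))                # [42, 42]
print(seasonal_naive_forecast(history, m=4, horizon=6))  # [12, 22, 32, 42, 12, 22]
```

On seasonal data like this, the plain naïve forecast carries the peak-quarter value into every future quarter, while the seasonal variant reproduces last year's pattern.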
Knowledge Checkpoint
✓ Checkpoint 3 — Moving Averages
Question 5 of 5
A retail analyst uses a 30-day SMA to monitor daily sales. She notices the smoothed line is below the raw data for the past two weeks. What is the most likely explanation?
A. The SMA window is too small and should be reduced to 7 days.
B. Sales have been increasing recently, and the SMA lags behind because it still reflects older, lower values.
C. The analyst should switch to a CMA to fix the lag issue.
D. The raw data has an error, and negative sales values are pulling the SMA down.
Key insight
This is the fundamental lag problem of trailing moving averages. A large window gives a smoother line but always represents the past average, not the present level. For real-time business decisions, the lag matters.
One practical solution: use a shorter window for operations (react faster) and a longer window for strategy (see the bigger trend). Many analysts plot both simultaneously.
Lesson 3 · Summary
Summary — What We Covered
01 · Stationarity
- Constant mean and variance over time
- Required by ARIMA, SARIMA, and related models
- Three types of non-stationarity: trend, step change, variance shift
- Check visually, then formally with ADF/KPSS
02 · DWN & Random Walk
- DWN: pure noise, no pattern, ideal model residual
- Random Walk: today = yesterday + random shock (non-stationary in variance)
- Random Walk with Drift: adds a consistent upward/downward component
- Naïve method is optimal for random walk data
03 · Differencing
- ∇Y_t = Y_t − Y_{t−1} removes linear trends
- Second-order removes quadratic trends
- Seasonal differencing removes seasonal patterns
- ADF test: p ≥ 0.05 → difference; p < 0.05 → stationary
04 · Moving Averages
- SMA: equal weight to k most recent periods
- WMA: more weight to recent observations
- CMA: use for even-period seasonal decomposition
- Larger window → smoother but more lag
- Best for exploration, not long-range forecasting
05 · Key Business Rules
- Always check stationarity before modelling
- Use minimum differencing needed — don't over-difference
- Naïve forecast = essential benchmark for any model
- Moving averages: distinguish noise from signal before reacting
- Residuals should look like DWN — if not, the model is incomplete
06 · Coming Up Next
- Week 4: Exponential Smoothing — a more sophisticated smoothing method that handles trend and seasonality
- Holt-Winters model
- Performance metrics: MAE, RMSE, MAPE
- Business-oriented model evaluation
Understanding stationarity and differencing is the foundation for ARIMA (Week 7) — the d in ARIMA(p,d,q) is the order of differencing.