Understanding Temporal Fusion Transformers

Advanced Time Series Forecasting with Interpretable AI

Interactive visualization for DATA4800 & DATA5000

What is a Temporal Fusion Transformer?

Business Problem

How can we accurately forecast time series with multiple inputs while understanding which factors drive predictions?

Temporal Fusion Transformer (TFT) is a deep learning architecture specifically designed for multi-horizon time series forecasting with built-in interpretability.

Traditional Time Series Methods

  • ARIMA, Prophet, etc.
  • Limited to univariate or simple multivariate
  • Difficult to incorporate static features
  • Limited insight into which inputs drive predictions
  • Fixed forecast horizons

TFT Advantages

  • Handles complex multivariate inputs
  • Incorporates static & time-varying features
  • Built-in variable importance
  • Interpretable attention patterns
  • Multi-horizon forecasting

Key TFT Innovations

  • Variable Selection Networks: Automatically selects relevant features
  • Gated Residual Networks: Efficient information flow with skip connections
  • Temporal Self-Attention: Captures long-range temporal dependencies
  • Interpretable Multi-Head Attention: Shows what the model focuses on
  • Quantile Outputs: Provides prediction intervals, not just point forecasts
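
To make the quantile outputs concrete, here is a minimal sketch of the quantile (pinball) loss that quantile forecasts are typically trained with. The sales figures and the three quantile levels are hypothetical values chosen for illustration.

```python
def quantile_loss(y_true, y_pred, q):
    """Pinball loss for a single quantile level q in (0, 1)."""
    error = y_true - y_pred
    # Under-prediction (error > 0) is weighted by q, over-prediction by (1 - q).
    return max(q * error, (q - 1) * error)


# Hypothetical numbers: actual sales of 100 units and three quantile forecasts.
actual = 100.0
forecasts = {0.1: 80.0, 0.5: 95.0, 0.9: 120.0}

losses = {q: quantile_loss(actual, pred, q) for q, pred in forecasts.items()}
for q, loss in losses.items():
    print(f"q={q}: loss={loss:.2f}")
```

Because the penalty is asymmetric, the 0.1 forecast is pushed low and the 0.9 forecast is pushed high, and the gap between them serves as a prediction interval.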

TFT Architecture

The TFT processes static features, past time series, and known future inputs to generate multi-step forecasts.

Key Components Explained

1. Variable Selection Networks

TFT uses a separate variable selection network for each input type:

Static Features

Context that doesn't change over time

  • Store ID
  • Product category
  • Geographic location

Time-Varying Known

Future values we know in advance

  • Day of week
  • Holidays
  • Promotional events

Time-Varying Unknown

Past observed values only

  • Historical sales
  • Past prices
  • Previous demand
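
The core idea of variable selection can be sketched as a softmax over per-variable relevance scores, followed by a weighted average of the variable embeddings. This is a simplified illustration: in TFT the scores come from a learned network and each embedding is transformed before weighting, whereas the scores, embeddings, and variable names below are hypothetical constants.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical relevance scores; in TFT these are produced by a learned network.
known_future_inputs = ["day_of_week", "holiday", "promotion"]
scores = [0.2, 1.5, 2.1]
weights = softmax(scores)

# Hypothetical 2-dimensional embeddings, one per variable.
embeddings = [[0.5, -0.1], [1.0, 0.3], [0.2, 0.8]]

# The selected representation is the weight-averaged embedding.
combined = [
    sum(w * emb[d] for w, emb in zip(weights, embeddings))
    for d in range(2)
]

for name, w in zip(known_future_inputs, weights):
    print(f"{name}: {w:.3f}")
```

The weights themselves are what make the model interpretable: averaged over a dataset, they give a variable importance ranking for each input type.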

2. Gated Residual Network (GRN)

GRN Formula

GRN(a, c) = LayerNorm(a + GLU(η₁))
where:
  η₁ = W₁ η₂ + b₁
  η₂ = ELU(W₂ a + W₃ c + b₂)
  GLU(γ) = σ(W₄ γ + b₄) ⊙ (W₅ γ + b₅)   (Gated Linear Unit)
  a = primary input, c = optional context vector
  σ = sigmoid, ⊙ = element-wise product

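This gating lets the model adaptively suppress or enhance features based on context: when the gate closes, the GLU term goes to zero and the block reduces to the skip connection followed by layer normalization.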
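
A minimal NumPy sketch of one GRN forward pass may help here. It assumes the usual GLU definition (a sigmoid gate multiplied by a linear projection) and omits the learned scale and shift of layer normalization. The parameter names W1 to W5, the hidden size, and the random weights are labels and values chosen for this sketch only.

```python
import numpy as np


def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer_norm(x, eps=1e-5):
    # Normalization only; the learned scale and shift are omitted for brevity.
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def grn(a, c, p):
    """Gated Residual Network for one input vector a and one context vector c."""
    eta2 = elu(p["W2"] @ a + p["W3"] @ c + p["b2"])
    eta1 = p["W1"] @ eta2 + p["b1"]
    gate = sigmoid(p["W4"] @ eta1 + p["b4"])   # values in (0, 1)
    glu = gate * (p["W5"] @ eta1 + p["b5"])    # gated candidate update
    return layer_norm(a + glu)                 # skip connection, then normalize


rng = np.random.default_rng(0)
d = 4  # hypothetical hidden size
params = {f"W{i}": rng.normal(scale=0.5, size=(d, d)) for i in range(1, 6)}
params.update({f"b{i}": np.zeros(d) for i in (1, 2, 4, 5)})

a = rng.normal(size=d)
c = rng.normal(size=d)
out = grn(a, c, params)

# With a strongly negative gate bias the GLU output vanishes,
# so the block reduces to layer_norm(a): the nonlinear path is suppressed.
closed = dict(params, b4=np.full(d, -50.0))
out_closed = grn(a, c, closed)
```

Comparing `out_closed` with `layer_norm(a)` shows the suppression case directly: the two agree to numerical precision, while `out` differs because the gate is partly open.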

3. Temporal Attention

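Multi-head attention lets each head learn a different temporal pattern, for example recent trends or weekly seasonality. In TFT the heads share one set of value weights, so their attention weights can be averaged into a single pattern that shows which past time steps the model focuses on.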
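
The sketch below illustrates one way to make multi-head attention interpretable: each head has its own query and key weights, the value projection is shared, and the per-head attention weights are averaged into one matrix. It is a simplified NumPy illustration, not a full TFT layer; the shapes, random weights, and the causal mask are assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def interpretable_attention(x, Wq, Wk, Wv):
    """x: (T, d). Wq, Wk: (H, d, d_k) per-head weights. Wv: (d, d_v), shared."""
    T = x.shape[0]
    d_k = Wq.shape[-1]
    values = x @ Wv                                  # one value projection for all heads
    causal = np.tril(np.ones((T, T), dtype=bool))    # a step cannot attend to later steps
    head_weights = []
    for q_w, k_w in zip(Wq, Wk):
        scores = (x @ q_w) @ (x @ k_w).T / np.sqrt(d_k)
        scores = np.where(causal, scores, -np.inf)
        head_weights.append(softmax(scores))
    mean_weights = np.mean(head_weights, axis=0)     # (T, T): a single pattern to inspect
    return mean_weights @ values, mean_weights


rng = np.random.default_rng(1)
T, d, d_k, d_v, H = 5, 8, 4, 4, 2  # hypothetical sizes
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(H, d, d_k))
Wk = rng.normal(size=(H, d, d_k))
Wv = rng.normal(size=(d, d_v))

output, attn = interpretable_attention(x, Wq, Wk, Wv)
print(np.round(attn, 2))  # row t shows how step t weights each earlier step
```

Each row of `attn` sums to 1, so a row can be read as the share of attention a time step gives to each earlier step.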

Real Example: Retail Sales Forecasting

Scenario: Forecast next 7 days of product sales

Input: 30 days of historical data + store features + calendar events

Application: Inventory management and staff scheduling
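
The scenario above implies a sliding-window setup: 30 days of history as the encoder input and the following 7 days as the forecast target. Here is a minimal sketch of building those windows from one daily sales series. The synthetic series and the function name `make_windows` are hypothetical, and store features and calendar events are left out to keep the example short.

```python
ENCODER_LENGTH = 30  # days of history the model sees
HORIZON = 7          # days to forecast


def make_windows(sales, encoder_length=ENCODER_LENGTH, horizon=HORIZON):
    """Slide over a daily series and return (history, target) pairs."""
    windows = []
    last_start = len(sales) - encoder_length - horizon
    for start in range(last_start + 1):
        split = start + encoder_length
        windows.append((sales[start:split], sales[split:split + horizon]))
    return windows


# Hypothetical series: 60 days of synthetic sales with a trend and a weekly bump.
sales = [100 + 10 * (day % 7 == 5) + day for day in range(60)]
windows = make_windows(sales)
history, target = windows[0]
print(len(windows), len(history), len(target))
```

Each pair is one training example: the model reads `history` and is scored on how well its quantile forecasts cover `target`.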