Assessment 3: Dataset Selection Guide

Finding Your Data & Formulating Business Problems
DATA4400: Data-Driven Forecasting

Assessment 3: Quick Overview

Weight: 40% of final grade

Format: Individual forecasting project

What You Need to Do:

  1. Choose a dataset (Plan B or Plan C)
  2. Formulate a clear business problem
  3. Analyze data patterns
  4. Select and justify appropriate forecasting method(s)
  5. Build forecasts and calculate business impact in dollars
  6. Provide recommendations with specific timelines

Today's Focus: Steps 1 & 2 - Finding data and formulating your problem

Your Two Options

Plan B: Business Data

Source: Online business datasets

Examples:

  • Retail sales
  • Website traffic
  • Supply chain data
  • Customer metrics
  • Energy consumption
  • Tourism/hospitality

Good for: Operational forecasting, marketing, supply chain problems

Plan C: Financial Data

Source: Stock markets, crypto, indices

Examples:

  • Stock prices (ASX, NYSE)
  • Cryptocurrency
  • Market indices
  • Exchange rates
  • Commodity prices
  • Economic indicators

Good for: Investment decisions, portfolio management, risk analysis

Important: Choose ONE plan. Don't mix business and financial data unless you have a clear multivariate relationship to explore (e.g., oil prices affecting airline stocks).

Plan B: Where to Find Business Data

Source | URL | What You'll Find
Kaggle | kaggle.com/datasets | Retail sales, web analytics, supply chain, customer data. Search "time series".
UCI Repository | archive.ics.uci.edu/ml | Classic datasets: bike sharing, energy, sales, traffic
Google Dataset Search | datasetsearch.research.google.com | Search engine for datasets across all domains
data.gov.au | data.gov.au | Australian government data: tourism, energy, transport, health
Australian Bureau of Statistics | abs.gov.au | Retail trade, employment, building approvals, economic indicators
data.world | data.world | Business metrics, social data, economic data (requires free account)

Pro Tip: On Kaggle, search terms like "retail time series", "sales forecasting", "demand prediction", "website traffic" to find relevant datasets quickly.

Plan C: Where to Find Financial Data

Source | URL | What You'll Find
Yahoo Finance | finance.yahoo.com | Stock prices, indices, forex. Download CSV directly. Global markets.
ASX Data | asx.com.au | Australian stocks, indices (ASX200, All Ords). Historical prices.
FRED (Federal Reserve) | fred.stlouisfed.org | Economic indicators, interest rates, GDP, inflation, unemployment (US & global)
CoinMarketCap | coinmarketcap.com | Cryptocurrency prices, market cap, volume. Download historical data.
Investing.com | investing.com | Stocks, commodities, forex, indices. Historical data download available.
Quandl / Nasdaq Data Link | data.nasdaq.com | Financial & economic data. Free tier available (requires account).

Pro Tip: For Yahoo Finance, search any stock ticker (e.g., CBA.AX for CommBank), go to "Historical Data", select your date range, and click "Download" for CSV.
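
Once the CSV is downloaded, a few lines of Python can confirm that it loaded cleanly before you go further. The sketch below is a minimal example, assuming the file has `Date` and `Close` columns with dates in YYYY-MM-DD format (check your own file's header row). The function name `load_close_prices` and the sample rows are made up for illustration; the prices are not real.

```python
import csv
import io
from datetime import datetime

def load_close_prices(csv_file):
    """Read (date, close) pairs from a price CSV with Date and Close columns."""
    rows = []
    for row in csv.DictReader(csv_file):
        try:
            close = float(row["Close"])
        except (TypeError, ValueError):
            close = None  # keep the gap visible instead of dropping the row
        rows.append((datetime.strptime(row["Date"], "%Y-%m-%d").date(), close))
    return rows

# Made-up rows standing in for a downloaded file (assumed column layout).
sample = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2024-01-02,100.0,101.5,99.5,101.0,101.0,1200000\n"
    "2024-01-03,101.0,102.0,100.0,null,null,0\n"
    "2024-01-04,101.2,103.0,101.0,102.5,102.5,1500000\n"
)
prices = load_close_prices(sample)
missing = sum(close is None for _, close in prices)
print(f"{len(prices)} rows, {missing} missing close values")
```

For a real download, open the file with `open(path, newline="")` and pass the file object to the same function.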

What Makes a Good Dataset?

Minimum Requirements:

Criterion | Requirement | Why It Matters
Time Points | ≥ 60 observations (monthly); ≥ 100 (weekly/daily) | Need enough data to identify patterns and validate forecasts
Frequency | Regular intervals (daily, weekly, monthly, quarterly) | Time series methods require consistent spacing
Completeness | < 10% missing values | Large gaps break forecasting models
Recency | Includes recent data (2020+) | Old data alone (pre-2015) may not be relevant for current decisions
Numeric Target | At least one continuous variable to forecast | Sales, price, volume, count: the target must be measurable

Red Flags - Avoid These Datasets:
  • Only 20-30 data points (too short)
  • Irregular timing (inconsistent gaps between observations)
  • 50%+ missing values
  • Purely categorical data (no numeric forecasting target)

From Data to Business Problem

You don't just "forecast the data." You solve a business problem using forecasting.

The 3-Step Framework:

1. WHO is the stakeholder?

→ CEO? Marketing Manager? Investor? Supply Chain Director?

2. WHAT decision do they need to make?

→ Set budgets? Adjust inventory? Buy/sell stock? Hire staff?

3. WHY does the forecast matter?

→ What's the cost of being wrong? What's the financial impact?

Remember: Your A3 must show dollar-based ROI and business impact, not just statistical accuracy. The business problem drives everything.

Example 1: Plan B (Business Data)

Dataset: Retail Store Sales (Monthly, 2018-2023)

Variables: Monthly revenue, customer count, marketing spend, competitor openings

❌ Weak Problem Formulation:

"Forecast retail sales for next 12 months."

Why weak? No stakeholder, no decision, no business context, no cost structure.

✅ Strong Problem Formulation:

Stakeholder: Regional Manager planning 2024 operations

Decision: Set monthly inventory budgets and staffing levels for Q1-Q2 2024

Cost Structure: Stockouts cost 3x more than overstocking (lost sales vs. holding costs)

Business Impact: Each 10% forecast error costs ~$50k in Q1 through inefficient inventory

Deliverable: Monthly sales forecast + recommended inventory levels + staffing plan with dollar impacts
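
The 3x cost structure in this example can be turned into a dollar figure with a simple asymmetric cost function. The sketch below is illustrative only: the function name `forecast_error_cost`, the $5 per-unit overstock cost, and the demand numbers are hypothetical. Only the 3x stockout multiplier comes from the example above.

```python
def forecast_error_cost(actual, forecast, over_cost_per_unit, under_multiplier=3.0):
    """Dollar cost of forecast errors when under-forecasting is more expensive.

    Over-forecast (forecast > actual): excess stock at over_cost_per_unit each.
    Under-forecast (forecast < actual): stockout at under_multiplier times that.
    """
    total = 0.0
    for a, f in zip(actual, forecast):
        error = f - a
        if error >= 0:
            total += error * over_cost_per_unit
        else:
            total += -error * over_cost_per_unit * under_multiplier
    return total

# Hypothetical monthly unit demand and two forecasts with equal absolute errors.
actual = [1000, 1200, 1100]
too_high = [1100, 1300, 1200]  # over by 100 units each month
too_low = [900, 1100, 1000]    # under by 100 units each month

cost_high = forecast_error_cost(actual, too_high, over_cost_per_unit=5.0)
cost_low = forecast_error_cost(actual, too_low, over_cost_per_unit=5.0)
print(f"Over-forecast cost:  ${cost_high:,.0f}")
print(f"Under-forecast cost: ${cost_low:,.0f}")
```

Both forecasts have the same MAE, yet the under-forecast costs three times as much, which is why statistical accuracy alone does not answer the business question.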

Example 2: Plan C (Financial Data)

Dataset: ASX Bank Stock (CBA.AX Daily Prices, 2020-2024)

Variables: Daily close price, volume, ASX200 index, interest rates

❌ Weak Problem Formulation:

"Predict CBA stock price using ARIMA."

Why weak? No stakeholder, no investment decision, no risk analysis, prescribes method before analysis.

✅ Strong Problem Formulation:

Stakeholder: Retail investor with $100k to invest

Decision: Buy, hold, or sell CBA stock for a 6-month horizon (Q1-Q2 2024)

Risk Tolerance: Moderate (willing to accept 10% downside for 15% upside potential)

Alternative: Compare to ASX200 index fund (benchmark return)

Deliverable: 6-month price forecast + buy/hold/sell recommendation + expected return vs. benchmark + risk assessment
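
The deliverable above compares an expected return with a benchmark and a risk tolerance, and that arithmetic is short enough to sketch. Everything below is hypothetical: the function name `compare_to_benchmark`, the prices, the 5% benchmark return, and the choice to use a forecast lower bound as the worst case. Only the $100k capital and the 10% downside tolerance come from the example.

```python
def compare_to_benchmark(current_price, forecast_price, lower_bound_price,
                         benchmark_return, capital=100_000, max_downside=0.10):
    """Summarize a price forecast as returns, dollars, and a risk check."""
    expected_return = forecast_price / current_price - 1
    # Worst case taken from the lower bound of the forecast interval (assumption).
    worst_case_return = lower_bound_price / current_price - 1
    return {
        "expected_return": expected_return,
        "expected_gain_dollars": capital * expected_return,
        "excess_over_benchmark": expected_return - benchmark_return,
        "within_risk_tolerance": worst_case_return >= -max_downside,
    }

# Hypothetical inputs: not real CBA prices or a real index return.
result = compare_to_benchmark(
    current_price=100.0,
    forecast_price=108.0,
    lower_bound_price=92.0,
    benchmark_return=0.05,
)
for name, value in result.items():
    print(name, round(value, 4) if isinstance(value, float) else value)
```

A positive excess return with the worst case inside the tolerance supports a "buy" recommendation; the exact decision rule is yours to define and justify.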

Example 3: Plan B (Multivariate)

Dataset: E-commerce Website Traffic + Sales (Weekly, 2019-2023)

Variables: Weekly visitors, conversion rate, marketing spend, sales revenue, seasonality

✅ Strong Problem Formulation:

Stakeholder: Marketing Director with $500k annual budget

Decision: Optimize Q1 2024 marketing budget allocation across channels

Business Question: "What's the ROI of marketing spend? How much should we invest in Q1?"

Approach: Model visitors → sales relationship, test Granger causality, calculate $ impact per $1k spend

Deliverable: Q1 visitor forecast + sales forecast + recommended marketing spend ($X) + expected ROI (Y:1)

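Why this is strong: Uses VAR/regression to test whether marketing spend and visitors help predict sales (Granger causality shows predictive value, not proof of cause), calculates specific ROI, and provides an actionable budget recommendation with dollar impacts.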
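
The "$ impact per $1k spend" figure in this example is, at its simplest, the slope of a regression of sales on marketing spend. The sketch below uses a plain one-variable least-squares fit as a stand-in; it is not a VAR model or a Granger causality test, and a slope on its own does not prove that spend causes sales. The function name `ols_slope_intercept` and the weekly figures are made up for illustration.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up weekly figures, both in $1k; sales were built as 25 + 3 * spend.
spend = [5, 8, 10, 12, 15]
sales = [40, 49, 55, 61, 70]

slope, intercept = ols_slope_intercept(spend, sales)
print(f"Each extra $1k of spend is associated with ${slope:.1f}k of sales")
print(f"Baseline weekly sales with zero spend: ${intercept:.1f}k")
```

With real data the points will not sit exactly on a line, so report the fit quality alongside the slope before quoting an ROI ratio.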

Your Problem Formulation Checklist

Before finalizing your dataset and problem, check these boxes:

  ☐ Dataset meets requirements: ≥60 observations (monthly) or ≥100 (weekly/daily), regular frequency, <10% missing
  ☐ Clear stakeholder identified: Who needs this forecast? (title/role)
  ☐ Specific decision defined: What action will they take with the forecast?
  ☐ Cost structure understood: Which errors cost more, overestimates or underestimates?
  ☐ Business impact quantifiable: Can you express impact in dollars?
  ☐ Timeline specified: Forecast for how many periods ahead? (Q1 2024? Next 6 months?)
  ☐ Success criteria clear: What defines a "good" forecast? (Not just MAE!)
  ☐ Data patterns identifiable: Can you see trend/seasonality/relationships to analyze?

If you can't check all boxes: Revise your problem or choose a different dataset. A weak problem formulation will cost you significant marks on A3.
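
The last checklist item, whether patterns are identifiable, can be checked numerically as well as by eye. One common approach is the autocorrelation at the seasonal lag (12 for monthly data): a value near 1 suggests a repeating yearly pattern. This is a minimal sketch with a hypothetical helper `autocorrelation` and a made-up, perfectly seasonal series; real data will give weaker values.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

# Made-up monthly series: six years of a pure 12-month cycle around 100.
series = [100 + 20 * math.sin(2 * math.pi * t / 12) for t in range(72)]

r12 = autocorrelation(series, 12)  # same month, one year apart
r6 = autocorrelation(series, 6)    # opposite point in the cycle
print(f"lag 12: {r12:.2f}, lag 6: {r6:.2f}")
```

A strongly positive value at lag 12 together with a strongly negative value at lag 6 is the signature of a yearly cycle; values near zero at every lag suggest there is little seasonality to model.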

Common Mistakes to Avoid

Mistake | How to Fix It
Choosing dataset first, problem second | "I found cool data" → BAD. Start with "What business problem interests me?" then find data.
Generic problem statements | Don't say "forecast sales." Say "help Regional Manager set Q1 inventory budgets to minimize stockout costs."
No cost structure | Every business has asymmetric costs. Being wrong in one direction hurts more than the other. Identify this.
Focusing only on accuracy | A3 requires dollar-based ROI. "RMSE = 5.2" is not a business recommendation. "$50k potential savings" is.
Ignoring data quality issues | If data has 40% missing values or stops in 2018, find better data. Don't try to force it.
Too broad or too narrow | Too broad: "Forecast economy." Too narrow: "Forecast sales on Tuesdays in March." Find middle ground.
Prescribing method before analysis | Don't say "I'll use ARIMA." Analyze first, then match method to pattern + business need.

What You Need to Do This Week

Step 1: Explore Datasets (1-2 hours)

  • Visit 3-4 sources from today's slides
  • Download 2-3 candidate datasets
  • Check: ≥60 observations (≥100 if weekly/daily)? Regular frequency? Recent data?
  • Visualize in Excel/Orange to see patterns

Step 2: Draft Problem Statement (1 hour)

  • For your best dataset, answer the 3 questions: WHO? WHAT decision? WHY?
  • Write 3-4 sentences describing the business problem
  • Identify the cost structure (what errors hurt most?)
  • Estimate potential business impact in dollars

Step 3: Validate with Facilitator (Week 12)

  • Bring your dataset + problem statement to next session
  • Get feedback before committing to the project
  • Adjust if needed based on guidance
Deadline: Finalize your dataset and problem by end of Week 12. You need Weeks 13-14 for actual analysis and writing!

Problem Statement Template

Use this template to draft your A3 problem statement:

Template:

[Stakeholder role] at [Company/Organization] needs to [specific decision] for [time period]. Currently, [describe current situation/problem]. Being wrong costs approximately [$X] because [explain cost structure]. A reliable forecast would enable [specific action/benefit] with an estimated impact of [$Y].

Filled Example (Plan B):

The Regional Manager at CoffeeCo (15 locations) needs to set monthly inventory budgets and staffing levels for Q1-Q2 2024. Currently, budgets are based on simple year-over-year growth (+5%), missing seasonal patterns and COVID impacts. Being wrong costs approximately $50k per 10% error because stockouts cost 3x more than overstocking (lost sales vs. holding costs). A reliable forecast would enable optimized inventory purchasing and labor scheduling with an estimated impact of $150k savings in Q1-Q2.

Quick Reference: Data Sources

Plan B (Business)

  • kaggle.com/datasets
  • archive.ics.uci.edu/ml
  • datasetsearch.research.google.com
  • data.gov.au
  • abs.gov.au
  • data.world

Search Terms:

"retail time series"
"sales forecasting"
"demand prediction"
"website traffic"
"supply chain"
"energy consumption"

Plan C (Financial)

  • finance.yahoo.com
  • asx.com.au
  • fred.stlouisfed.org
  • coinmarketcap.com
  • investing.com
  • data.nasdaq.com

Common Tickers:

ASX: CBA.AX, BHP.AX, WES.AX
Crypto: BTC, ETH, BNB
Indices: ^AXJO (ASX200), ^GSPC (S&P500)
Forex: AUDUSD, EURUSD

Need help? Bring your draft dataset + problem to your facilitator in Week 12 for feedback before committing!

Final Reminders

✓ DO This:

  • Start with a business problem you care about
  • Find data that can answer a specific question
  • Define your stakeholder clearly
  • Identify cost structure and dollar impacts
  • Check data quality before committing
  • Get facilitator feedback early (Week 12)

✗ DON'T Do This:

  • Choose "interesting data" without a business problem
  • Write "forecast X" as your entire problem statement
  • Ignore cost asymmetry (over vs. under forecasting)
  • Focus only on statistical accuracy (RMSE, MAE)
  • Use datasets with <30 observations or huge gaps
  • Wait until Week 13 to start looking for data

Questions?
