1 / 42

Week 4: From Data to Questions

DATA6000 - Capstone: Industry Case Studies

Systematic Dataset Exploration for Business Analytics

Assessment 1 Due: Week 5

Today: Learn how to extract business questions from any dataset

The Common Challenge

Sound Familiar?

"I have a dataset... but I don't know what to look for."

"Where do I even start?"

"How do I turn data into a business question?"

Today's Solution

A systematic framework that works with ANY dataset:

  • Structured approach instead of random exploration
  • Repeatable process you can apply every time
  • Direct connection to Assessment 1 requirements

What You'll Learn Today

By the end of this workshop, you will be able to:

  1. Apply a 4-step framework to systematically explore any dataset
  2. Ask the right questions for each type of variable
  3. Identify patterns, anomalies, and relationships within data
  4. Transform data observations into actionable business questions
  5. Create relevant visualizations that support your research question

Assessment 1 Connection

Section 4 requires you to: "Generate a unique business question for this industry based on one of the business problems"

You also need: "at least THREE RELEVANT visualisations from YOUR DATASET"

Introducing: The SPAR Method

A systematic approach to dataset exploration

S

Structure

"What do I have?"

P

Patterns

"What's typical?"

A

Anomalies

"What stands out?"

R

Relationships

"What connects?"

Why SPAR works: Each step builds on the previous one, guiding you from basic understanding to insight generation. It prevents the "staring at data" problem by giving you specific things to look for.

The SPAR Analysis Flow

STRUCTURE β†’ Understand your data's shape and quality
↓
PATTERNS β†’ Find what's normal and typical
↓
ANOMALIES β†’ Identify what's unusual or unexpected
↓
RELATIONSHIPS β†’ Discover connections between variables
↓
BUSINESS QUESTIONS β†’ Transform insights into actionable questions

Key Principle

You cannot find patterns until you understand structure. You cannot spot anomalies until you know what's normal. You cannot understand relationships until you've identified what varies.

The Missing Skill: Knowing What to Ask

The Real Problem

SPAR tells you what to look for... but not what questions to ask yourself when you're staring at the data.

What We Say

"Look for patterns in your data"

"Find anomalies"

"Identify relationships"

What You Think

"I see numbers... now what?"

"Everything looks normal to me"

"How do I know what to compare?"

The Solution: Question Prompt Cards

Experienced analysts have internal questions they automatically ask for each type of variable. We'll teach you these questions so you can think like an analyst.

First: Know Your Variable Types

Different variable types require different questions. Always identify the type first.

NUMERICAL

What it is: Numbers you can calculate with

Examples:

  • Sales amount ($)
  • Quantity sold
  • Customer age
  • Rating score
  • Temperature

CATEGORICAL

What it is: Groups or categories

Examples:

  • Product category
  • Customer type
  • Region/City
  • Payment method
  • Gender

TIME

What it is: Dates, times, periods

Examples:

  • Transaction date
  • Month/Quarter
  • Hour of day
  • Day of week
  • Year

Question Prompts: Numerical Variables

NUMERICAL Ask yourself these questions for every numerical column:

Central & Spread

"What's the typical value? (mean/median)"
"How spread out are the values? (range/std dev)"
"Is mean very different from median?"

Extremes

"What's the highest value? Does it make sense?"
"What's the lowest value? Is zero or negative possible?"
"Are there values that seem impossible?"

Distribution Shape

"Are most values clustered in one area?"
"Is there a long tail of high/low values?"
"Are there multiple peaks (bimodal)?"

Comparison

"Is this value different across categories?"
"Has it changed over time?"
"How does it compare to industry benchmarks?"

Question Prompts: Categorical Variables

CATEGORICAL Ask yourself these questions for every categorical column:

Frequency & Dominance

"Which category is most common?"
"Does one category dominate (>50%)?"
"How many unique categories exist?"

Balance & Gaps

"Are categories roughly equal or heavily skewed?"
"Are any expected categories missing?"
"Are there categories with very few entries?"

Business Meaning

"What does each category represent?"
"Are there categories I don't recognize?"
"Should some categories be combined?"

Cross-Category

"Do certain categories appear together often?"
"Does [Numerical Variable] differ by category?"
"Which category has highest/lowest [outcome]?"

Question Prompts: Time Variables

TIME Ask yourself these questions for date/time columns:

Trend

"Is the overall direction up, down, or flat?"
"When did growth/decline start?"
"Is the rate of change accelerating or slowing?"

Seasonality

"Do certain months always perform better/worse?"
"Are weekends different from weekdays?"
"Is there a weekly/monthly/yearly cycle?"

Events & Spikes

"Are there sudden jumps or drops?"
"What happened on those dates?"
"Do holidays/events explain unusual values?"

Coverage & Gaps

"What time period does the data cover?"
"Are any time periods missing?"
"Is the data granularity appropriate?"

The 5-Second Check: Before Every Variable

Before calculating anything, take 5 seconds to ask:

1

TYPE

Numerical, Categorical, or Time?

2

EXPECT

What would I expect to see?

3

SURPRISE

What would surprise me?

4

COMPARE

What should I compare it to?

5

SO WHAT

Why does this matter to the business?

Example thought process for "Total Sales" column:

TYPE: Numerical β†’ EXPECT: Maybe $50-200 average? β†’ SURPRISE: If average is $5 or $5000 β†’ COMPARE: By store, by month, by product β†’ SO WHAT: Revenue is the core business metric

Example: Applying Question Prompts

Dataset: Supermarket Sales

Variable: Total (transaction amount in $)

Questions I Ask Myself

  1. "What's the average transaction?" β†’ $322.97
  2. "Is mean β‰  median?" β†’ Yes (322 vs 312)
  3. "What does that mean?" β†’ Right-skewed
  4. "What's the max?" β†’ $1,042
  5. "Does it differ by category?" β†’ Check by Branch...

Insights Generated

  • Most transactions are modest, but some high-value ones pull the average up
  • Branch C has highest total revenue
  • No impossible values (all positive, reasonable range)

Potential Question: "What drives high-value transactions, and can we encourage more of them?"

Practice: What Questions Would You Ask?

Scenario: You see a column called "Customer_Age" with values ranging from 18 to 85.

Which question is MOST useful to ask first?

"What's the most common age?"
"What's the average and median age, and are they different?"
"Are there any 100-year-olds?"
"How many customers are there?"

Scenario: You see a column called "Payment_Method" with values: Cash, Credit, Debit, Mobile.

Which question is MOST useful to ask first?

"What's the average payment?"
"Which payment method is most common, and does one dominate?"
"Is Mobile payment increasing over time?"
"How does payment method correlate with age?"

Practice: What Questions Would You Ask? (Continued)

Scenario: You see a column called "Order_Date" spanning Jan 2023 to Dec 2023.

Which question is MOST useful to ask first?

"What day had the most orders?"
"Are there any missing dates?"
"Is there an overall trend (up/down) and are there seasonal patterns?"
"What time do most orders occur?"

Scenario: You calculate that Mean = $500 and Median = $120 for a "Transaction_Value" column.

What does this tell you?

The data has errors
Most transactions are around $500
A few very high-value transactions are pulling the mean up
The data is normally distributed

Your Question Prompt Reference Card

Keep this mental checklist for every variable you explore:

Variable Type First Questions to Ask Follow-up Questions
NUM Mean? Median? Are they different?
Min/Max reasonable?
Does it differ by [category]?
Has it changed over time?
CAT Which is most common?
Does one category dominate?
Any unexpected categories?
How does [numerical] differ across these?
TIME Overall trend up/down/flat?
Any obvious cycles?
Any sudden spikes/drops?
What events explain them?

Remember: These questions work for any dataset. Practice them until they become automatic!

SStructure: "What Do I Have?"

The Goal

Before any analysis, you must understand the basic shape and quality of your data. This prevents errors and guides your entire analysis.

Data Dimensions

  • How many rows (observations)?
  • How many columns (variables)?
  • What time period does it cover?
  • What is the unit of analysis?

Variable Types

  • Numerical (continuous vs discrete)
  • Categorical (nominal vs ordinal)
  • Date/Time
  • Text/Free-form

Structure: Data Quality Indicators

Quality Issue What to Check Why It Matters
Missing Values % missing per column, patterns of missingness Affects analysis validity; may indicate data collection issues
Duplicates Exact duplicates, near-duplicates Can inflate counts and skew statistics
Data Types Numbers stored as text, dates as strings Incorrect types prevent proper analysis
Value Ranges Min/max values, negative where unexpected Indicates data entry errors or outliers
Consistency Same entity different spellings, format variations Causes incorrect groupings and counts

Structure: Your Checklist

Complete This for Every Dataset

Row count: Total observations in the dataset
Column count: Total variables available
Time range: Start date to end date (if applicable)
Unit of analysis: What each row represents (customer, transaction, day, etc.)
Variable types: Count of numerical, categorical, date, text columns
Missing data: Columns with >10% missing values
Key identifier: Column(s) that uniquely identify each row

Quiz: Identify the Data Type

Q1: Customer satisfaction rating (1-5 stars)

Continuous numerical
Ordinal categorical
Nominal categorical
Text

Q2: Product category (Electronics, Clothing, Food)

Continuous numerical
Ordinal categorical
Nominal categorical
Discrete numerical

Q3: Transaction amount ($45.99, $123.50, etc.)

Continuous numerical
Discrete numerical
Ordinal categorical
Nominal categorical

Quiz: Identify the Data Type (Continued)

Q4: Number of items purchased (1, 2, 3, 4...)

Continuous numerical
Discrete numerical
Ordinal categorical
Nominal categorical

Q5: Education level (High School, Bachelor's, Master's, PhD)

Discrete numerical
Ordinal categorical
Nominal categorical
Text

Why This Matters

Data type determines which analysis methods and visualizations are appropriate. Using the wrong method produces meaningless results.

PPatterns: "What's Typical?"

The Goal

Understand what "normal" looks like in your data. This baseline is essentialβ€”you cannot spot what's unusual until you know what's usual.

For Numerical Variables

  • Central tendency: Mean, median, mode
  • Spread: Range, standard deviation
  • Shape: Symmetric, skewed, bimodal
  • Quartiles: 25th, 50th, 75th percentiles

For Categorical Variables

  • Frequency: Count per category
  • Proportion: % of total
  • Mode: Most common category
  • Cardinality: Number of unique values

Patterns: Statistics That Tell Stories

Statistic What It Tells You Business Example
Mean Average value (sensitive to outliers) Average order value is $85
Median Middle value (robust to outliers) Half of customers spend less than $65
Mode Most frequent value Most common purchase quantity is 1 item
Std Dev How spread out values are Delivery times vary by Β±2 days
Skewness Distribution asymmetry Most customers are low-spenders, few big spenders

Key Insight: When mean β‰  median, your data is skewed. This affects which "average" you should report!

Patterns: Which Chart for Which Question?

Question Type Best Visualization Example
Distribution of one numerical variable Histogram, Box plot How are customer ages distributed?
Composition/proportions Pie chart, Stacked bar What % of sales come from each region?
Comparison across categories Bar chart Which product category has highest revenue?
Change over time Line chart How have monthly sales changed this year?
Relationship between two variables Scatter plot Does price affect quantity sold?

Quiz: Interpret the Statistics

Q1: A dataset of salaries has Mean = $95,000 and Median = $65,000. What does this indicate?

Data is normally distributed
Right-skewed (few high earners pulling mean up)
Left-skewed (few low earners)
Data has errors

Q2: Product A has Std Dev of $5, Product B has Std Dev of $50. Which statement is TRUE?

Product A is more expensive
Product A prices are more consistent
Product B is more popular
Cannot determine anything

Q3: A category has 10,000 rows but only 3 unique values. This category has:

Low cardinality
High cardinality
High variance
Missing values

Patterns: Turning Observations into Questions

Pattern β†’ Question Formula

When you observe a pattern, ask: "Why?" or "What if?"

Pattern Observed Business Question Generated
70% of orders come from 3 cities Why are these cities dominant? What's limiting growth elsewhere?
Average order value is $50 but median is $30 Who are the high-value customers? What makes them different?
Sales peak on Fridays and Mondays What drives these peaks? Can we replicate them on other days?
Product Category X has lowest volume but highest margins How can we increase Category X sales without hurting margins?

AAnomalies: "What Stands Out?"

The Goal

Identify data points or patterns that deviate from the norm. Anomalies often reveal the most interesting business opportunities or problems.

Types of Anomalies

  • Outliers: Individual extreme values
  • Gaps: Missing time periods or categories
  • Spikes: Sudden increases
  • Drops: Sudden decreases
  • Inconsistencies: Values that don't make sense

What Anomalies Might Indicate

  • Data entry errors
  • System glitches
  • Unusual events (holidays, promotions)
  • Fraud or suspicious activity
  • Business opportunities

Anomalies: How to Spot Them

Method How It Works When to Use
IQR Method Values beyond 1.5 Γ— IQR from Q1 or Q3 Quick outlier detection for any numerical variable
Z-Score Values more than 3 standard deviations from mean When data is approximately normal
Visual Inspection Box plots, scatter plots, time series charts Initial exploration; seeing the context
Domain Rules Values that violate business logic Age = -5, Price = 0, Quantity = 1 million
Comparison Comparing across segments or time periods Finding segments that behave differently

Anomalies: Why They Matter for Business

Case Study: Retail Sales Anomaly

Observation: One store has 300% higher sales per square foot than average

Investigation reveals: Store manager implemented unique customer engagement strategy

Business Value: Strategy documented and rolled out to other stores, increasing overall revenue by 15%

The Key Insight

Anomalies aren't just "bad data" to remove. They're often signals of:

  • Best practices worth replicating
  • Problems requiring intervention
  • Market opportunities being exploited (or missed)
  • Customer segments with unique needs

Quiz: Is This an Anomaly?

Q1: Daily website visits average 10,000. On December 26, visits were 45,000. Is this an anomaly?

Yes, definitely bad data
Yes, but likely explainable (post-holiday shopping)
No, normal variation
Cannot determine

Q2: Customer age field shows value of 250 years. This is most likely:

Data entry error
Legitimate outlier
Business opportunity
Statistical anomaly

Q3: One sales rep consistently sells 40% more than peers with same territory size. This anomaly suggests:

Remove as outlier from analysis
Investigate for best practices to replicate
Data recording error
Ignore in reporting

Anomalies: Turning Observations into Questions

Anomaly β†’ Question Formula

When you find an anomaly, ask: "What caused this?" and "What can we learn?"

Anomaly Observed Business Question Generated
5% of customers generate 60% of revenue What characteristics define these super-customers? How do we find more?
Returns spike every January Are holiday gifts driving returns? How can we reduce this?
One product has 10x the review count of similar products What makes this product generate engagement? Can we replicate?
Certain ZIP codes have zero orders Is this a delivery limitation, marketing gap, or demographic mismatch?

RRelationships: "What Connects?"

The Goal

Discover how variables interact with each other. Relationships reveal the "levers" you can pull to influence outcomes.

Types of Relationships

  • Correlation: Variables move together
  • Segmentation: Groups behave differently
  • Time-based: Trends and seasonality
  • Hierarchical: Nested groupings

Key Questions

  • Which variables predict my outcome?
  • Do different segments show different patterns?
  • What changes over time?
  • Are there natural groupings?

Important: Correlation β‰  Causation. Just because two things move together doesn't mean one causes the other.

Relationships: Understanding Correlation

Positive Correlation

β†—

As X increases, Y increases

Example: Marketing spend and website visits

Negative Correlation

β†˜

As X increases, Y decreases

Example: Price and quantity demanded

No Correlation

⟷

X and Y move independently

Example: Shoe size and income

Correlation Coefficient (r)

r = +1: Perfect positive correlation | r = 0: No correlation | r = -1: Perfect negative correlation

Rule of thumb: |r| > 0.7 = strong, 0.3-0.7 = moderate, < 0.3 = weak

Relationships: Segmentation Analysis

What is Segmentation?

Dividing your data into meaningful groups to see if different segments behave differently.

Example: E-commerce Customer Segments

Segment Avg Order Value Purchase Frequency Return Rate
New Customers $45 1.2/year 18%
Repeat Customers $78 4.5/year 8%
VIP Customers $156 12/year 3%

Insight: VIP customers are 3x more valuable than repeat customers. What makes someone become a VIP?

Quiz: Interpret the Relationship

Q1: Correlation between advertising spend and sales is r = 0.85. This indicates:

Advertising causes sales to increase
Strong positive association between the variables
85% of sales come from advertising
Weak relationship

Q2: Ice cream sales and drowning deaths are positively correlated. The most likely explanation is:

Ice cream causes drowning
Drowning makes people buy ice cream
A third variable (summer/heat) causes both
The correlation is coincidental

Q3: When comparing segments, you find urban customers have 2x the average order value of rural customers. Your next step should be:

Stop serving rural customers
Investigate what drives the difference
Combine the segments for analysis
Assume urban customers are better

Relationships: Turning Observations into Questions

Relationship β†’ Question Formula

When you find a relationship, ask: "Can we leverage this?" and "Is this causal?"

Relationship Observed Business Question Generated
Email open rate correlates with purchase rate (r=0.6) Will improving email content increase purchases, or do engaged customers just do both?
Mobile users have 30% lower conversion than desktop Is our mobile experience causing lost sales? What friction points exist?
Customers who buy Product A also buy Product B (70% overlap) Should we bundle these products? Create cross-sell recommendations?
Sales increase every Thursday What happens on Thursdays? Is it payday-related? Can we amplify this?

From SPAR Insights to Business Questions

The Question Generator Framework

For each SPAR observation, use these question starters:

Descriptive Questions

  • What is the current state of...?
  • How is X distributed across...?
  • What proportion of...?
  • How has X changed over...?

Diagnostic Questions

  • Why is X happening?
  • What factors contribute to...?
  • What differentiates A from B?
  • What explains the variation in...?

Predictive Questions

  • Can we predict which customers will...?
  • What will X be in the next...?
  • Which factors best predict...?

Prescriptive Questions

  • How can we improve...?
  • What actions would increase...?
  • What should we do to reduce...?

Question Quality Checklist

A Good Business Question Must Be:

Criterion Test Example: Good vs Bad
Specific Is it focused enough to answer? ❌ "Why do customers leave?" βœ“ "What factors predict customer churn within 90 days?"
Answerable Can your data actually answer this? ❌ "What are customers thinking?" βœ“ "What do customer reviews reveal about satisfaction?"
Actionable Will the answer lead to action? ❌ "Is our data complete?" βœ“ "Which customer segments are underserved?"
Original Does it add new insight? ❌ "What were last year's sales?" βœ“ "What drove the sales variance from forecast?"

Assessment 1 Requirement: "A difficult problem to resolve that cannot be solved by GenAI, the web or a textbook"

Complete SPAR Example: Online Retail Dataset

Structure

500K transactions
12 months
8 countries
4,000 products

Patterns

Mean order: $45
Peak: Nov-Dec
UK: 85% of orders

Anomalies

2% customers = 30% revenue
Negative quantities exist
Tuesday spike

Relationships

Quantity ↔ Total (r=0.9)
Country affects AOV
Product clusters exist

Generated Business Question

"How can we identify and replicate the purchasing behaviors of high-value customers (top 2%) to increase overall customer lifetime value in UK retail?"

This question is: Specific (top 2%, UK), Answerable (data exists), Actionable (leads to strategy), Original (not obvious from surface metrics)

Assessment 1: Visualization Requirements

Key Requirement

"You need to generate at least THREE RELEVANT visualisations from YOUR DATASET to demonstrate and illustrate its value toward answering your research question."

What Makes a Visualization "Relevant"?

  1. Directly supports your research question - not just "interesting"
  2. Shows evidence - patterns, anomalies, or relationships that justify your question
  3. Is the right chart type - matches the data and question
  4. Is clearly labeled - title, axes, legend, source

Three Visualization Strategy

Viz 1: Show a key pattern in your data (distribution, trend, comparison)

Viz 2: Show an anomaly or segment that your question addresses

Viz 3: Show a relationship that your analysis will explore

Hands-On Activity: Apply SPAR

Your Task (30 minutes)

Apply the SPAR method to the provided sample dataset OR your Assessment 1 dataset.

Step-by-Step Instructions

  1. Structure (5 min): Complete the structure checklist - rows, columns, types, quality
  2. Patterns (8 min): Use Question Prompts for each variable type. Calculate key statistics.
  3. Anomalies (7 min): Find at least 2 anomalies, classify them (error vs insight)
  4. Relationships (7 min): Identify 2 correlations or segment differences
  5. Question (3 min): Generate one business question that passes the quality checklist

Remember Your Question Prompts!

For each variable, ask: What type is it? What would I expect? What would surprise me?

Key Takeaways

Structure First

Know your data before analyzing it

Find the Normal

Patterns establish your baseline

Value Anomalies

Unusual = interesting business opportunities

Connections Matter

Relationships reveal the levers

The SPAR Method + Question Prompts Give You

  • A repeatable process for any dataset
  • Internal questions to ask yourself for every variable
  • Systematic evidence gathering for your business question
  • Clear justification for your visualizations
  • Confidence that you haven't missed important insights

Next Steps for Assessment 1

This Week

  1. Apply SPAR + Question Prompts to your chosen dataset
  2. Document your findings in each category
  3. Generate 3-5 potential business questions
  4. Create draft visualizations that support your strongest question

Assessment 1 Due: Week 5

  • Report uploaded to Turnitin
  • Visualization file (Tableau/Power BI) uploaded to File Dropbox
  • 10+ references supporting your industry review

Questions?

Use today's remaining time to start your SPAR analysis. Your facilitator can help guide your exploration.