Systematic Dataset Exploration for Business Analytics
Assessment 1 Due: Week 5
Today: Learn how to extract business questions from any dataset
"I have a dataset... but I don't know what to look for."
"Where do I even start?"
"How do I turn data into a business question?"
A systematic framework that works with ANY dataset:
Section 4 requires you to: "Generate a unique business question for this industry based on one of the business problems"
You also need: "at least THREE RELEVANT visualisations from YOUR DATASET"
A systematic approach to dataset exploration
"What do I have?"
"What's typical?"
"What stands out?"
"What connects?"
Why SPAR works: Each step builds on the previous one, guiding you from basic understanding to insight generation. It prevents the "staring at data" problem by giving you specific things to look for.
You cannot find patterns until you understand structure. You cannot spot anomalies until you know what's normal. You cannot understand relationships until you've identified what varies.
SPAR tells you what to look for... but not what questions to ask yourself when you're staring at the data.
"Look for patterns in your data"
"Find anomalies"
"Identify relationships"
"I see numbers... now what?"
"Everything looks normal to me"
"How do I know what to compare?"
Experienced analysts have internal questions they automatically ask for each type of variable. We'll teach you these questions so you can think like an analyst.
Different variable types require different questions. Always identify the type first.
What it is: Numbers you can calculate with
Examples:
What it is: Groups or categories
Examples:
What it is: Dates, times, periods
Examples:
NUMERICAL Ask yourself these questions for every numerical column:
CATEGORICAL Ask yourself these questions for every categorical column:
TIME Ask yourself these questions for date/time columns:
Before calculating anything, take 5 seconds to ask:
TYPE
Numerical, Categorical, or Time?
EXPECT
What would I expect to see?
SURPRISE
What would surprise me?
COMPARE
What should I compare it to?
SO WHAT
Why does this matter to the business?
Example thought process for "Total Sales" column:
TYPE: Numerical β EXPECT: Maybe $50-200 average? β SURPRISE: If average is $5 or $5000 β COMPARE: By store, by month, by product β SO WHAT: Revenue is the core business metric
Variable: Total (transaction amount in $)
Potential Question: "What drives high-value transactions, and can we encourage more of them?"
Scenario: You see a column called "Customer_Age" with values ranging from 18 to 85.
Which question is MOST useful to ask first?
Scenario: You see a column called "Payment_Method" with values: Cash, Credit, Debit, Mobile.
Which question is MOST useful to ask first?
Scenario: You see a column called "Order_Date" spanning Jan 2023 to Dec 2023.
Which question is MOST useful to ask first?
Scenario: You calculate that Mean = $500 and Median = $120 for a "Transaction_Value" column.
What does this tell you?
Keep this mental checklist for every variable you explore:
| Variable Type | First Questions to Ask | Follow-up Questions |
|---|---|---|
| NUM | Mean? Median? Are they different? Min/Max reasonable? |
Does it differ by [category]? Has it changed over time? |
| CAT | Which is most common? Does one category dominate? |
Any unexpected categories? How does [numerical] differ across these? |
| TIME | Overall trend up/down/flat? Any obvious cycles? |
Any sudden spikes/drops? What events explain them? |
Remember: These questions work for any dataset. Practice them until they become automatic!
Before any analysis, you must understand the basic shape and quality of your data. This prevents errors and guides your entire analysis.
| Quality Issue | What to Check | Why It Matters |
|---|---|---|
| Missing Values | % missing per column, patterns of missingness | Affects analysis validity; may indicate data collection issues |
| Duplicates | Exact duplicates, near-duplicates | Can inflate counts and skew statistics |
| Data Types | Numbers stored as text, dates as strings | Incorrect types prevent proper analysis |
| Value Ranges | Min/max values, negative where unexpected | Indicates data entry errors or outliers |
| Consistency | Same entity different spellings, format variations | Causes incorrect groupings and counts |
Q1: Customer satisfaction rating (1-5 stars)
Q2: Product category (Electronics, Clothing, Food)
Q3: Transaction amount ($45.99, $123.50, etc.)
Q4: Number of items purchased (1, 2, 3, 4...)
Q5: Education level (High School, Bachelor's, Master's, PhD)
Data type determines which analysis methods and visualizations are appropriate. Using the wrong method produces meaningless results.
Understand what "normal" looks like in your data. This baseline is essentialβyou cannot spot what's unusual until you know what's usual.
| Statistic | What It Tells You | Business Example |
|---|---|---|
| Mean | Average value (sensitive to outliers) | Average order value is $85 |
| Median | Middle value (robust to outliers) | Half of customers spend less than $65 |
| Mode | Most frequent value | Most common purchase quantity is 1 item |
| Std Dev | How spread out values are | Delivery times vary by Β±2 days |
| Skewness | Distribution asymmetry | Most customers are low-spenders, few big spenders |
Key Insight: When mean β median, your data is skewed. This affects which "average" you should report!
| Question Type | Best Visualization | Example |
|---|---|---|
| Distribution of one numerical variable | Histogram, Box plot | How are customer ages distributed? |
| Composition/proportions | Pie chart, Stacked bar | What % of sales come from each region? |
| Comparison across categories | Bar chart | Which product category has highest revenue? |
| Change over time | Line chart | How have monthly sales changed this year? |
| Relationship between two variables | Scatter plot | Does price affect quantity sold? |
Q1: A dataset of salaries has Mean = $95,000 and Median = $65,000. What does this indicate?
Q2: Product A has Std Dev of $5, Product B has Std Dev of $50. Which statement is TRUE?
Q3: A category has 10,000 rows but only 3 unique values. This category has:
When you observe a pattern, ask: "Why?" or "What if?"
| Pattern Observed | Business Question Generated |
|---|---|
| 70% of orders come from 3 cities | Why are these cities dominant? What's limiting growth elsewhere? |
| Average order value is $50 but median is $30 | Who are the high-value customers? What makes them different? |
| Sales peak on Fridays and Mondays | What drives these peaks? Can we replicate them on other days? |
| Product Category X has lowest volume but highest margins | How can we increase Category X sales without hurting margins? |
Identify data points or patterns that deviate from the norm. Anomalies often reveal the most interesting business opportunities or problems.
| Method | How It Works | When to Use |
|---|---|---|
| IQR Method | Values beyond 1.5 Γ IQR from Q1 or Q3 | Quick outlier detection for any numerical variable |
| Z-Score | Values more than 3 standard deviations from mean | When data is approximately normal |
| Visual Inspection | Box plots, scatter plots, time series charts | Initial exploration; seeing the context |
| Domain Rules | Values that violate business logic | Age = -5, Price = 0, Quantity = 1 million |
| Comparison | Comparing across segments or time periods | Finding segments that behave differently |
Observation: One store has 300% higher sales per square foot than average
Investigation reveals: Store manager implemented unique customer engagement strategy
Business Value: Strategy documented and rolled out to other stores, increasing overall revenue by 15%
Anomalies aren't just "bad data" to remove. They're often signals of:
Q1: Daily website visits average 10,000. On December 26, visits were 45,000. Is this an anomaly?
Q2: Customer age field shows value of 250 years. This is most likely:
Q3: One sales rep consistently sells 40% more than peers with same territory size. This anomaly suggests:
When you find an anomaly, ask: "What caused this?" and "What can we learn?"
| Anomaly Observed | Business Question Generated |
|---|---|
| 5% of customers generate 60% of revenue | What characteristics define these super-customers? How do we find more? |
| Returns spike every January | Are holiday gifts driving returns? How can we reduce this? |
| One product has 10x the review count of similar products | What makes this product generate engagement? Can we replicate? |
| Certain ZIP codes have zero orders | Is this a delivery limitation, marketing gap, or demographic mismatch? |
Discover how variables interact with each other. Relationships reveal the "levers" you can pull to influence outcomes.
Important: Correlation β Causation. Just because two things move together doesn't mean one causes the other.
β
As X increases, Y increases
Example: Marketing spend and website visits
β
As X increases, Y decreases
Example: Price and quantity demanded
β·
X and Y move independently
Example: Shoe size and income
r = +1: Perfect positive correlation | r = 0: No correlation | r = -1: Perfect negative correlation
Rule of thumb: |r| > 0.7 = strong, 0.3-0.7 = moderate, < 0.3 = weak
Dividing your data into meaningful groups to see if different segments behave differently.
| Segment | Avg Order Value | Purchase Frequency | Return Rate |
|---|---|---|---|
| New Customers | $45 | 1.2/year | 18% |
| Repeat Customers | $78 | 4.5/year | 8% |
| VIP Customers | $156 | 12/year | 3% |
Insight: VIP customers are 3x more valuable than repeat customers. What makes someone become a VIP?
Q1: Correlation between advertising spend and sales is r = 0.85. This indicates:
Q2: Ice cream sales and drowning deaths are positively correlated. The most likely explanation is:
Q3: When comparing segments, you find urban customers have 2x the average order value of rural customers. Your next step should be:
When you find a relationship, ask: "Can we leverage this?" and "Is this causal?"
| Relationship Observed | Business Question Generated |
|---|---|
| Email open rate correlates with purchase rate (r=0.6) | Will improving email content increase purchases, or do engaged customers just do both? |
| Mobile users have 30% lower conversion than desktop | Is our mobile experience causing lost sales? What friction points exist? |
| Customers who buy Product A also buy Product B (70% overlap) | Should we bundle these products? Create cross-sell recommendations? |
| Sales increase every Thursday | What happens on Thursdays? Is it payday-related? Can we amplify this? |
For each SPAR observation, use these question starters:
| Criterion | Test | Example: Good vs Bad |
|---|---|---|
| Specific | Is it focused enough to answer? | β "Why do customers leave?" β "What factors predict customer churn within 90 days?" |
| Answerable | Can your data actually answer this? | β "What are customers thinking?" β "What do customer reviews reveal about satisfaction?" |
| Actionable | Will the answer lead to action? | β "Is our data complete?" β "Which customer segments are underserved?" |
| Original | Does it add new insight? | β "What were last year's sales?" β "What drove the sales variance from forecast?" |
Assessment 1 Requirement: "A difficult problem to resolve that cannot be solved by GenAI, the web or a textbook"
500K transactions
12 months
8 countries
4,000 products
Mean order: $45
Peak: Nov-Dec
UK: 85% of orders
2% customers = 30% revenue
Negative quantities exist
Tuesday spike
Quantity β Total (r=0.9)
Country affects AOV
Product clusters exist
"How can we identify and replicate the purchasing behaviors of high-value customers (top 2%) to increase overall customer lifetime value in UK retail?"
This question is: Specific (top 2%, UK), Answerable (data exists), Actionable (leads to strategy), Original (not obvious from surface metrics)
"You need to generate at least THREE RELEVANT visualisations from YOUR DATASET to demonstrate and illustrate its value toward answering your research question."
Viz 1: Show a key pattern in your data (distribution, trend, comparison)
Viz 2: Show an anomaly or segment that your question addresses
Viz 3: Show a relationship that your analysis will explore
Apply the SPAR method to the provided sample dataset OR your Assessment 1 dataset.
For each variable, ask: What type is it? What would I expect? What would surprise me?
Know your data before analyzing it
Patterns establish your baseline
Unusual = interesting business opportunities
Relationships reveal the levers
Use today's remaining time to start your SPAR analysis. Your facilitator can help guide your exploration.