Week 2: Applying Analytics Methodology to Industry Projects
DATA6000: Industry Business Analytics Project: Capstone
Focus Areas:
Formulating the right business questions
Sourcing data for industry projects
Applying analytics methodology steps
Learning Objectives
By the end of this session, you will be able to:
Transform vague business requests into specific, actionable analytics questions
Identify and evaluate appropriate data sources for industry capstone projects
Apply the 6-phase analytics methodology systematically to real-world problems
Recognize common pitfalls in problem formulation and data sourcing
Plan and scope an analytics project with realistic constraints
Why Problem Formulation is Critical
"A problem well-stated is a problem half-solved" - Charles Kettering
Problem formulation is the most critical element in any business analytics project because:
Direction: It determines every subsequent decision in your project
Resources: It defines what data, tools, and techniques you need
Success Metrics: It establishes how you measure project success
Stakeholder Alignment: It ensures everyone understands the project goals
Key Insight: Most failed analytics projects fail not because of poor analysis, but because they solved the wrong problem. Getting the problem right is your first and most important task.
The Translation Challenge
From Business Language to Analytics Language
Business stakeholders speak in terms of goals and outcomes. Analysts need specific, measurable, and answerable questions.
Business Language (Vague) | Analytics Language (Specific)
"We need to improve sales" | "What factors predict which customers will make repeat purchases within 30 days?"
"Customers are leaving" | "Which customer segments have the highest churn rate in the past 6 months, and what behaviors precede churn?"
"Optimize our operations" | "Where are the bottlenecks in our order fulfillment process that add more than 24 hours to delivery time?"
"Understand our customers better" | "What are the distinct customer segments based on purchasing behavior, and what are their characteristics?"
What Makes a Good Analytics Question?
The SMART-A Framework for Analytics Questions
S - Specific: Clearly defined scope and focus. Who, what, where, and when are specified.
M - Measurable: Can be answered with data and quantitative methods. Success can be objectively evaluated.
A - Actionable: The answer will lead to concrete business decisions or actions.
R - Relevant: Directly tied to business objectives and stakeholder needs.
T - Time-bound: Has a clear timeframe for analysis and impact.
A - Analytically Tractable: Can be answered with available or obtainable data and appropriate methods.
The 5W1H Method for Problem Definition
A systematic framework to translate business problems into analytics questions:
The 5W1H Framework
WHO: Stakeholders, target users, affected parties
WHAT: Objectives, outcomes, deliverables
WHERE: Context, location, business unit
WHEN: Timeframe, deadlines, temporal scope
WHY: Motivation, business value, impact
HOW: Success metrics, KPIs, measurement
5W1H Method: Applied Example
Business Request: "We need to reduce customer churn"
Applying the 5W1H Framework:
WHO: Subscription customers who have been active for at least 3 months
WHAT: Identify customers at high risk of canceling their subscription
WHERE: Focus on the North American market segment initially
WHEN: Predict churn probability for the next 60 days; project completion in 8 weeks
WHY: Customer acquisition costs are 5x retention costs; reducing churn by 10% is worth $2M in annual revenue
HOW: Success measured by model accuracy (>80%), precision (>70%), and business impact (5% churn reduction in pilot)
Resulting Analytics Question: "Which features and behaviors predict subscription cancellation within 60 days for North American customers who have been active for 3+ months, with sufficient accuracy to enable targeted retention interventions?"
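The 5W1H answers above can also be captured as a structured record, which makes it easy to spot elements that are still blank before drafting the analytics question. This is a minimal sketch: the `ProblemDefinition` class and its `missing_elements` method are hypothetical names introduced for illustration, and the field values are shortened from the example above.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemDefinition:
    """Hypothetical record holding one free-text answer per 5W1H element."""
    who: str
    what: str
    where: str
    when: str
    why: str
    how: str

    def missing_elements(self) -> list[str]:
        """Return the names of any 5W1H elements left blank."""
        return [f.name.upper() for f in fields(self)
                if not getattr(self, f.name).strip()]


# Values shortened from the churn example above.
churn = ProblemDefinition(
    who="Subscription customers active for at least 3 months",
    what="Identify customers at high risk of canceling",
    where="North American market segment",
    when="Churn within the next 60 days; project completion in 8 weeks",
    why="Acquisition costs are 5x retention costs",
    how="Accuracy >80%, precision >70%, 5% churn reduction in pilot",
)

# The raw business request only answers WHAT.
vague = ProblemDefinition(
    who="", what="Reduce customer churn", where="", when="", why="", how="",
)

print(churn.missing_elements())  # []
print(vague.missing_elements())  # ['WHO', 'WHERE', 'WHEN', 'WHY', 'HOW']
```

A non-empty list is a prompt for another stakeholder conversation rather than a reason to start modeling.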
Common Pitfalls in Problem Formulation
Watch Out For These Mistakes:
Too Vague
Problem: "Improve customer experience"
Why it fails: No specific metrics, no clear scope, impossible to measure success
Fix: "Reduce average customer support response time from 4 hours to 2 hours"
Too Broad
Problem: "Analyze all business operations"
Why it fails: Unfocused, resource-intensive, no clear deliverable
Fix: "Identify inefficiencies in the order fulfillment process that cause delays >24 hours"
Solution-Focused
Problem: "Build a neural network for sales"
Why it fails: Specifies technique before understanding problem
Fix: "Predict monthly sales by product category to optimize inventory levels"
Not Actionable
Problem: "Explore interesting patterns in data"
Why it fails: No business decision will result from the analysis
Fix: "Identify customer segments with different purchasing behaviors to personalize marketing"
Practice: Identifying Problem Quality
Evaluate these problem statements. Which are well-formulated and which need improvement?
Problem Statement | Quality Assessment
"Use machine learning to improve things" | ❌ Poor: Too vague, solution-focused, not actionable
"Which product features correlate with customer ratings above 4 stars in our mobile app?" | ✓ Good: Specific, measurable, actionable
"Make customers happy" | ❌ Poor: Not measurable, no scope, undefined
"What factors predict employee turnover within 90 days for new hires in sales roles?" | ✓ Good: Specific, time-bound, measurable
"Do something about logistics" | ❌ Poor: No objective, no scope, too vague
Quiz 1: Problem Definition
A retail manager says "We need to improve sales." Which is the BEST analytics question?
Data Sourcing for Industry Projects
Why Data Sourcing Matters
Once you have defined your business problem, the next critical step is identifying and accessing the right data.
Without appropriate data, even the most well-defined problem cannot be solved. Data sourcing determines the feasibility and quality of your entire project.
Data availability constrains what questions you can actually answer
Data quality determines the reliability of your insights
Data access affects project timelines and feasibility
Data costs impact project budgets and sustainability
Types of Data Sources in Industry Settings
Three Categories of Data Sources
Internal Sources
Transactional databases
CRM systems
ERP systems
Web analytics
Operational logs
Employee records
Financial systems
External Sources
Public datasets
Government data
Third-party APIs
Market research
Social media
Web scraping
Industry reports
Hybrid/Enriched
Internal + external combined
Purchased enrichment
Partner data sharing
Surveys + transaction data
Demographic overlays
The Data Availability Matrix
Evaluate potential data sources using this framework:
Data Availability Matrix (axes: Relevance and Accessibility, each from Low to High)
Low Relevance + High Accessibility
Status: Easy to get but not useful
Action: Avoid unless exploratory
Example: Public datasets unrelated to your problem
High Relevance + High Accessibility
Status: Ideal data sources
Action: Prioritize these sources
Example: Internal transaction database
Low Relevance + Low Accessibility
Status: Worst case scenario
Action: Eliminate from consideration
Example: Restricted competitor data not related to problem
High Relevance + Low Accessibility
Status: Valuable but challenging
Action: Assess effort vs. value
Example: Partner data requiring legal agreements
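One way to apply the matrix consistently across many candidate sources is to score each one and map the scores to a quadrant. The sketch below is illustrative only: the 1-5 scoring scale, the threshold of 3, and the scores given to each example source are assumptions, not part of the framework itself.

```python
def classify_source(relevance: int, accessibility: int, threshold: int = 3) -> str:
    """Map assumed 1-5 relevance/accessibility scores to the matrix action."""
    high_relevance = relevance >= threshold
    high_accessibility = accessibility >= threshold
    if high_relevance and high_accessibility:
        return "Prioritize"
    if high_relevance:
        return "Assess effort vs. value"
    if high_accessibility:
        return "Avoid unless exploratory"
    return "Eliminate"


# Example sources from the matrix; the (relevance, accessibility) scores are made up.
candidates = {
    "Internal transaction database": (5, 5),
    "Partner data requiring legal agreements": (5, 2),
    "Public datasets unrelated to your problem": (1, 5),
    "Restricted competitor data": (1, 1),
}

for name, (relevance, accessibility) in candidates.items():
    print(f"{name}: {classify_source(relevance, accessibility)}")
```

Scoring forces the team to state why a source is considered relevant or accessible, which is often more useful than the resulting label.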
Real-World Data Sourcing Challenges
Industry projects face practical challenges that academic projects rarely encounter:
Data Silos
Data scattered across different departments, systems, and formats with no integration
Impact: Requires significant time for data collection and integration
Data Quality Issues
Missing values, inconsistent formats, duplicates, errors, and outdated information
Impact: Data cleaning is often cited as taking 60-80% of project time
Access Restrictions
Privacy regulations, security policies, legal constraints, and approval processes
Impact: Delays in project start and potential scope changes
Documentation Gaps
Unclear data definitions, missing metadata, undocumented business rules
Impact: Risk of misinterpreting data and drawing wrong conclusions
Key Takeaway: Always assess data availability and quality EARLY in your project. Many projects fail because data challenges were discovered too late.
Data Sourcing Checklist for Capstone Projects
Before Committing to a Data Source:
✓ Availability: Can you actually access this data? What approvals are needed?
✓ Timeframe: How long will it take to obtain access and extract the data?
✓ Completeness: Does the data cover the full scope of your analysis (time period, geographic area, customer segments)?
✓ Quality: What is the expected quality? Are there known issues?
✓ Format: In what format is the data? Will conversion or significant preprocessing be required?
✓ Volume: Is the data volume sufficient for your analysis? Is it too large to handle with your tools?
✓ Documentation: Is there a data dictionary? Are field definitions clear?
✓ Compliance: Are there legal, privacy, or ethical constraints on use?
✓ Cost: Is there any cost to obtain or use the data?
✓ Backup Plan: What alternatives exist if this data source falls through?
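The Completeness and Quality items in the checklist can be checked quickly on a sample extract before committing to a source. The following is a minimal sketch using only the standard library; the function names, the column names, and the four sample rows are all hypothetical.

```python
def missing_share(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Fraction of rows where each column is absent, None, or an empty string."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / total
        for col in columns
    }


def duplicate_count(rows: list[dict], key: str) -> int:
    """Number of rows whose key value was already seen in an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        value = row.get(key)
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return duplicates


# Made-up sample extract for illustration.
sample = [
    {"customer_id": "C1", "plan": "monthly", "last_login": "2024-01-05"},
    {"customer_id": "C2", "plan": "", "last_login": None},
    {"customer_id": "C2", "plan": "annual", "last_login": "2024-01-07"},
    {"customer_id": "C3", "plan": "monthly"},
]

print(missing_share(sample, ["customer_id", "plan", "last_login"]))
# {'customer_id': 0.0, 'plan': 0.25, 'last_login': 0.5}
print(duplicate_count(sample, "customer_id"))  # 1
```

Running a check like this on a small sample during scoping gives an early signal about how much cleaning effort to plan for.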
Quiz 2: Data Sourcing
For predicting employee turnover, which data source combination is MOST appropriate?
Applying the Analytics Methodology
Revisiting the 6-Phase Framework
Phase 1: Problem Definition & Data Sourcing
Phase 2: Data Processing & Management
Phase 3: Analytics Techniques
Phase 4: Visualization & Evaluation
Phase 5: Communication & Recommendations
Phase 6: Ethics & Security
Today's Focus: Phase 1 is where projects succeed or fail. Getting problem definition and data sourcing right determines everything that follows.
Phase 1: Problem Definition & Data Sourcing
Detailed Workflow
1. Initial Stakeholder Meeting
Understand business context, objectives, constraints. Document in your own words.
2. Apply 5W1H Framework
Systematically clarify who, what, where, when, why, and how for the project.
3. Draft Analytics Question
Translate business need into specific, measurable, actionable analytics question.
4. Identify Required Data
List all data elements needed to answer the analytics question.
5. Map to Data Sources
Identify where each data element can be obtained. Apply availability matrix.
6. Assess Feasibility
Evaluate data access, quality, timeline. Identify gaps and risks.
7. Refine or Pivot
Adjust analytics question based on data reality. Get stakeholder approval.
"Which customer behavioral patterns and account characteristics predict subscription cancellation within 30-60 days for US-based monthly subscribers, with sufficient accuracy (>75%) and precision (>70%) to enable cost-effective retention interventions?"
Why This Question Works: It's specific (30-60 days, US, monthly), measurable (>75% accuracy), actionable (enables interventions), relevant (addresses churn), time-bound (project timeline), and tractable (we can get the data).
Case Study: Identifying Data Requirements
What data do we need to answer the analytics question?
Data Category | Specific Data Elements | Source
Customer Profile | Customer ID, subscription start date, plan type, demographics, location | Internal CRM
Behavioral Data | Login frequency, product views, time on site, feature usage, last login date | Web analytics (Google Analytics)
Transaction History | Payment history, failed payments, refund requests, plan changes | Payment system (Stripe)
Support Interactions | Support ticket count, resolution time, satisfaction scores, complaint types |
 | Subscription status (active/cancelled), cancellation date, cancellation reason | Internal CRM
Case Study: Assessing Data Availability
Apply the Data Availability Matrix to each source:
High Relevance + High Accessibility
Internal CRM data (customer profiles, subscription status)
Transaction history from payment system
Internal product engagement data
Action: Prioritize these - start here
High Relevance + Low Accessibility
Web analytics data (requires API setup and historical data export)
Support ticket data (in separate system, needs integration)
Action: Worth the effort - plan for 2-week data integration
Low Relevance + High Accessibility
General industry benchmark data
Public e-commerce statistics
Action: Use only for context, not core analysis
Low Relevance + Low Accessibility
Competitor customer data (impossible to obtain)
Detailed social media sentiment (complex to collect and low direct relevance)
Action: Eliminate - not worth pursuing
Case Study: Selecting Analytics Approach
Based on the defined problem and available data, what methodology makes sense?
Methodology Selection Rationale
Problem Type: Binary classification (will churn or won't churn)
Data Type: Mix of structured numerical and categorical data
Data Volume: 50,000 customers with 18 months of historical data (sufficient for supervised learning)
Business Requirement: Need interpretable results to understand why customers churn
Recommended Approach: Supervised classification models (logistic regression for interpretability, random forest for comparison, evaluation on holdout test set)
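Whichever model is chosen, the holdout evaluation comes down to comparing predictions with actual outcomes against the targets in the analytics question (accuracy >75%, precision >70%). The sketch below computes both metrics from scratch so the definitions are explicit; the function name and the ten labels are made up for illustration, and a real project would typically use a library implementation.

```python
def accuracy_and_precision(actual: list[int], predicted: list[int]) -> tuple[float, float]:
    """Accuracy and precision for binary labels, where 1 means 'churned'."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    accuracy = correct / len(actual)
    # Precision: of the customers flagged as churners, how many actually churned.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    return accuracy, precision


# Made-up holdout labels: 3 actual churners out of 10 customers.
actual = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

accuracy, precision = accuracy_and_precision(actual, predicted)
print(f"accuracy={accuracy:.2f} meets >0.75: {accuracy > 0.75}")
# accuracy=0.80 meets >0.75: True
print(f"precision={precision:.2f} meets >0.70: {precision > 0.70}")
# precision=0.67 meets >0.70: False
```

In this toy example the model clears the accuracy target but misses the precision target, which is why the analytics question specifies both.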
Quiz 3: Methodology Application
What is the FIRST step when starting an analytics capstone project?
Practical Considerations: Stakeholder Management
Getting Buy-In and Managing Expectations
Technical excellence alone doesn't guarantee project success. You must manage stakeholders effectively.
Key Stakeholders
Sponsor: Executive champion
End Users: Who will use insights
Data Owners: Control data access
IT/Technical: Support infrastructure
Compliance: Legal/privacy oversight
Best Practices
Set clear expectations early
Communicate in business terms, not jargon
Provide regular status updates (weekly)
Be transparent about limitations
Document all agreements
Common Stakeholder Issues
Scope creep: "While you're at it, can you also analyze..."
Unrealistic expectations: "Can you predict next year's sales with 95% accuracy?"
Changing priorities: Mid-project shifts in business focus
Data gatekeeping: Stakeholders reluctant to share data
Solution: Written project scope document signed by all stakeholders at project start.
Timeline Planning: Realistic vs Optimistic
Student project plans often underestimate time requirements. Here is where the time actually goes:
Project Phase | Student Estimate | Realistic Industry Timeline
Problem definition & scoping | 1 week | 2-3 weeks (multiple stakeholder meetings)
Data access approval | Immediate | 1-4 weeks (legal/IT approvals)
Data collection & integration | 1 week | 2-4 weeks (multiple systems, APIs, extraction)
Data cleaning & preparation | 1 week | 3-6 weeks (often cited as 60-80% of project effort)
Analysis & modeling | 2 weeks | 2-3 weeks (the "fun" part is shortest)
Validation & refinement | 1 week | 2-3 weeks (multiple iterations)
Documentation & presentation | 1 week | 2 weeks (stakeholder-ready materials)
TOTAL | 7 weeks | 14-25 weeks
Planning Principle: Whatever timeline you think is reasonable, add a 50% buffer for unexpected delays. They WILL happen.
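The realistic total and the effect of the buffer can be checked with a few lines of arithmetic. This is a minimal sketch: the week ranges are taken from the realistic column of the table above, and applying the 50% buffer to that total is an illustrative choice.

```python
# (low, high) week ranges from the realistic industry timeline.
realistic_weeks = {
    "Problem definition & scoping": (2, 3),
    "Data access approval": (1, 4),
    "Data collection & integration": (2, 4),
    "Data cleaning & preparation": (3, 6),
    "Analysis & modeling": (2, 3),
    "Validation & refinement": (2, 3),
    "Documentation & presentation": (2, 2),
}

low = sum(lo for lo, _ in realistic_weeks.values())
high = sum(hi for _, hi in realistic_weeks.values())
print(f"Realistic total: {low}-{high} weeks")
# Realistic total: 14-25 weeks

BUFFER = 0.5  # the 50% buffer from the planning principle
buffered_low = low * (1 + BUFFER)
buffered_high = high * (1 + BUFFER)
print(f"With buffer: {buffered_low:g}-{buffered_high:g} weeks")
# With buffer: 21-37.5 weeks
```

Keeping the estimate as a range per phase, rather than a single number, makes it easier to show stakeholders where the uncertainty sits.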
Risk Assessment and Mitigation
Identify and plan for potential project risks early:
Risk Category | Specific Risk | Mitigation Strategy
Data Access | Cannot obtain necessary data due to privacy/security constraints | Identify alternative data sources; have backup project scope
Data Quality | Data has >50% missing values or major quality issues | Early data quality assessment; plan for imputation or scope adjustment
Technical | Data volume too large for available tools/infrastructure | Sample data for initial analysis; cloud computing resources
Scope | Problem too complex for project timeline | Break into phases; focus on MVP (minimum viable product)
Stakeholder | Stakeholder changes priorities mid-project | Written scope agreement; regular check-ins; document changes
Your capstone project requires customer transaction data, but the company's data is spread across 5 different systems with inconsistent formats. What is the PRIMARY risk?
Why Documentation Matters from Day One
Good documentation is not optional - it's essential for project success and reproducibility.
Essential Documentation Components
Project Charter: Problem statement, objectives, scope, stakeholders, timeline
Data Dictionary: All data sources, field definitions, data types, missing value codes
Analysis Log: Date, what you tried, results, decisions made, lessons learned
Code Repository: Version controlled, well-commented, README file explaining structure
Decision Documentation: Why you chose certain methods, what alternatives you considered
Findings presented clearly to both technical and business audiences
6. Implementation Readiness
Clear recommendations with feasible next steps
Workshop Activity: Scope Your Own Project
Group Exercise (20 minutes)
Activity Instructions
You will work in groups of 3-4 to analyze a business brief and develop a project scope.
1. Read the Business Brief (2 min)
Your group will receive a real business scenario from an industry partner
2. Apply 5W1H Framework (5 min)
Systematically analyze who, what, where, when, why, and how
3. Define Analytics Question (5 min)
Transform business need into a specific, measurable analytics question
4. Identify Data Sources (5 min)
List required data and potential sources, assess availability
5. Present to Class (3 min per group)
Share your analytics question and rationale
Workshop Scenario
Healthcare Provider: Emergency Department Wait Times
Background: St. Mary's Hospital is a 400-bed facility with a busy emergency department (ED). The Chief Medical Officer is concerned about patient satisfaction scores related to wait times.
Business Request:
"Our ED wait times are hurting patient satisfaction and our reputation. We need to understand what's causing delays and fix them. Can you help us reduce wait times?"
Common complaints mention "long wait times" but no specific data
Hospital has electronic health records (EHR) system with timestamped patient flow data
Leadership wants recommendations within 8 weeks
Your Task: Apply the frameworks from today's lecture to scope this project. What's your analytics question? What data do you need? What are the risks?
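When scoping the data needs, it can help to picture what timestamped patient flow data makes measurable. The sketch below breaks each visit into stages and reports the median duration of each; the stage names, timestamp format, and the three visits are entirely hypothetical, since the real EHR schema is not described in the brief.

```python
from datetime import datetime
from statistics import median

# Hypothetical patient-flow stages; a real EHR extract may use different events.
STAGES = ["arrival", "triage", "seen_by_doctor", "discharge"]

# Made-up visits for illustration only.
visits = [
    {"arrival": "2024-03-01 08:00", "triage": "2024-03-01 08:10",
     "seen_by_doctor": "2024-03-01 09:40", "discharge": "2024-03-01 10:10"},
    {"arrival": "2024-03-01 09:00", "triage": "2024-03-01 09:20",
     "seen_by_doctor": "2024-03-01 10:20", "discharge": "2024-03-01 11:20"},
    {"arrival": "2024-03-01 10:00", "triage": "2024-03-01 10:05",
     "seen_by_doctor": "2024-03-01 12:05", "discharge": "2024-03-01 12:35"},
]


def stage_minutes(visit: dict) -> dict[str, float]:
    """Minutes spent between each pair of consecutive stages in one visit."""
    times = [datetime.strptime(visit[s], "%Y-%m-%d %H:%M") for s in STAGES]
    return {
        f"{start} -> {end}": (t_end - t_start).total_seconds() / 60
        for start, end, t_start, t_end in zip(STAGES, STAGES[1:], times, times[1:])
    }


def median_by_stage(all_visits: list[dict]) -> dict[str, float]:
    """Median minutes per stage transition across all visits."""
    per_visit = [stage_minutes(v) for v in all_visits]
    return {stage: median(d[stage] for d in per_visit) for stage in per_visit[0]}


print(median_by_stage(visits))
# {'arrival -> triage': 10.0, 'triage -> seen_by_doctor': 90.0, 'seen_by_doctor -> discharge': 30.0}
```

Splitting "wait time" into stages like this is one way to turn the vague complaint into something measurable, which is the point of the scoping exercise.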
Quiz 5: Applied Scenario
A healthcare provider wants to "reduce patient wait times." You have access to: appointment schedules, patient check-in logs, doctor availability, and treatment duration records. What is the MOST appropriate initial analytics approach?
Key Takeaways from Week 2
Problem formulation is the most critical element of any analytics project
Remember These Core Principles:
1. Start with the Problem, Not the Solution
Understand the business need before selecting analytical techniques
2. Use Frameworks to Structure Your Thinking
5W1H and SMART-A frameworks transform vague requests into specific questions
3. Validate Data Availability Early
Don't design a project around data you can't access
4. Plan for Reality, Not Best Case
Data issues, delays, and scope changes are normal - plan accordingly
5. Document Everything
Your future self and stakeholders will thank you
Preparing for Your Capstone Project
Action Items for Next Week
✓ Identify Potential Project Topics
Think about business problems you're interested in solving. Consider your professional interests and industry connections.
✓ Research Data Availability
For your potential topics, investigate what data might be available. Contact industry partners if relevant.
✓ Practice Problem Formulation
Take 3 vague business statements and translate them into specific analytics questions using the frameworks from today.
✓ Review Week 1 Content
Refresh your understanding of the 6-phase analytics framework and how Phase 1 connects to later phases.
✓ Set Up Project Documentation Structure
Create folders for your capstone project: data, code, documentation, references.
✓ Complete Workshop Activity
If you didn't finish the in-class workshop, complete the ED wait time scenario analysis.
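For the documentation structure item above, the four folders can be created in one step. This is a small sketch using the standard library; the function name and the root folder name `capstone` are arbitrary, and the demo runs in a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path

# Folder names from the action item: data, code, documentation, references.
FOLDERS = ("data", "code", "documentation", "references")


def create_project_folders(root: Path) -> list[Path]:
    """Create the capstone folder structure under root and return the paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


# Demo in a temporary directory; point root at your own project path instead.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_project_folders(Path(tmp) / "capstone")
    names = sorted(p.name for p in folders)
    all_exist = all(p.is_dir() for p in folders)

print(names)      # ['code', 'data', 'documentation', 'references']
print(all_exist)  # True
```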
Additional Resources
Recommended Reading and Tools
Problem Definition:
"Cracking the PM Interview" by Gayle Laakmann McDowell and Jackie Bavaro - Problem-solving frameworks
"The Lean Startup" by Eric Ries - Validating problem-solution fit
Data Sourcing:
Google Dataset Search (datasetsearch.research.google.com)
Kaggle Datasets (kaggle.com/datasets)
Government open data portals (data.gov, data.gov.au)
Project Management:
CRISP-DM methodology for data mining projects
Trello or Asana for project task tracking
GitHub for code version control and collaboration
Week 2 Summary
Success in analytics projects starts with getting the problem right
We covered:
Translating business language to analytics questions