From Business Question to Analytical Method

Week 6: Methodology Selection
DATA6000 - Capstone Project

Where You Are Now

✓ Assessment 1 Complete

You have successfully identified:

  • Industry context - You understand your chosen sector
  • Business problem - You have a clear question to answer
  • Dataset - You have identified and explored your data sources
  • Descriptive insights - You understand what your data shows

The Critical Gap

You know WHAT to investigate...
But do you know HOW to investigate it?

The Methodology Selection Challenge

What Students Often Do

  • Pick a method they've heard of
  • Choose based on software familiarity
  • Select what seems "impressive"
  • Follow what peers are doing

What You Should Do

  • Match method to question type
  • Verify data meets requirements
  • Justify why it's appropriate
  • Prove it will work on YOUR data

Today's Objective

Learn a systematic framework for selecting and justifying the right analytical methodology for YOUR business question and dataset.

The Methodology Landscape

Four Categories of Analytics

Category | Core Question | Output Type
Descriptive | What happened? | Summary, visualization, dashboard
Diagnostic | Why did it happen? | Relationships, correlations, causes
Predictive | What will happen? | Forecasts, classifications, probabilities
Prescriptive | What should we do? | Recommendations, optimization, decisions

Critical Insight: Most capstone projects require combining 2-3 categories. Descriptive alone is insufficient for your final report!

Descriptive vs Diagnostic Analytics

Descriptive Analytics

Purpose: Summarize and visualize what occurred

Common Methods:

  • Frequency distributions
  • Central tendency measures
  • Data profiling
  • Dashboards
  • Trend visualization

Example Question:

"What were our sales by region last quarter?"

Diagnostic Analytics

Purpose: Understand relationships and root causes

Common Methods:

  • Correlation analysis
  • Regression analysis
  • Root cause analysis
  • Segmentation analysis
  • A/B test analysis

Example Question:

"Why did sales drop in the Northeast region?"

Key Difference: Descriptive shows you the facts. Diagnostic explains the reasons behind those facts.

Predictive vs Prescriptive Analytics

Predictive Analytics

Purpose: Forecast future outcomes

Common Methods:

  • Linear/logistic regression
  • Decision trees
  • Neural networks
  • Time series forecasting
  • Classification algorithms

Example Question:

"Which customers are likely to churn next month?"

Prescriptive Analytics

Purpose: Recommend optimal actions

Common Methods:

  • Optimization algorithms
  • Recommender systems
  • Simulation models
  • Decision analysis
  • Constraint programming

Example Question:

"What retention strategies should we deploy for at-risk customers?"

Key Difference: Predictive tells you what's coming. Prescriptive tells you what to do about it.

Classification Quiz: Part 1

Question 1: "What is the average customer lifetime value across our product categories?"

A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics

Question 2: "Which pricing strategy will maximize our revenue next quarter?"

A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics

Classification Quiz: Part 2

Question 3: "Why do customers abandon their shopping carts at the payment stage?"

A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics

Question 4: "How many support tickets will we receive during Black Friday?"

A) Descriptive Analytics
B) Diagnostic Analytics
C) Predictive Analytics
D) Prescriptive Analytics

Methodology Decision Tree

Start Here: What is your business question asking?

Does your question contain:

  • "What is..." / "How many..." / "Show me..." → Descriptive
  • "Why..." / "What caused..." / "What explains..." → Diagnostic
  • "Will..." / "Likely to..." / "Predict..." / "Forecast..." → Predictive
  • "Should..." / "Recommend..." / "Optimize..." / "Best strategy..." → Prescriptive

Common Mistake: Forcing a predictive model when you really need diagnostic analysis. Always match the method to what the business actually needs to know!
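
The keyword cues in the decision tree can be sketched as a small lookup. This is only a first-pass illustration: the cue phrases come from the list above, while the function name `classify_question`, the checking order, and the "Unclassified" fallback are assumptions made for this example.

```python
import re

# Cue phrases copied from the decision tree. The checking order is an
# assumption: action-oriented categories are checked first, so a question
# containing both "how many" and "will" resolves to Predictive.
CUES = {
    "Prescriptive": ["should", "recommend", "optimize", "best strategy"],
    "Predictive": ["will", "likely to", "predict", "forecast"],
    "Diagnostic": ["why", "what caused", "what explains"],
    "Descriptive": ["what is", "how many", "show me"],
}


def classify_question(question: str) -> str:
    """Return the first category whose cue phrase appears as whole words."""
    text = question.lower()
    for category, cues in CUES.items():
        for cue in cues:
            if re.search(r"\b" + re.escape(cue) + r"\b", text):
                return category
    return "Unclassified"


print(classify_question("Which customers are likely to churn next month?"))
print(classify_question("Why did sales drop in the Northeast region?"))
```

Keyword matching cannot replace judgement. "What were our sales by region last quarter?" contains none of the cues and comes back as "Unclassified", even though it is clearly a descriptive question.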

Common Category Mistakes

Mistake | Example | Why It's Wrong
Mislabeling | Calling correlation analysis "predictive" | Correlation shows relationships (diagnostic), doesn't forecast outcomes
Over-complicating | Using neural networks for simple trend questions | Descriptive visualization would answer the question better
Under-analyzing | Only creating dashboards for a "why" question | Descriptive methods can't explain causation
Jumping ahead | Building recommendations without predictions | Prescriptive requires predictive foundation

The Golden Rule: Your methodology category must directly address what your business question is asking for. Don't choose methods that produce outputs the business didn't request!

The MATCH Framework

A Systematic Approach to Methodology Selection

MATCH helps you select and justify your analytical methodology systematically, ensuring it's appropriate for both your business question AND your dataset.

M - Methodology Type

Which analytical category does your question require?

A - Analytical Technique

Which specific method within that category will you use?

T - Technical Prerequisites

Does your data meet the requirements for this technique?

C - Connection Strength

How well does this technique address your specific question?

H - How to Measure Success

What metrics will show your analysis worked?

M - Methodology Type

Selection Guide

If Your Question Needs... | Choose This Type
Summary of historical data | Descriptive - Present facts and trends
Explanation of relationships | Diagnostic - Identify correlations and causes
Future outcome estimation | Predictive - Build forecasting or classification models
Action recommendations | Prescriptive - Generate optimal decisions

Case Study: TechStart Accelerator

Business Question: "What factors predict startup success within 24 months?"

Dataset: 211 startup records (funding, team size, industry, market conditions, success/failure)

M - Methodology Type: Predictive Analytics

Rationale: Question asks to "predict" a binary outcome (success/failure)

A - Analytical Technique

Common Techniques by Category

Descriptive

  • Statistical summaries
  • Data visualization
  • Clustering (unsupervised)
  • Profiling

Diagnostic

  • Correlation analysis
  • Regression (explanatory)
  • Hypothesis testing
  • Root cause analysis
  • Segmentation analysis

Predictive

  • Linear/logistic regression
  • Decision trees/Random forest
  • Neural networks
  • Time series forecasting
  • Support vector machines

Prescriptive

  • Recommender systems
  • Optimization algorithms
  • Simulation models
  • Decision trees (prescriptive)

A - Analytical Technique (TechStart): Logistic Regression + Random Forest Classification

Both techniques handle binary classification (success/failure) and can rank feature importance

T - Technical Prerequisites

Key Requirements Checklist

  • Data Type Compatibility: Does your data type match the method's requirements?
    • Numerical, categorical, text, image?
  • Sample Size: Do you have enough observations?
    • Minimum 10-20 observations per variable for most methods
  • Data Quality: Is your data sufficiently clean?
    • Missing values handled? Outliers addressed?
  • Variable Requirements: Do you have the right variables?
    • Target variable for supervised learning? Enough features?
  • Distribution Assumptions: Does data meet statistical assumptions?
    • Normality, independence, homoscedasticity (if required)

T - Technical Prerequisites (TechStart):

  • ✓ Binary target variable (success/failure) - suitable for classification
  • ✓ 211 records with ~8 features (about 26 observations per feature, above the 10-20 rule of thumb) - adequate sample size
  • ✓ Mix of numerical and categorical predictors - logistic regression handles both
  • ✓ No severe class imbalance identified
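
The sample size check for TechStart is simple arithmetic, sketched below. The counts (211 records, roughly 8 features) and the minimum of 10 observations per variable come from the slides; the function names are hypothetical. Note one assumption: each feature is counted once, although a categorical feature expanded into several dummy variables would use up more of the budget.

```python
def observations_per_predictor(n_observations: int, n_predictors: int) -> float:
    """Average number of observations available for each predictor."""
    return n_observations / n_predictors


def meets_rule_of_thumb(n_observations: int, n_predictors: int,
                        minimum_per_predictor: int = 10) -> bool:
    """Check the '10-20 observations per variable' guideline (lower bound)."""
    ratio = observations_per_predictor(n_observations, n_predictors)
    return ratio >= minimum_per_predictor


# TechStart figures from the case study.
ratio = observations_per_predictor(211, 8)
print(f"{ratio:.1f} observations per predictor")      # 26.4
print(meets_rule_of_thumb(211, 8))                    # True
print(meets_rule_of_thumb(211, 8, minimum_per_predictor=20))  # True
```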

C - Connection Strength

Scoring How Well Your Methodology Fits

Score | Connection Strength Criteria
5 - Excellent | Method directly addresses the question type; commonly used for this exact problem in industry; produces actionable outputs
4 - Strong | Method strongly aligns with question; well-documented applications; minor adaptations needed
3 - Adequate | Method can answer the question but not ideal; requires significant interpretation; alternative methods may be better
2 - Weak | Method tangentially related; produces outputs that need substantial translation to answer business question
1 - Poor | Method doesn't align with question type; outputs don't address what was asked

C - Connection Strength (TechStart): 5/5 - Excellent

Classification algorithms directly predict binary outcomes (success/failure), which is exactly what the question asks. Industry standard for this problem type.

H - How to Measure Success

Evaluation Metrics by Methodology Type

Methodology Type | Common Success Metrics
Descriptive | Completeness of summary, clarity of visualizations, stakeholder comprehension
Diagnostic | R² value, correlation coefficients, statistical significance (p-values), effect sizes
Predictive | Accuracy, precision, recall, F1-score, AUC-ROC, RMSE, MAE
Prescriptive | Optimization goal achievement, recommendation acceptance rate, business outcome improvement

H - How to Measure Success (TechStart):

  • Primary: Classification accuracy >75% on test set (compare against the 56% you would get by always predicting the majority class)
  • Secondary: AUC-ROC >0.80 (strong discriminative ability)
  • Business: Identify top 3-5 predictive factors accelerator can influence
  • Validation: K-fold cross-validation to ensure model generalizability
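
To make the classification metrics concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 from true and predicted labels, plus the majority-class baseline. The eight toy labels are invented purely for illustration and are not TechStart results; the function names are hypothetical. In Orange, the Test and Score widget reports comparable metrics.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the largest class."""
    return max(class_counts) / sum(class_counts)


# Toy labels (illustrative only): 1 = success, 0 = failure.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics["accuracy"])  # 0.75

# TechStart class counts from the case study: 118 success, 93 failure.
print(round(majority_baseline([118, 93]), 3))  # 0.559
```

An accuracy target only means something relative to the baseline: a model must clearly beat the accuracy of always guessing the larger class.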

Apply MATCH: Practice Scenario

New Scenario: E-Commerce Returns

Business Question: "Which product characteristics are associated with higher return rates?"

Dataset: 5,000 product records with return rate, price, category, customer ratings, description length, image count

Question: What methodology type should be used?

A) Descriptive - Just show return rates by category
B) Diagnostic - Identify which characteristics correlate with returns
C) Predictive - Forecast future return rates
D) Prescriptive - Recommend which products to discontinue

Think about: The question asks "which characteristics are associated with" - this is asking for relationships and correlations, not predictions or recommendations.

Two Types of Visualization

Descriptive Visualization

Purpose: Show what your data contains

Used in: Assessment 1, exploratory analysis

Examples:

  • Bar chart of category frequencies
  • Pie chart of market share
  • Histogram of age distribution
  • Time series of sales trends

Answers: "What does my data look like?"

Justification Visualization

Purpose: Prove your methodology will work

Used in: Methodology selection, validation

Examples:

  • Scatter plot showing relationship to model
  • Correlation matrix of predictors
  • Distribution plot checking assumptions
  • Class balance visualization

Answers: "Why will my method work on this data?"

Critical Shift: You now need justification visualizations to support your methodology choice. These are different from the descriptive visualizations you created in Assessment 1!

The PROVE Framework

Creating Visualizations That Justify Your Methodology

PROVE helps you create visualizations that demonstrate your chosen methodology is appropriate and will work effectively on YOUR specific dataset.

P - Pattern Existence

Show that the patterns you want to model actually exist in your data

R - Relationship Strength

Demonstrate the relationships are strong enough to model

O - Outlier Impact

Show that anomalies won't break your analytical method

V - Variable Distribution

Prove your variables meet the method's statistical assumptions

E - Enough Data

Demonstrate you have sufficient data volume and quality

P - Pattern Existence

Show the Pattern You Want to Model Actually Exists

Purpose: Before committing to a methodology, prove your data contains the patterns, trends, or relationships you intend to analyze.

Orange Visualizations to Create:

  • Scatter Plot: Show relationships between continuous variables (X-Y patterns)
  • Box Plot by Category: Show how target variable differs across groups
  • Line Chart: Show temporal patterns exist in time series data
  • Heat Map: Show patterns across multiple dimensions

TechStart Example:

Visualization: Box plot of "Funding Amount" grouped by "Success Status"

What it proves: Successful startups show systematically different funding patterns than failed ones - confirming the pattern exists to model.

Orange Widgets: File → Data Table → Box Plot (set funding as variable, success as grouping)
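
The numbers behind a grouped box plot can also be computed directly, which helps when you need to quote them in your justification. The sketch below is an assumption-laden illustration: the funding values are invented, the field names `funding` and `status` are hypothetical, and quartiles use the "inclusive" method of Python's `statistics.quantiles`, which may differ slightly from what Orange displays.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Five-number summary: the values a box plot draws."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2,
            "q3": q3, "max": max(values)}


def summarize_by_group(rows, value_key, group_key):
    """Group rows by a categorical field and summarize a numeric field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: box_plot_summary(vals) for group, vals in groups.items()}


# Invented funding amounts (in $ thousands), for illustration only.
rows = (
    [{"funding": f, "status": "success"} for f in [500, 800, 1200, 1500, 2000]]
    + [{"funding": f, "status": "failure"} for f in [100, 200, 300, 400, 900]]
)

summary = summarize_by_group(rows, "funding", "status")
for status, stats in summary.items():
    print(status, stats)
```

If the boxes (q1 to q3) of the two groups barely overlap, as in this toy data, the pattern is visible enough to be worth modeling.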

R - Relationship Strength

Demonstrate Relationships Are Strong Enough to Model

Purpose: Prove that relationships between your variables are statistically meaningful and strong enough to support your chosen analytical method.

Orange Visualizations to Create:

  • Correlation Matrix: Show correlation coefficients between all numeric variables
  • Scatter Plot with Regression: Show fit quality (R² value)
  • Feature Importance Chart: Show which variables have strongest predictive power
  • Mosaic Plot: Show associations between categorical variables

TechStart Example:

Visualization: Correlation matrix showing relationships between funding, team size, market size, and success

What it proves: Moderate-to-strong correlations (|r| > 0.3) between predictors and target, confirming relationships worth modeling.

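Orange Widgets: File → Correlations (lists pairwise Pearson or Spearman correlations for numeric variables); use File → Mosaic Display separately for associations between categorical variables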
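
For reference, the correlation coefficient r quoted in this section can be computed by hand. The following is a minimal sketch of Pearson's r in plain Python. The six data points are invented for illustration, and coding the target as 0/1 is an assumption that lets the same formula cover a numeric predictor against a binary outcome.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Invented values: funding (arbitrary units) and success coded as 1/0.
funding = [1, 2, 3, 4, 5, 6]
success = [0, 0, 1, 0, 1, 1]

r = pearson_r(funding, success)
print(f"r = {r:.2f}")
print("worth modeling" if abs(r) > 0.3 else "weak relationship")
```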

O - Outlier Impact

Show Anomalies Won't Break Your Method

Purpose: Identify and assess whether outliers or anomalies in your data will negatively impact your chosen methodology.

Orange Visualizations to Create:

  • Box Plot with Outliers: Show distribution and outlier locations
  • Scatter Plot with Highlighting: Identify extreme observations
  • Distribution Plot: Show z-scores or standard deviations
  • Outlier Table: List observations beyond 3 standard deviations

TechStart Example:

Visualization: Box plot of "Team Size" showing 3 startups with teams >100 people

What it proves: While outliers exist, they represent only 1.4% of data. Decision: Keep them as they represent legitimate large startups, and random forest is robust to outliers.

Orange Widgets: File → Box Plot → Data Table (filter outliers)
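
The "beyond 3 standard deviations" rule and the 1.4% figure can be reproduced with a few lines. This is a sketch under stated assumptions: the team sizes below are invented, the function names are hypothetical, and the z-score uses the population standard deviation. Only the 3-of-211 share comes from the TechStart example.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` SDs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]


def outlier_share(n_outliers, n_total):
    """Fraction of the dataset flagged as outliers."""
    return n_outliers / n_total


# Invented team sizes: twenty small teams and one very large one.
team_sizes = [5] * 10 + [10] * 10 + [150]
print(z_score_outliers(team_sizes))  # [150]

# TechStart: 3 outlying startups out of 211 records.
print(f"{outlier_share(3, 211):.1%}")  # 1.4%
```

A share well below the 10% red-flag level supports the decision to keep the outliers rather than remove them.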

V - Variable Distribution

Prove Variables Meet Method Assumptions

Purpose: Demonstrate your data distributions meet the statistical assumptions required by your chosen methodology.

Orange Visualizations to Create:

  • Histogram with Normal Curve: Check for normality (if required)
  • Q-Q Plot: Visual normality check (points close to the diagonal line suggest a normal distribution)
  • Distribution Comparison: Compare distributions across groups
  • Data Info Table: Show statistics (mean, median, skewness, kurtosis)

TechStart Example:

Visualization: Histograms of key numeric predictors (funding, team size, market size)

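What it proves: Variables are right-skewed (non-normal), which is acceptable for logistic regression and random forest, since neither assumes normally distributed predictors. Linear regression assumes normally distributed residuals, so strongly skewed variables would often need a transformation (for example, a log transform) first.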

Orange Widgets: File → Distributions → Data Info
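
Skewness can be quantified as well as eyeballed from a histogram. The sketch below computes the moment-based skewness coefficient and shows how a log transform reduces right skew. The funding values are invented, and `log1p` (log of 1 + x) is one common choice of transformation, not a requirement of any particular method.

```python
from math import log1p
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness: > 0 means right-skewed, about 0 is symmetric."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Invented right-skewed funding amounts (one very large value).
funding = [1, 1, 2, 2, 3, 3, 4, 5, 8, 40]
logged = [log1p(v) for v in funding]

print(f"raw skewness:    {skewness(funding):.2f}")
print(f"logged skewness: {skewness(logged):.2f}")
```

The logged values are still right-skewed, but noticeably less so, which is the usual reason for transforming before a method that is sensitive to skew.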

E - Enough Data

Demonstrate Sufficient Data Volume and Quality

Purpose: Prove you have adequate data volume, completeness, and balance to support your methodology.

Orange Visualizations to Create:

  • Data Info Summary: Show total observations and completeness
  • Class Balance Chart: Show distribution of target variable categories
  • Missing Data Matrix: Show completeness by variable
  • Sample Size Table: Show observations per category/group

Sample Size Rules of Thumb

  • Regression: 10-20 obs per predictor
  • Classification: 50+ per class
  • Neural networks: 1000+ observations
  • Time series: 50+ time points

TechStart Example

Observations: 211 total

Class balance: 118 success (56%), 93 failure (44%)

What it proves: Adequate sample size for 8 predictors, well-balanced classes
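
The class balance figures above follow directly from the counts. As a minimal sketch (function names are hypothetical; the 118/93 split and the "50+ per class" rule of thumb come from the slides):

```python
from collections import Counter


def class_balance(labels):
    """Share of observations in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def smallest_class_size(labels):
    """Number of observations in the rarest class."""
    return min(Counter(labels).values())


# TechStart target variable rebuilt from the reported counts.
labels = ["success"] * 118 + ["failure"] * 93

balance = class_balance(labels)
for label, share in balance.items():
    print(f"{label}: {share:.0%}")        # success: 56%, failure: 44%

print(smallest_class_size(labels) >= 50)  # True: meets the 50+ per class guideline
```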

Bad vs Good Justification Visualizations

Example 1: Proving Relationship Strength

❌ Bad: Descriptive

Bar chart showing: "Average funding by industry"

Why it's wrong: Shows what funding looks like, doesn't prove relationships exist between funding and success

✓ Good: Justification

Scatter plot showing: "Funding vs Success Rate with correlation r=0.58"

Why it's right: Proves a moderate relationship exists, justifies including funding as a predictor

Example 2: Proving Pattern Existence

❌ Bad: Descriptive

Pie chart showing: "Proportion of startups by industry sector"

Why it's wrong: Shows industry distribution, doesn't prove industries have different success patterns

✓ Good: Justification

Box plot showing: "Success rate by industry showing significant variation (ANOVA p<0.01)"

Why it's right: Proves industry matters for success, justifies including it in the model

Red Flags: When to Change Methodology

Warning Signs Your Method Won't Work

Red Flag | What It Means | Action Required
No visible patterns | Your PROVE visualizations show random scatter, no relationships | Reconsider if prediction is possible; may need more diagnostic work first
Extreme outliers | Outliers represent >10% of data or heavily skew results | Choose robust methods, transform data, or handle outliers explicitly
Insufficient sample size | Fewer than 10 observations per variable | Simplify model, reduce variables, or collect more data
Severe class imbalance | One class <10% of observations | Use resampling, different algorithms, or adjust evaluation metrics
Assumption violations | Data doesn't meet method's statistical requirements | Transform data, use non-parametric alternatives, or change methods
Weak relationships | All correlations |r| < 0.2 | Consider feature engineering, interaction terms, or different variables

Don't Force It: If your PROVE visualizations reveal fundamental problems, it's better to change methodology now than to proceed with a flawed approach!
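
The numeric red flags can be turned into a quick checklist. The sketch below is a hypothetical helper: the thresholds (10 observations per variable, 10% minority class, 10% outliers, |r| < 0.2) come from the red-flag table, while the function name and the example correlation values are assumptions for illustration. "No visible patterns" and "Assumption violations" still need a visual check and are not covered.

```python
def red_flags(n_observations, n_variables, minority_class_share,
              outlier_share, correlations):
    """Return the names of the numeric red flags raised by the summary stats."""
    flags = []
    if n_observations / n_variables < 10:
        flags.append("Insufficient sample size")
    if minority_class_share < 0.10:
        flags.append("Severe class imbalance")
    if outlier_share > 0.10:
        flags.append("Extreme outliers")
    if correlations and all(abs(r) < 0.2 for r in correlations):
        flags.append("Weak relationships")
    return flags


# TechStart figures from the case study; the correlations are illustrative.
techstart = red_flags(
    n_observations=211,
    n_variables=8,
    minority_class_share=93 / 211,
    outlier_share=3 / 211,
    correlations=[0.35, 0.58],
)
print(techstart)  # []

# A hypothetical problem dataset that trips every check.
print(red_flags(40, 8, 0.05, 0.15, [0.10, -0.05]))
```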

Workshop Task: Your Project

Apply MATCH + PROVE to Your Assessment 1 Project

You will now work through both frameworks using YOUR actual business question and dataset.

Part 1: Complete MATCH Analysis (20 minutes)

  • Identify your methodology type based on your business question
  • Select 1-2 specific analytical techniques you'll use
  • List technical prerequisites your data must meet
  • Score your connection strength (1-5) with justification
  • Define success metrics for your analysis

Part 2: Plan PROVE Visualizations (20 minutes)

  • Identify which 3 PROVE elements are most critical for YOUR methodology
  • Specify exact visualizations you'll create in Orange
  • Describe what each visualization will prove about your data
  • List any potential red flags you need to check

Deliverable: Complete the "Methodology Justification Template" provided and be prepared to share your methodology choice with peers.

Summary & Next Steps

What You Learned Today

  • Four analytical categories (Descriptive, Diagnostic, Predictive, Prescriptive) and how to classify business questions
  • MATCH Framework for systematic methodology selection with clear justification
  • PROVE Framework for creating visualizations that justify your methodology will work on YOUR data
  • Red flags that indicate when to change your analytical approach

Your Action Items This Week

  1. Complete MATCH analysis for your project
  2. Create 3-5 PROVE visualizations in Orange Data Mining
  3. Document any red flags and mitigation strategies
  4. Begin drafting your methodology justification
  5. Bring questions about your methodology choice to next workshop

Remember

The right methodology is one that matches your question, fits your data, and can be justified with evidence.