DATA6000

Capstone: Industry Case Studies

Analytics Industry Research Project: Roadmap

Welcome to Your Capstone Journey

Your role: Industry Analytics Consultant

What Makes Capstone Different?

Real-World Problem Solving

Not textbook exercises - you'll tackle actual industry challenges

Integration of ALL Previous Subjects

  • DATA4000 - Analytics language and roles
  • DATA4100 - Analyzing and visualizing data
  • DATA4200 - Analyzing and managing data
  • DATA4300 - Ethics, privacy, and security
  • DATA4400 - Time series forecasting
  • DATA4500 - Social media analytics
  • DATA4600 - Managing projects
  • DATA4800 - Machine learning
  • DATA5000 - Programming AI for business analytics

Your Original Contribution

You will add new knowledge to your chosen industry through original research and analysis

The Capstone Challenge

Scenario:

You are hired as an analytics consultant. An industry has a problem. Your job is to research, analyze, and provide actionable recommendations.

This is YOUR project. You drive the direction.

Assessment 1: Literature Review

What

Explore an industry and formulate YOUR unique business question

Why

Understanding what's already known before you contribute something new

You Will

  • Choose an industry and research key business problems
  • Evaluate existing analysis on these problems
  • Reflect on data and analytics methodologies employed
  • Evaluate types of data sources available
  • Create a unique business question based on identified problems

Deliverable

Written report exploring industry background and analytics applications

Assessment Structure and Timeline

Assessment 1

Literature Review

Explore industry and formulate question

Assessment 2

Methodology Peer Review

8-minute presentation + feedback

Assessment 3

Industry Research Report

Complete project + elevator pitch

Important

Each assessment builds on the previous one. Start early on your literature review!

The 6-Phase Analytics Project Framework

Every successful analytics project follows a journey

1. Problem Definition & Data Sourcing
2. Data Processing & Management
3. Analytics Techniques
4. Visualization & Evaluation
5. Communication & Recommendations
6. Ethics, Privacy & Security

Key Point: This is an iterative process, not a linear one!

Phase 1: Asking the Right Question

The Business Problem ≠ The Analytics Question

Example

Business Problem: "We're losing customers"

Analytics Question: "Which customer segments have the highest churn probability in the next 90 days?"

SMART Questions Framework

  • Specific: Clear and focused
  • Measurable: Quantifiable outcomes
  • Achievable: Realistic with available resources
  • Relevant: Aligned with business goals
  • Time-bound: Has a defined timeframe

Data Sourcing Strategy

Internal vs External

Internal: Company databases, CRM, transaction logs

External: Market research, public datasets, APIs

Structured vs Unstructured

Structured: Databases, spreadsheets, tables

Unstructured: Text, images, videos, social media

Primary vs Secondary

Primary: Collected directly (surveys, experiments)

Secondary: Existing data from other sources

Real-time vs Historical

Real-time: Streaming data, live feeds

Historical: Archived data, trend analysis

Data Source Evaluation Matrix

Optimal Data Sources Have:

Accuracy

Data is correct and error-free

Relevance

Directly relates to your question

Accessibility

You can obtain and use it

Cost-effectiveness

Worth the investment

Ethical Data Sources Must Have:

Consent

Proper permissions obtained

Privacy

Personal information protected

Transparency

Clear data provenance

Fairness

Represents all groups appropriately

Quiz 1: Problem Definition

Scenario: A retail company wants to reduce customer churn. They have access to:

  • Customer transaction history
  • Website clickstream data
  • Customer service call logs
  • Social media mentions

Question: Which data source should be prioritized FIRST for initial analysis?

A) Social media mentions - to understand brand sentiment
B) Customer transaction history - to identify purchasing patterns of churned vs. retained customers
C) Website clickstream data - to see user engagement
D) Customer service call logs - to find complaints

Explanation: Transaction history provides the ground truth of customer behavior (who churned vs. who stayed) and reveals purchasing patterns. This forms your baseline before exploring other signals.
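
A minimal sketch of how transaction history can supply the churned vs. retained label. It assumes hypothetical (customer_id, purchase_date) records and reuses the 90-day window from the earlier example question; the field names, dates, and cutoff are illustrative assumptions, not a prescribed definition of churn.

```python
from datetime import date, timedelta

# Hypothetical transaction records: (customer_id, purchase_date).
transactions = [
    ("C001", date(2024, 1, 5)),
    ("C001", date(2024, 5, 20)),
    ("C002", date(2024, 1, 12)),
    ("C003", date(2024, 6, 1)),
]

def label_churn(records, as_of, window_days=90):
    """Label a customer as churned if their last purchase is older than the window."""
    last_purchase = {}
    for customer_id, purchased_on in records:
        previous = last_purchase.get(customer_id)
        if previous is None or purchased_on > previous:
            last_purchase[customer_id] = purchased_on
    cutoff = as_of - timedelta(days=window_days)
    return {cid: last < cutoff for cid, last in last_purchase.items()}

labels = label_churn(transactions, as_of=date(2024, 6, 30))
print(labels)  # True means "churned" under this assumed definition
```

Once every customer has a label, the other sources (clickstream, call logs, social media) can be compared between the two groups.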

Phase 2: From Raw Data to Analysis-Ready Data

Key Activities

1. Data Cleaning

  • Missing values strategy (remove, impute, or flag)
  • Outlier detection and handling
  • Consistency checks across sources
  • Duplicate removal

2. Data Transformation

  • Normalization/Standardization (scaling values)
  • Feature engineering (creating new variables)
  • Categorical encoding (converting text to numbers)
  • Aggregation and summarization

3. Data Storage

  • Structured data: SQL databases, spreadsheets
  • Unstructured data: NoSQL databases, data lakes
  • Your project: Likely spreadsheets, but understand the principles!
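
The cleaning and transformation steps above can be sketched in a few lines. This is an illustrative example only: the records, the field names (attendance, mode), and the choice of median imputation are assumptions, and a real project would choose each strategy to suit its data.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "attendance": 0.90, "mode": "online"},
    {"id": 2, "attendance": None, "mode": "campus"},
    {"id": 2, "attendance": None, "mode": "campus"},  # exact duplicate
    {"id": 3, "attendance": 0.50, "mode": "online"},
]

def clean_and_transform(rows):
    # 1. Duplicate removal: keep the first record seen for each id.
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))

    # 2. Missing values: impute with the median and flag the imputed rows.
    known = [r["attendance"] for r in unique if r["attendance"] is not None]
    fill = median(known)
    for r in unique:
        r["attendance_imputed"] = r["attendance"] is None
        if r["attendance"] is None:
            r["attendance"] = fill

    # 3. Normalization: min-max scale attendance to the range [0, 1].
    low = min(r["attendance"] for r in unique)
    high = max(r["attendance"] for r in unique)
    span = (high - low) or 1.0  # avoid division by zero when all values match
    for r in unique:
        r["attendance_scaled"] = (r["attendance"] - low) / span

    # 4. Categorical encoding: one-hot encode the study mode.
    for mode in sorted({r["mode"] for r in unique}):
        for r in unique:
            r[f"mode_{mode}"] = int(r["mode"] == mode)
    return unique

cleaned = clean_and_transform(raw)
for row in cleaned:
    print(row)
```

Flagging imputed rows keeps the decision visible later, which helps when documenting trade-offs.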

Automation vs Manual Processing

When to Automate

  • Repetitive tasks
  • Large data volumes
  • Regular updates needed
  • Consistent rules can be defined

When to Stay Manual

  • Exploratory phase
  • Complex judgment calls required
  • Small datasets
  • One-time analysis

Trade-offs to Consider

Efficiency vs Control

Automation is faster but you lose hands-on understanding

Speed vs Understanding

Manual work takes longer but builds intuition

Cost vs Customization

Automation requires upfront investment

Quiz 2: Analytics Technique Selection

Scenario: You're analyzing student mental health data and want to predict which students are at risk of dropping out. You have:

  • Numerical data: attendance rates, grades, login frequency
  • Categorical data: course type, study mode, demographics
  • Historical data: which students dropped out in previous years

Question: Which approach would be MOST appropriate?

A) Time series forecasting only, since you're predicting future behavior
B) Classification algorithms like Random Forest or Logistic Regression
C) Simple descriptive statistics and visualization
D) Natural language processing on student feedback only

Explanation: This is a binary classification problem (drop out: Yes/No) with labeled historical data. Classification algorithms can handle both numerical and categorical features to identify patterns that predict dropout risk.

Phase 3: Choosing Your Analytical Approach

The Three Categories

1. Descriptive Analytics (What happened?)

  • Summary statistics
  • Pattern identification
  • Trend analysis

When to use: Understanding historical data, initial exploration

2. Predictive Analytics (What will happen?)

  • Classification: Predicting categories (Will a student drop out? Yes/No)
  • Regression: Predicting numbers (What will sales be next quarter?)

When to use: Forecasting, risk assessment, decision support

3. Prescriptive Analytics (What should we do?)

  • Optimization
  • Recommendation systems
  • Simulation

When to use: Action-oriented decisions, resource allocation

Matching Techniques to Questions

Do you have historical outcomes (labels)?

  • YES → Supervised Learning. Are you predicting categories or numbers?
      - Categories → Classification (Logistic Regression, Random Forest, SVM)
      - Numbers → Regression (Linear Regression, Time Series)
  • NO → Unsupervised Learning
      - No labels → Clustering (K-means, Hierarchical)
      - Looking for patterns → Association Rules (Market Basket Analysis)
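
The questions above can also be written as a small helper. This is only a sketch of the decision logic; the function name and the argument values ("category", "number", "co-occurrence") are hypothetical labels chosen for illustration.

```python
def suggest_technique(has_labelled_outcomes, target_type=None, goal=None):
    """Map answers to the questions above onto a family of techniques.

    target_type: "category" or "number" (used when outcomes are labelled).
    goal: "co-occurrence" for association rules; anything else means clustering.
    """
    if has_labelled_outcomes:
        if target_type == "category":
            return "Classification (e.g. Logistic Regression, Random Forest, SVM)"
        if target_type == "number":
            return "Regression (e.g. Linear Regression, Time Series)"
        raise ValueError("target_type must be 'category' or 'number'")
    if goal == "co-occurrence":
        return "Association Rules (e.g. Market Basket Analysis)"
    return "Clustering (e.g. K-means, Hierarchical)"

# Example: predicting whether a student drops out (Yes/No) from past outcomes.
print(suggest_technique(True, target_type="category"))
```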

Types of Analysis: Backwards-Looking

Purpose

Understanding what has already happened by finding patterns in existing data

For Forecasting Projects

  • Trend Analysis: Is the pattern going up or down over time?
  • Seasonality Detection: Are there regular cyclical patterns?
  • Variable Importance: Which factors influence the outcome most?

For Prediction Projects

  • Feature Analysis: Which variables are most predictive?
  • Pattern Recognition: What distinguishes different groups?

Training Phase Outputs

Outputs that describe the quality of your model:

  • Forecasting: Error measures (RMSE, MAE, MAPE)
  • Prediction: Accuracy, Confusion Matrix, ROC curves
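
For reference, the three forecasting error measures can be computed directly from actual and predicted values. The sketch below uses made-up numbers and a hypothetical function name; it follows the standard definitions of MAE, RMSE, and MAPE.

```python
from math import sqrt

def forecast_errors(actual, predicted):
    """Return MAE, RMSE and MAPE for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    # MAPE is undefined when an actual value is zero.
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / len(errors)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape}

actual = [100, 200, 400]       # made-up observed values
predicted = [110, 180, 400]    # made-up forecasts
metrics = forecast_errors(actual, predicted)
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, while MAPE expresses the error as a percentage, which is often easier to explain to business stakeholders.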

Types of Analysis: Forwards-Looking

Purpose

Saying something about the future - this is where business value is created

For Forecasting

Output: Future forecasts with confidence intervals

Business Use: Planning, budgeting, resource allocation

For Prediction

Output: Deploying the model to make predictions on new data

Business Use: Real-time decision making, automation, risk scoring

Your Project Must Include Forward-Looking Analysis

Your capstone project needs to answer a business question about the future, not just describe the past. Expected results can be inferred from the testing accuracy and the confusion matrix.

Interpreting Forecasts (Forward-Looking)

Understanding Confidence Intervals

The band around your forecast line shows the range within which the actual result is expected to fall at a stated confidence level (for example, 95%)

What Should the Business Do?

Lower Bound

"Worst case" scenario

Plan for minimum expected outcome

Middle Line

Most likely scenario

Base case planning

Upper Bound

"Best case" scenario

Opportunity planning

Important Caveats

  • Confidence intervals reflect the uncertainty captured by the model, not absolute truth
  • Unforeseen events can have larger impact than the model predicts
  • Models work best when the future resembles the past
  • Always monitor: What does the business do with actual results?
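
To make the lower bound, middle line, and upper bound concrete, here is a deliberately simple sketch. It forecasts the next value as the historical mean and builds a band from the spread of past values, assuming roughly normal and independent errors. The figures are made up, and real forecasting methods (covered in DATA4400) compute their intervals differently.

```python
from statistics import NormalDist, mean, stdev

def mean_forecast_with_interval(history, confidence=0.95):
    """Forecast the next value as the historical mean, with a rough interval.

    Assumes roughly normal, independent errors and ignores uncertainty in
    the estimated mean itself, so the band is illustrative only.
    """
    centre = mean(history)
    spread = stdev(history)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * spread, centre, centre + z * spread

monthly_sales = [120, 130, 125, 135, 128, 132]  # made-up figures
lower, middle, upper = mean_forecast_with_interval(monthly_sales)
print(f"lower={lower:.1f}, middle={middle:.1f}, upper={upper:.1f}")
```

Asking for a higher confidence level widens the band: more certainty that the result falls inside comes at the price of a less precise range.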

Interpreting Prediction (Forward-Looking)

Deploying Models for Prediction

How is the business going to use the model in practice?

Use Cases

Automation

Replace or augment human decisions

Key question: How does model performance compare to human performance?

Predicting Behavior

Anticipate customer, employee, or system actions

Key question: What is the cost of being wrong, and is it worth it overall?

Testing Results Tell You

  • How useful the model will be in practice
  • Which types of cases it handles well vs. poorly
  • Whether the model is ready for deployment
  • What monitoring and maintenance will be needed

Critical: Always consider the real-world consequences of model errors

Phase 4: Making Results Meaningful

Visualization Principles

1. Know Your Audience

  • Technical team: Detailed charts, statistical metrics
  • Executives: High-level dashboards, key insights
  • Mixed audience: Layered approach with drill-down capability

2. Choose the Right Chart Type

  • Comparison: Bar charts
  • Trend over time: Line charts
  • Part-to-whole: Pie charts (use sparingly!)
  • Relationship: Scatter plots
  • Distribution: Histograms, box plots

3. Ethical Visualization

Avoid:

  • Misleading axes (truncated or stretched)
  • Cherry-picked data that supports only one narrative
  • Overly complex charts that obscure truth

Include:

  • Context and comparisons
  • Limitations and uncertainty
  • Clear labels and legends

Model Evaluation: Beyond Accuracy

For Classification Models

Accuracy

Overall correctness

Can be misleading with imbalanced data!

Precision

Of predicted positives, how many are actually positive?

Important when false positives are costly

Recall (Sensitivity)

Of actual positives, how many did we catch?

Important when false negatives are costly

F1-Score

Balance between precision and recall

Useful when you need both to be good
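
The four metrics above follow directly from the confusion matrix counts. The sketch below uses the standard formulas with a made-up, heavily imbalanced example to show how accuracy alone can mislead; the function name and the counts are illustrative assumptions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up imbalanced case: 50 positives among 1,000 cases, and a "model"
# that predicts negative every single time.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)
print(always_negative)
```

Here accuracy is 95% even though the model never finds a single positive case, so recall and F1 are both zero.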

Context Matters - Examples

High Recall Priority

Cancer detection: Catch all cases, even with false positives

At-risk student identification: Better to offer support unnecessarily than miss someone

High Precision Priority

Spam detection: Avoid marking important emails as spam

Fraud detection in payments: Don't block legitimate transactions

Quiz 3: Model Evaluation

Scenario: Your predictive model for identifying at-risk students shows:

  • 85% accuracy
  • But when deployed, it correctly identifies only 40% of students who actually drop out (the ones you're trying to help!)
  • However, 95% of students it flags as "at risk" do actually drop out

Question: What's the issue?

A) The model has high precision but low recall - it's too conservative in flagging students
B) The model is overfitted to the training data
C) The visualization is incorrect
D) The data preprocessing was inadequate

Explanation: High precision (95% of flagged students do drop out) but low recall (only catching 40% of actual dropouts) means the model is conservative. In this context, higher recall is better - it's better to offer support to students who might not need it than to miss students who do need help.
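
One common way to make a conservative model flag more students is to lower the risk-score threshold at which a student is flagged. The sketch below uses ten made-up (score, outcome) pairs to show the trade-off; the scores and thresholds are purely illustrative, and it assumes the model outputs a risk score rather than only a Yes/No label.

```python
# Made-up (risk_score, actually_dropped_out) pairs for ten students.
scored = [
    (0.95, True), (0.85, True), (0.70, True), (0.60, False), (0.55, True),
    (0.45, True), (0.35, False), (0.30, False), (0.20, False), (0.10, False),
]

def precision_recall_at(threshold, scored_cases):
    """Flag every case at or above the threshold, then score the flags."""
    tp = sum(1 for score, actual in scored_cases if score >= threshold and actual)
    fp = sum(1 for score, actual in scored_cases if score >= threshold and not actual)
    fn = sum(1 for score, actual in scored_cases if score < threshold and actual)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.8, 0.5, 0.4):
    precision, recall = precision_recall_at(threshold, scored)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, the strictest threshold flags only the clearest cases (high precision, low recall), while lower thresholds catch more of the students who drop out at the cost of some unnecessary flags.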

Phase 5: Telling the Data Story

The Three-Part Structure

1. Context

What problem were you solving?

Why does it matter?

2. Approach

How did you solve it?

(Brief on methods)

3. Impact

What should stakeholders DO with this information?

What's the expected outcome?

Tailoring Communication

Technical Stakeholders

  • Methods and algorithms
  • Validation approach
  • Limitations and assumptions
  • Technical details in appendix

Business Stakeholders

  • Key insights and findings
  • Actionable recommendations
  • Expected ROI or impact
  • Next steps

The Elevator Pitch

60 Seconds to Make an Impact

Structure

Problem: What challenge did you address? (15 seconds)

Solution: What did you do about it? (20 seconds)

Impact: What difference does it make? (25 seconds)

Rules

  • No jargon - your grandmother should understand it
  • One memorable takeaway - what do you want them to remember?
  • Practice - timing matters!

Recommendation Framework

Recommendations must be:

Specific

Too vague: "Improve student support"

Specific: "Implement a weekly check-in program for students identified as at-risk"

Actionable

Stakeholders can actually do this with their available resources and authority

Evidence-Based

Tied directly to your findings - show the connection between data and recommendation

Realistic

Consider constraints: budget, time, technical capability, organizational culture

Measurable

How will you know if it worked? Define success metrics

Phase 6: Ethics Isn't an Afterthought

Ethics must be considered at EVERY phase of your project

The Five Ethical Pillars

1. Informed Consent

  • Do participants know their data is being used?
  • Do they understand HOW it's being used?
  • Did they have a genuine choice to opt out?

2. Privacy Protection

  • De-identification: Remove direct identifiers
  • Anonymization: Make re-identification practically impossible
  • Data minimization: Collect only what you need
  • Secure storage and transmission
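
A minimal sketch of de-identification and data minimization, assuming hypothetical field names (student_id, name, email). Note that replacing an ID with a salted hash is pseudonymization, not full anonymization: the remaining fields may still allow re-identification when combined.

```python
import hashlib
import secrets

DIRECT_IDENTIFIERS = {"name", "email"}  # hypothetical field names

def pseudonymize(record, salt, id_field="student_id"):
    """Drop direct identifiers and replace the ID with a salted hash."""
    cleaned = {k: v for k, v in record.items()
               if k not in DIRECT_IDENTIFIERS and k != id_field}
    digest = hashlib.sha256(salt + str(record[id_field]).encode("utf-8"))
    cleaned["pseudonym"] = digest.hexdigest()[:16]
    return cleaned

salt = secrets.token_bytes(16)  # keep secret and store separately from the data
record = {"student_id": 1042, "name": "A. Student",
          "email": "a@example.edu", "attendance": 0.82}
safe = pseudonymize(record, salt)
print(safe)
```

Using the same salt gives the same pseudonym for the same student, so records can still be linked across tables without exposing the original ID.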

3. Fairness and Bias

  • Is your model fair across all demographic groups?
  • Are you perpetuating historical biases?
  • Is your training data representative?
  • Who benefits? Who might be harmed?

4. Transparency

  • Can you explain your model's decisions?
  • Are stakeholders clear about limitations?
  • Is the methodology documented?
  • Can others reproduce your work?

5. Accountability

  • Who is responsible if something goes wrong?
  • What's the process for addressing harm?
  • How will the model be monitored after deployment?
  • What are the appeal mechanisms?

Red Flags in Analytics Projects

Stop and Reassess If:

  • Stakeholders want to hide how you got results
  • You're using data without clear consent
  • Your model performs very differently across demographic groups
  • You can't explain your model's key decisions
  • The business application could directly harm individuals
  • You're asked to ignore negative findings
  • Privacy protections are treated as optional

When in doubt, consult your facilitator

Quiz 4: Ethics in Practice

Scenario: A university wants to implement facial recognition technology during online lectures to:

  • Track student attendance automatically
  • Detect engagement levels (attentive vs. distracted)
  • Identify students who might need additional support

Question: What is the PRIMARY ethical concern that must be addressed FIRST?

A) The technology might not be accurate enough
B) Students haven't given informed consent for biometric data collection and continuous monitoring
C) The data storage costs are too high
D) Faculty members might not understand how to use the system

Explanation: Biometric data (facial recognition) is highly sensitive personal information. Continuous monitoring raises significant privacy concerns. Students must give informed consent, understanding exactly what data is collected, how it's used, who has access, and their right to opt out. Accuracy, costs, and usability are important, but consent is the foundational ethical requirement.

Blue Sky Case Study: Student Mental Health

What is "Blue Sky" Thinking?

Before diving into real constraints, let's think BIG. What if you had access to any data and any technology?

Purpose

  • Understand the full potential of what analytics could do
  • Then make informed trade-offs based on real-world constraints
  • Spark creative thinking
  • Identify what's truly necessary vs. "nice to have"

The Case Study

Context: You're hired by a university

Problem: Student mental health concerns, especially with remote learning

Task: Investigate and provide recommendations

Brainstorming Activity

How does remote learning affect you?

Think about challenges you've faced or observed

Common Themes We Often Hear:

  • Feeling isolated from peers
  • Screen fatigue from continuous video calls
  • Difficulty with time management
  • Work-life balance challenges
  • Lack of informal peer interaction
  • Technical difficulties creating stress
  • Reduced motivation

We'll use these insights to inform our Blue Sky project

Blue Sky: Phase 1 - Problem Definition & Data

Business Question

How can we identify and support students struggling with mental health in remote learning?

Blue Sky Data Sources

Academic

Grades, attendance, assignment submissions, LMS engagement patterns

Behavioral

Login patterns, time-of-day activity, video engagement metrics

Communication

Discussion forum sentiment, email communication patterns

Wellbeing

Self-reported surveys, support service usage

Technology (Blue Sky!)

AI emotion detection (facial expressions), voice stress analysis

Environmental

Home study setup quality, internet stability

Discussion Point

What data is actually NECESSARY vs. what would be "nice to have"?

Blue Sky: Phase 2 - Data Processing

Challenges

  • Multiple data sources need integration
  • Privacy-sensitive information requiring protection
  • Real-time vs. batch processing decisions
  • Missing data from students who disengage
  • Varying data formats and quality

Blue Sky Tools

Automated Pipelines

Data flows automatically from sources to centralized system

Cloud Storage

Secure, encrypted storage with role-based access

Real-time Processing

Immediate alerts for critical indicators

Data Quality Monitoring

Automated checks for completeness and accuracy

Reality Check

In your actual project, you'll likely use spreadsheets and manual processes. But understanding these principles helps you make better decisions.

Blue Sky: Phase 3 - Analytics Techniques

Blue Sky Technologies

Machine Learning for Early Warning

  • Classification models: Predict which students are at high/medium/low risk
  • Features: Attendance trends, grade patterns, engagement metrics, communication frequency
  • Output: Risk scores updated daily

Natural Language Processing

  • Sentiment analysis: Analyze discussion posts and emails for emotional tone
  • Identify: Signs of distress, frustration, or disengagement

Clustering Analysis

  • Group students: With similar behavioral patterns
  • Benefit: Targeted interventions for each group

Prescriptive Analytics (Most Blue Sky!)

  • Recommendation system: Suggest specific interventions based on student profile
  • Virtual reality therapy: Integration for stress management

Expected Outputs

  • Individual student risk scores
  • Trend analysis across cohorts
  • Intervention effectiveness tracking
  • Early warning alerts

Blue Sky: Phase 4 - Visualization & Evaluation

Dashboard Components

Individual View

Student risk indicators (traffic light system)

Recent activity summary

Recommended actions

Cohort View

Class-level trends

Comparison across courses

Time-based patterns

Program View

Institution-wide metrics

Intervention effectiveness

Resource allocation

Evaluation Metrics

  • Prediction accuracy: How accurate are risk predictions?
  • Early detection: Are we identifying students before crisis points?
  • False positive rate: Students flagged who don't need help
  • False negative rate: Students missed who do need help
  • Intervention impact: Do supported students show improvement?

Critical Question

In student wellbeing, is it better to have false positives or false negatives? Why?

Blue Sky: Phase 5 - Communication

Stakeholders and Their Needs

University Leadership

Needs: Budget justification, policy implications

Format: Executive summary with ROI

Faculty

Needs: Practical interventions they can implement

Format: Action-oriented guidelines

Support Services

Needs: Case prioritization, resource allocation

Format: Detailed reports with risk scores

Students

Needs: Transparency about monitoring, opt-out options

Format: Plain language explanations

Key Messages

  • Early identification enables early intervention
  • Data-driven approach supplements (not replaces) human judgment
  • Privacy protections are built-in
  • System requires continuous monitoring and refinement

Blue Sky: Phase 6 - Ethics & Privacy

Critical Ethical Questions

Who has access to individual student data?

  • Should faculty see risk scores?
  • Should peers ever see this information?
  • How long is data retained?

How is data stored and protected?

  • Encryption at rest and in transit
  • Role-based access controls
  • Audit logs of who accessed what

Can students opt out?

  • What happens if a student refuses monitoring?
  • Does opting out disadvantage them?
  • Is participation truly voluntary?

What happens if the model flags someone incorrectly?

  • Stigma from being labeled "at risk"
  • Unnecessary interventions
  • Self-fulfilling prophecies

Are we violating student privacy even with good intentions?

  • Continuous monitoring of behavior
  • Facial expression analysis
  • Right to be left alone

Reality Check: Bringing It Back to Earth

Constraints in YOUR Projects

Data Access

You'll likely use public datasets or simulated data, not sensitive student information

Technical Skills

Work within your skillset, but push yourself to learn one new technique

Time

12 weeks to complete the project - scope accordingly

Resources

Free or student-licensed tools only

The Key Principle

Make informed trade-offs, not compromises

Understand what you're giving up and why. Document your decisions.

What Remains Important

  • Clear problem definition
  • Appropriate methodology for your question
  • Ethical considerations throughout
  • Actionable recommendations
  • Professional communication

Quiz 5: Synthesis and Critical Thinking

Scenario: You're working on a recommendation system to suggest products to online shoppers. Your model shows:

  • 92% accuracy in testing
  • Stakeholders are excited about deployment
  • However, you notice the model primarily recommends popular items that most customers would buy anyway (bestsellers)
  • Niche products that might delight specific customers are rarely recommended

Question: What should you do FIRST?

A) Deploy the model since accuracy is high and stakeholders are happy
B) Re-evaluate using metrics beyond accuracy, such as diversity of recommendations, novelty, and user satisfaction
C) Collect more data to improve the model further
D) Create better visualizations to present to stakeholders

Explanation: High accuracy doesn't mean the model is adding business value. A recommendation system that only suggests what customers would find anyway doesn't help with discovery or engagement. This demonstrates that model evaluation must align with business objectives, not just statistical metrics. Consider diversity, novelty, serendipity, and long-term engagement.
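
Two simple diversity checks can be sketched as follows: how much of the catalog ever gets recommended, and how concentrated recommendations are on a single item. The item names and lists are made up, and these are just two of many possible beyond-accuracy metrics.

```python
from collections import Counter

def catalog_coverage(recommendation_lists, catalog):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = {item for recs in recommendation_lists for item in recs}
    return len(recommended & set(catalog)) / len(catalog)

def top_item_share(recommendation_lists):
    """Share of all recommendation slots taken by the single most common item."""
    counts = Counter(item for recs in recommendation_lists for item in recs)
    return counts.most_common(1)[0][1] / sum(counts.values())

catalog = ["bestseller_1", "bestseller_2", "niche_1", "niche_2", "niche_3"]
recs = [
    ["bestseller_1", "bestseller_2"],
    ["bestseller_1", "bestseller_2"],
    ["bestseller_1", "niche_1"],
]
print(catalog_coverage(recs, catalog))
print(top_item_share(recs))
```

Low coverage combined with a high top-item share suggests the system is mostly repeating bestsellers, whatever its accuracy score says.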

Key Expectations for Your Success

Independence

This is YOUR project. You drive the direction. We guide, but you decide.

Iteration

Expect to refine your approach multiple times. That's normal and encouraged.

Professional Communication

Treat this as a real consulting engagement. Quality matters.

Academic Integrity

Your work, properly cited sources. Plagiarism will not be tolerated.

Time Management

Start early on your literature review! Don't leave it until the last minute.

Growth Mindset

You'll encounter challenges. That's where learning happens.

Remember

This capstone is your opportunity to showcase everything you've learned and create something meaningful for your portfolio.

Resources and Support

Available Resources

Weekly Workshops

Case studies, skill building, and project development time

Learning Materials

  • Templates for each assessment
  • Exemplars from previous students
  • Guides for literature searching
  • Methodology resources

Facilitator Support

Office hours: Check learning portal for schedule

Email for questions

Feedback on drafts (with sufficient lead time)

Peer Support

Collaboration is encouraged (but submission is individual)

Discussion forums for questions

Peer review activities in class

Technical Resources

Access to software through student licenses

Public datasets and repositories

Technical guides for tools

Your Action Items This Week

Get Started Now

1. Industry Brainstorming

Start thinking about industries that interest you. What problems do you want to solve?

2. Review Previous Subjects

What skills from DATA4000-5000 can you leverage? Where are your strengths?

3. Read Assessment 1 Guidelines

Thoroughly review the requirements for the literature review

4. Browse Industry News

Read reports and articles for inspiration. What challenges are industries facing?

5. Prepare for Next Week

Come with 2-3 industry ideas to discuss

Week 2 Preview

Next Week We'll Cover:

Industry Selection

Strategies for choosing an industry

Narrowing your focus

Finding the sweet spot

Literature Search

Finding quality sources

Academic vs. industry literature

Organizing your research

Question Formulation

From broad problem to specific question

Making it answerable

Testing feasibility

Case Study

We'll work through another case study applying the 6-phase framework

Your Capstone Journey Starts Now

Make it meaningful

Make it impactful

Make it yours

Questions?
