Week 6: Advanced Classification Methods

Naïve Bayes, Support Vector Machines, and Gradient Boosting
DATA4800: Artificial Intelligence and Machine Learning

Learning Objectives

By the end of this workshop, you will be able to:

Understand and apply Naïve Bayes classification using probability theory
Implement Support Vector Machines for complex classification problems
Utilize Gradient Boosting for high-accuracy predictions
Compare classification methods and select appropriate algorithms for business problems
Evaluate model performance using appropriate metrics
Key Focus: Understanding when and why to use different classification methods in real-world business scenarios

Classification Methods Overview

What We'll Cover

Probabilistic Classification (Naïve Bayes)
Margin-Based Classification (SVM)
Ensemble Boosting Methods
Comparative Analysis

Business Applications

Email spam detection
Customer segmentation
Employee attrition prediction
Medical diagnosis
Real-World Context: Each method solves different types of business problems. Understanding their strengths and weaknesses helps you choose the right tool for your specific challenge.

Naïve Bayes Classification

What is Naïve Bayes?

A probabilistic classifier that uses Bayes' theorem to estimate how likely each class is given the observed features, then predicts the most probable class.

Business Analogy: Think of email spam filters. The algorithm learns from thousands of emails:
Emails with words like "FREE", "WINNER", "CLICK NOW" are usually spam
Emails from unknown senders with attachments are suspicious
When a new email arrives, it calculates the probability it's spam based on these learned patterns
Why "Naïve"? The algorithm assumes all features are independent of each other (which is rarely true in reality, but often works well in practice).

The Foundation: Conditional Probability

Naïve Bayes is built on a simple question: What is the probability of event A given that event B has occurred?

P(A|B) = P(B|A) × P(A) / P(B)

In Plain English

P(A|B): Probability of A given B happened
P(B|A): Probability of B given A happened
P(A): Overall probability of A
P(B): Overall probability of B

Email Spam Example

P(Spam|"FREE"): Probability email is spam given it contains "FREE"
P("FREE"|Spam): Probability spam contains "FREE"
P(Spam): Overall spam rate
P("FREE"): Overall frequency of "FREE"
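The formula can be checked with a few lines of Python. This is a minimal sketch: the function name bayes_posterior is an illustrative choice, and the counts are borrowed from the 1,000-email dataset used in the step-by-step calculation in this workshop.

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence


# Counts from the 1,000-email example: 300 spam, 700 legitimate;
# "FREE" appears in 240 spam emails and 70 legitimate emails.
p_spam = 300 / 1000                 # P(Spam)
p_free_given_spam = 240 / 300       # P("FREE"|Spam)
p_free = (240 + 70) / 1000          # P("FREE") across all emails

p_spam_given_free = bayes_posterior(p_free_given_spam, p_spam, p_free)
print(f"P(Spam | 'FREE') = {p_spam_given_free:.3f}")  # 0.774
```

So although only 30% of all emails are spam, an email containing "FREE" is spam about 77% of the time in this dataset.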

How Naïve Bayes Makes Decisions

Customer Purchase Prediction Example: Will a customer buy based on their browsing behavior?

Step-by-Step Calculation

Business Problem: Email Classification

Dataset: 1,000 emails (700 legitimate, 300 spam)

Feature               Spam Emails (300)   Legitimate (700)
Contains "FREE"       240 (80%)           70 (10%)
Contains "Meeting"    30 (10%)            490 (70%)
Unknown Sender        270 (90%)           140 (20%)
New Email: Contains "FREE" and from Unknown Sender

Question: Is this email spam or legitimate?

Calculating the Probabilities

Step 1: Calculate the priors:
• P(Spam) = 300/1000 = 0.30 (30% of emails are spam)
• P(Legitimate) = 700/1000 = 0.70
Step 2: Calculate P(Features|Spam):
• P("FREE"|Spam) = 240/300 = 0.80
• P(Unknown|Spam) = 270/300 = 0.90
• Combined: 0.80 × 0.90 = 0.72
Step 3: Calculate P(Features|Legitimate):
• P("FREE"|Legitimate) = 70/700 = 0.10
• P(Unknown|Legitimate) = 140/700 = 0.20
• Combined: 0.10 × 0.20 = 0.02
Step 4: Final Calculation (likelihood × prior):
• Spam Score: 0.72 × 0.30 = 0.216
• Legitimate Score: 0.02 × 0.70 = 0.014

Prediction: SPAM (0.216 > 0.014)
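The same calculation can be sketched in Python. The dictionary layout and the name class_scores are illustrative assumptions. Like the hand calculation, the sketch multiplies only the likelihoods of the features that are present. Dividing each score by their sum turns the scores into probabilities.

```python
from math import prod

# Priors and per-class likelihoods taken from the 1,000-email table.
priors = {"spam": 300 / 1000, "legitimate": 700 / 1000}
likelihoods = {
    "spam": {"free": 240 / 300, "meeting": 30 / 300, "unknown_sender": 270 / 300},
    "legitimate": {"free": 70 / 700, "meeting": 490 / 700, "unknown_sender": 140 / 700},
}


def class_scores(features):
    """Unnormalized Naive Bayes score per class: prior x product of likelihoods."""
    return {
        label: priors[label] * prod(likelihoods[label][f] for f in features)
        for label in priors
    }


scores = class_scores(["free", "unknown_sender"])
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
prediction = max(scores, key=scores.get)

print(scores)       # spam is about 0.216, legitimate about 0.014
print(f"P(spam | features) = {posteriors['spam']:.3f}")  # 0.939
print(prediction)   # spam
```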

Knowledge Check: Naïve Bayes

Question: Why is the Naïve Bayes classifier called "naïve"?

A) It only works with small datasets
B) It requires extensive feature engineering
C) It assumes all features are independent of each other
D) It cannot handle categorical variables

Naïve Bayes: When to Use It

Strengths

Fast training and prediction
Works well with small datasets
Handles high-dimensional data effectively
Provides probability estimates
Simple to implement and interpret

Limitations

Assumes feature independence (rarely true)
Sensitive to correlated (redundant) features, whose evidence gets counted more than once
Requires sufficient data per class
Cannot learn feature interactions
Zero probability problem with unseen features
Best Use Cases: Text classification (spam detection, sentiment analysis), medical diagnosis with independent symptoms, real-time prediction systems where speed is critical
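The zero probability problem listed above is commonly handled with additive (Laplace) smoothing. The sketch below is one way to write it: the function name, the word "crypto", and its count of zero are hypothetical, and n_values=2 assumes a binary present/absent feature.

```python
def smoothed_likelihood(count: int, class_total: int,
                        alpha: float = 1.0, n_values: int = 2) -> float:
    """Estimate P(feature | class) with additive (Laplace) smoothing.

    alpha is the pseudo-count added to every feature value;
    n_values is how many values the feature can take (2 = present/absent).
    """
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical: the word "crypto" never appeared in the 700 legitimate emails.
unsmoothed = 0 / 700                    # 0.0 would wipe out the whole product
smoothed = smoothed_likelihood(0, 700)  # small but non-zero

print(unsmoothed, round(smoothed, 5))   # 0.0 0.00142
```

Without smoothing, a single unseen word forces the Legitimate score to zero no matter what the other features say.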

Support Vector Machines (SVM)

What is SVM?

A classification method that finds the optimal boundary (hyperplane) between classes with the maximum margin of separation.

Business Analogy: Imagine you're a retail analyst separating customers into "likely to buy" vs "unlikely to buy":
You want the clearest possible boundary between the two groups
The boundary should have the widest "safety margin" to minimize errors
The customers closest to the boundary (support vectors) define where this line should be drawn
Key Concept: SVM doesn't just find any boundary—it finds the boundary with the maximum margin, making it more robust to new data.

Finding the Optimal Boundary

Customer Segmentation: Spending vs Visit Frequency

Support Vectors: The data points closest to the decision boundary, which determine where the boundary is placed. Removing any other point leaves the boundary unchanged; removing a support vector can move it.
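A one-feature toy case makes the role of support vectors concrete. This is a simplified sketch, not a full SVM: with a single feature and perfectly separable classes, the maximum-margin boundary is the midpoint between the two closest opposing points. The spend values are hypothetical.

```python
def max_margin_threshold(low_class, high_class):
    """Max-margin boundary for two separable 1-D classes.

    The boundary sits midway between the closest pair of opposing points
    (the support vectors); the margin is half the gap between them.
    """
    upper_support = max(low_class)    # closest "unlikely to buy" customer
    lower_support = min(high_class)   # closest "likely to buy" customer
    if upper_support >= lower_support:
        raise ValueError("classes overlap; no hard-margin boundary exists")
    boundary = (upper_support + lower_support) / 2
    margin = (lower_support - upper_support) / 2
    return boundary, margin


unlikely = [120, 200, 310, 400]   # hypothetical monthly spend
likely = [600, 750, 900, 1200]

print(max_margin_threshold(unlikely, likely))             # (500.0, 100.0)
# Dropping points far from the boundary changes nothing:
print(max_margin_threshold(unlikely[1:], likely[:-1]))    # (500.0, 100.0)
# Dropping a support vector (400) moves the boundary:
print(max_margin_threshold(unlikely[:-1], likely))        # (455.0, 145.0)
```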

The Kernel Trick: Handling Complex Patterns

When Simple Boundaries Don't Work

Sometimes data cannot be separated by a straight line in its original form.

The Problem

Real-world data often has complex, non-linear patterns
A straight line cannot separate circular or curved patterns
Example: Customer clusters based on multiple behaviors

The Solution

Transform data into higher dimensions
Find linear separation in new space
Project decision boundary back to original space
Business Analogy: Imagine trying to separate customers using only "age" and "income". Viewing the same customers along a third dimension might make the separation much clearer. The kernel trick achieves this mathematically: it behaves as if new dimensions had been derived from the existing features (for example, age × income) without ever computing them explicitly.
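The three steps of the solution can be illustrated on circular data. The sketch below computes the extra dimension explicitly for clarity (a kernel would avoid doing so), and the two rings of points are hypothetical.

```python
import math


def ring(radius, n):
    """n points evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


inner = ring(1.0, 8)   # class A: surrounded by class B, so no straight line separates them
outer = ring(3.0, 8)   # class B


def lift(point):
    """Map (x, y) to (x, y, x^2 + y^2): the new axis is squared distance from the origin."""
    x, y = point
    return (x, y, x * x + y * y)


inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]

# In the lifted space a flat plane z = threshold separates the classes.
threshold = (max(inner_z) + min(outer_z)) / 2
separable = all(z < threshold for z in inner_z) and all(z > threshold for z in outer_z)
print(round(threshold, 3), separable)  # 5.0 True
```

Projected back to the original two dimensions, the plane z = 5 becomes the circle x² + y² = 5.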

Kernel Transformation in Action

[Animation: data that cannot be separated linearly becomes separable after mapping to a higher dimension]

Common Kernels: Linear (straight line), Polynomial (curves), RBF/Gaussian (complex patterns), Sigmoid (S-shaped boundaries)
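The kernels named above can be written as small functions. This is a sketch in plain Python using common textbook forms; the default values for gamma, degree, and coef0 are illustrative assumptions, not recommendations. The last lines check that a degree-2 polynomial kernel gives the same number as a dot product in an explicitly constructed 3-D feature space, which is the essence of the kernel trick.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, coef0=0.0):
    return (dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def explicit_degree2_map(p):
    """Feature map whose dot product equals the degree-2 polynomial kernel (coef0=0)."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)


a, b = (1.0, 2.0), (3.0, 0.5)
via_kernel = polynomial_kernel(a, b)                                    # stays in 2-D
via_mapping = dot(explicit_degree2_map(a), explicit_degree2_map(b))     # computed in 3-D
print(via_kernel, round(via_mapping, 9))  # 16.0 16.0
```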

Knowledge Check: Support Vector Machines

Question: What is the primary objective of Support Vector Machines?

A) Minimize the number of support vectors
B) Find the decision boundary with the maximum margin between classes
C) Classify data points as quickly as possible
D) Reduce the dimensionality of the feature space

SVM Business Applications

Real-World Use Cases

Medical Diagnosis: Classifying patients as high-risk or low-risk based on multiple health indicators where clear separation is crucial for treatment decisions.
Customer Segmentation: Identifying premium customers from regular customers using purchase patterns, demographics, and engagement metrics when boundaries are complex.
Fraud Detection: Separating legitimate from fraudulent transactions using transaction features where the cost of misclassification is high.
When to Choose SVM: Use when you need high accuracy with complex non-linear boundaries, have moderate-sized datasets, and can afford longer training times for better performance.

SVM: When to Use It

Strengths

Effective in high-dimensional spaces
Works well with clear margin of separation
Handles non-linear boundaries via kernels
Memory efficient (only uses support vectors)
Robust against overfitting in high dimensions

Limitations

Slow training time on large datasets
Requires careful parameter tuning
No probability estimates by default
Sensitive to feature scaling
Difficult to interpret with complex kernels
Best Use Cases: Binary classification with clear but complex boundaries, high-dimensional data (text, genomics), problems where accuracy is more important than speed
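Because SVM works with distances between points, a feature measured in large units can dominate one measured in small units. A minimal sketch of z-score standardization follows; the ages and incomes are hypothetical, and in practice the mean and standard deviation would be computed on the training data only.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [25, 35, 45, 55]                      # hypothetical, in years
incomes = [30_000, 50_000, 70_000, 90_000]   # hypothetical, in dollars

scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)

# Before scaling, income differences (tens of thousands) swamp age differences (tens).
# After scaling, both features span the same range.
print([round(v, 3) for v in scaled_ages])
print([round(v, 3) for v in scaled_incomes])
```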

Gradient Boosting

What is Gradient Boosting?

An ensemble method that builds multiple weak learners sequentially, where each new model focuses on correcting the errors of previous models.

Business Analogy: Think of a team of analysts predicting employee attrition:
Analyst 1 makes initial predictions (70% accuracy)
Analyst 2 focuses only on the cases Analyst 1 got wrong
Analyst 3 corrects remaining errors from both previous analysts
Final prediction combines all analysts' insights, achieving 95% accuracy
Key Concept: Each new model is trained to predict the errors (residuals) of the combined previous models, gradually improving overall accuracy.
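The residual-fitting idea can be sketched from scratch. To keep it short, the example below makes several simplifying assumptions: it predicts a numeric risk score with squared error (so the errors are plain residuals), uses one-split trees ("stumps") as weak learners, and runs on a tiny hypothetical dataset. Library implementations are considerably more elaborate.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best predicts the residuals (least squares)."""
    best = None
    unique_xs = sorted(set(xs))
    for lo, hi in zip(unique_xs, unique_xs[1:]):
        split = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps, mse_history = [], []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, mse_history


tenure_years = [1, 2, 3, 4, 5, 6, 7, 8]                      # hypothetical
attrition_risk = [0.9, 0.8, 0.75, 0.5, 0.4, 0.35, 0.2, 0.1]  # hypothetical
predict, mse_history = gradient_boost(tenure_years, attrition_risk)
print(f"MSE after round 1: {mse_history[0]:.4f}, after round 20: {mse_history[-1]:.4f}")
```

Each round removes part of the remaining error, so the training error shrinks as stumps are added.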

Sequential Error Correction

Building Models Step by Step

Boosting vs Bagging: Key Differences

Aspect             Random Forest (Bagging)              Gradient Boosting
Training           Parallel (independent trees)         Sequential (each tree learns from previous errors)
Focus              Reduce variance through averaging    Reduce bias by correcting errors
Speed              Fast (can parallelize)               Slower (sequential process)
Accuracy           Good                                 Typically higher
Overfitting Risk   Lower                                Higher (needs careful tuning)
Interpretability   Moderate                             Lower

Knowledge Check: Gradient Boosting

Question: How does Gradient Boosting differ from Random Forest in building models?

A) It uses more trees than Random Forest
B) It builds trees sequentially, with each tree correcting previous errors
C) It only works with numerical features
D) It trains all trees simultaneously in parallel

Key Parameters in Gradient Boosting

Number of Trees (n_estimators): More trees generally improve accuracy but increase training time and risk overfitting
Learning Rate: Controls how much each tree contributes to the final prediction. Lower rates need more trees but often perform better
Max Depth: Maximum depth of each tree. Deeper trees can capture complex patterns but risk overfitting
Subsample: Fraction of samples used for training each tree. Values < 1.0 introduce randomness, which helps reduce overfitting
Practical Tip: Start with learning_rate=0.1, n_estimators=100, max_depth=3. Monitor validation performance and adjust gradually. Lower learning rates with more trees often yield best results.
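The starting values in the tip map directly onto scikit-learn's GradientBoostingClassifier. The sketch below assumes scikit-learn is installed and uses a synthetic dataset from make_classification in place of real business data; subsample=0.8 and the random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = GradientBoostingClassifier(
    n_estimators=100,    # number of trees
    learning_rate=0.1,   # contribution of each tree
    max_depth=3,         # depth of each tree
    subsample=0.8,       # fraction of rows sampled per tree
    random_state=42,
)
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)

# staged_predict yields validation predictions after each additional tree,
# which shows how many trees were actually needed.
staged_accuracy = [(pred == y_val).mean() for pred in model.staged_predict(X_val)]
best_n_trees = staged_accuracy.index(max(staged_accuracy)) + 1
print(f"Validation accuracy: {val_accuracy:.3f}; best tree count: {best_n_trees}")
```

If the best tree count is far below n_estimators, the extra trees add training time without improving validation accuracy.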

Gradient Boosting Business Applications

Employee Attrition Prediction: Combining multiple factors (salary, performance, tenure, department) to predict which employees are likely to leave, achieving high accuracy for proactive retention strategies.
Credit Risk Assessment: Evaluating loan applicants using financial history, employment data, and behavioral patterns where prediction accuracy directly impacts financial risk.
Customer Lifetime Value: Predicting long-term customer value based on purchase patterns, engagement metrics, and demographics for targeted marketing strategies.
Industry Standard: Gradient Boosting (especially XGBoost and LightGBM implementations) frequently wins machine learning competitions and is widely used in industry for high-stakes predictions.

Gradient Boosting: When to Use It

Strengths

Often provides highest accuracy
Handles mixed data types well
Automatically captures feature interactions
Handles missing values natively in many implementations (e.g., XGBoost, LightGBM)
Provides feature importance rankings
Robust to outliers in feature values (tree splits depend on ordering, not magnitude)

Limitations

Longer training time
Requires careful hyperparameter tuning
Risk of overfitting with too many trees
Less interpretable than single trees
Trees are built one after another, so training parallelizes less easily than Random Forest
Sensitive to noisy labels (later trees can chase mislabeled examples)
Best Use Cases: High-stakes predictions where accuracy is paramount, structured/tabular data, Kaggle competitions, situations where you have time for proper hyperparameter tuning

Method Comparison Overview

Criteria                 Naïve Bayes       SVM                  Gradient Boosting
Training Speed           Very Fast         Slow                 Moderate
Prediction Speed         Very Fast         Fast                 Moderate
Typical Accuracy         Good              Very Good            Excellent
Interpretability         High              Low                  Moderate
Handles Non-linearity    No                Yes (with kernels)   Yes
Dataset Size             Small to Medium   Small to Medium      Medium to Large

Choosing the Right Method

General Guidelines:
Need fast results with limited data? → Naïve Bayes
Complex boundaries with moderate data? → SVM
Maximum accuracy on structured data? → Gradient Boosting
Need interpretability? → Naïve Bayes or Random Forest
Real-time predictions at scale? → Naïve Bayes or pre-trained SVM

Knowledge Check: Method Selection

Question: You need to build a real-time spam filter that processes millions of emails per day. Which method would be most appropriate?

A) Naïve Bayes - fast training and prediction with good accuracy for text
B) SVM - highest accuracy regardless of speed
C) Gradient Boosting - best overall performance
D) All methods would work equally well

Evaluating Classification Performance

Key Metrics for Model Evaluation

Accuracy: Overall correct predictions / total predictions
Precision: True positives / (True positives + False positives)
Recall: True positives / (True positives + False negatives)
F1-Score: Harmonic mean of precision and recall
When to use Accuracy: Balanced datasets
When to use Precision: False positives are costly
When to use Recall: False negatives are costly
When to use F1: Balance both concerns
Business Example: In fraud detection, missing actual fraud (false negative) is worse than flagging legitimate transactions (false positive). Therefore, prioritize Recall over Precision.

Understanding the Confusion Matrix

Example counts (120 predictions): TP = 85, FP = 15, FN = 10, TN = 10

Metric      Formula                                           Example Value
Accuracy    (TP + TN) / Total                                 (85 + 10) / 120 = 79.2%
Precision   TP / (TP + FP)                                    85 / (85 + 15) = 85.0%
Recall      TP / (TP + FN)                                    85 / (85 + 10) = 89.5%
F1-Score    2 × (Precision × Recall) / (Precision + Recall)   2 × (0.850 × 0.895) / (0.850 + 0.895) = 87.2%
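The four metrics can be computed from confusion-matrix counts in a few lines. In this sketch the counts TP = 85, FP = 15, FN = 10, TN = 10 are inferred from the example values in the table (they are the only counts consistent with a total of 120), and the function name is an illustrative choice.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = classification_metrics(tp=85, fp=15, fn=10, tn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
# accuracy: 79.2%, precision: 85.0%, recall: 89.5%, f1: 87.2%
```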

Best Practices for Implementation

Before Building Models

Understand your business problem and cost of errors
Examine class distribution (balanced vs imbalanced)
Perform exploratory data analysis
Handle missing values appropriately
Scale/normalize features (especially for SVM)

During Model Development

Always use train/test split or cross-validation
Start simple, then increase complexity
Monitor for overfitting vs underfitting
Compare multiple algorithms
Tune hyperparameters systematically
Remember: The best algorithm is the one that solves your specific business problem effectively, not necessarily the most complex one.
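Cross-validation, mentioned above, rotates which part of the data is held out for testing. The sketch below shows only the index bookkeeping, with a hypothetical helper name; it does not shuffle or stratify, which real projects usually should.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for k consecutive, roughly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in the test set exactly once, so each model is always scored on data it did not see during training.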

Knowledge Check: Comprehensive Review

Question: A healthcare company needs to predict patient readmission risk. They have a medium-sized dataset with complex feature interactions, and false negatives (missing high-risk patients) are very costly. Which method would be most appropriate?

A) Naïve Bayes - simple and fast
B) SVM with linear kernel - good for moderate datasets
C) Gradient Boosting - handles complex interactions, high accuracy, can optimize for recall
D) Random Forest - faster than Gradient Boosting

Case Study: Employee Attrition Prediction

Business Context

A company wants to predict which employees are likely to leave to implement proactive retention strategies.

Method              Accuracy   Precision   Recall   Training Time
Naïve Bayes         76%        71%         68%      2 seconds
SVM (RBF kernel)    84%        82%         79%      45 seconds
Gradient Boosting   89%        87%         86%      28 seconds
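Decision: Gradient Boosting was selected because it achieved the highest recall. Compared with Naïve Bayes, recall rises from 68% to 86% (18 percentage points more at-risk employees caught), which justifies the extra 26 seconds of training time. It also beats SVM by 7 points of recall while training 17 seconds faster.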

Today's Workshop Activities

Hands-On Practice

Activity 1: Naïve Bayes Email Classification
Build a spam filter using Naïve Bayes
Calculate probabilities manually
Implement in Orange Data Mining
Activity 2: SVM Customer Segmentation
Classify customer purchase likelihood
Experiment with different kernels
Visualize decision boundaries
Activity 3: Model Comparison Challenge
Apply all three methods to employee attrition dataset
Compare performance metrics
Recommend best approach with justification

Key Takeaways

Naïve Bayes

Use when: You need fast, simple classification with independent features (text classification, real-time systems)
Remember: Based on probability, assumes feature independence

Support Vector Machines

Use when: You have complex non-linear boundaries and can afford training time
Remember: Finds maximum margin boundary, uses kernel trick for complex patterns

Gradient Boosting

Use when: You need highest accuracy on structured data and have time for tuning
Remember: Sequential error correction, requires careful parameter tuning
Next Steps: Practice with real datasets, experiment with different parameters, and always validate your results with appropriate metrics!