Week 6: Advanced Classification Methods

Naïve Bayes, Support Vector Machines, and Gradient Boosting

DATA4800: Artificial Intelligence and Machine Learning

Learning Objectives

By the end of this workshop, you will be able to:

Understand and apply Naïve Bayes classification using probability theory

Implement Support Vector Machines for complex classification problems

Utilize Gradient Boosting for high-accuracy predictions

Compare classification methods and select appropriate algorithms for business problems

Evaluate model performance using appropriate metrics

                    Key Focus: Understanding when and why to use different classification methods in real-world business scenarios
                

Classification Methods Overview

What We'll Cover

Probabilistic Classification (Naïve Bayes)

Margin-Based Classification (SVM)

Ensemble Boosting Methods

Comparative Analysis

Business Applications

Email spam detection

Customer segmentation

Employee attrition prediction

Medical diagnosis

Real-World Context: Each method solves different types of business problems. Understanding their strengths and weaknesses helps you choose the right tool for your specific challenge.

Naïve Bayes Classification

What is Naïve Bayes?

A probabilistic classifier that uses Bayes' theorem to estimate how likely each class is given the observed features, then predicts the most probable class.

Business Analogy: Think of email spam filters. The algorithm learns from thousands of emails:

Emails with words like "FREE", "WINNER", "CLICK NOW" are usually spam

Emails from unknown senders with attachments are suspicious

When a new email arrives, it calculates the probability it's spam based on these learned patterns

                    Why "Naïve"? The algorithm assumes all features are independent of each other (which is rarely true in reality, but often works well in practice).
                

The Foundation: Conditional Probability

Naïve Bayes is built on a simple question: What is the probability of event A given that event B has occurred?

P(A|B) = P(B|A) × P(A) / P(B)

In Plain English

P(A|B): Probability of A given B happened

P(B|A): Probability of B given A happened

P(A): Overall probability of A

P(B): Overall probability of B

Email Spam Example

P(Spam|"FREE"): Probability email is spam given it contains "FREE"

P("FREE"|Spam): Probability spam contains "FREE"

P(Spam): Overall spam rate

P("FREE"): Overall frequency of "FREE"
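As a minimal sketch, the formula can be applied to the single word "FREE" using the counts from the 1,000-email dataset in the worked example that follows (300 spam, 700 legitimate). The function name `bayes_posterior` is illustrative, not part of any library.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence

# Counts taken from the workshop's email dataset.
p_spam = 300 / 1000              # P(Spam): overall spam rate
p_free_given_spam = 240 / 300    # P("FREE"|Spam)
p_free = (240 + 70) / 1000       # P("FREE"): all emails containing "FREE"

p_spam_given_free = bayes_posterior(p_free_given_spam, p_spam, p_free)
print(f'P(Spam|"FREE") = {p_spam_given_free:.3f}')
```

Seeing "FREE" raises the probability of spam from the 30% base rate to roughly 77%.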

How Naïve Bayes Makes Decisions

Customer Purchase Prediction Example: Will a customer buy based on their browsing behavior?

Step-by-Step Calculation

Business Problem: Email Classification

Dataset: 1,000 emails (700 legitimate, 300 spam)

Feature	Spam Emails (300)	Legitimate (700)
Contains "FREE"	240 (80%)	70 (10%)
Contains "Meeting"	30 (10%)	490 (70%)
Unknown Sender	270 (90%)	140 (20%)

New Email: Contains "FREE" and from Unknown Sender

Question: Is this email spam or legitimate?

Calculating the Probabilities

Step 1: Calculate the prior probabilities:

• P(Spam) = 300/1000 = 0.30 (30% of emails are spam)

• P(Legitimate) = 700/1000 = 0.70

Step 2: Calculate P(Features|Spam):

• P("FREE"|Spam) = 240/300 = 0.80

• P(Unknown|Spam) = 270/300 = 0.90

• Combined: 0.80 × 0.90 = 0.72

Step 3: Calculate P(Features|Legitimate):

• P("FREE"|Legitimate) = 70/700 = 0.10

• P(Unknown|Legitimate) = 140/700 = 0.20

• Combined: 0.10 × 0.20 = 0.02

Step 4: Final Calculation:

• Spam Score: 0.72 × 0.30 = 0.216

• Legitimate Score: 0.02 × 0.70 = 0.014

• Normalized: P(Spam|Features) = 0.216 / (0.216 + 0.014) ≈ 0.94

Prediction: SPAM (0.216 > 0.014)
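The four steps can be reproduced in a few lines of Python. This is a sketch under the same simplification as the manual calculation: only the features that are present in the new email are multiplied in, and the absence of "Meeting" is ignored. The dictionary layout and function name are illustrative choices.

```python
from math import prod

# Class counts and per-class feature counts from the email table.
class_counts = {"spam": 300, "legitimate": 700}
feature_counts = {
    "spam": {"free": 240, "meeting": 30, "unknown_sender": 270},
    "legitimate": {"free": 70, "meeting": 490, "unknown_sender": 140},
}

def naive_bayes_scores(observed_features):
    """Return the unnormalized score P(features|class) * P(class) for each class."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        prior = n / total
        # Naive assumption: features are independent, so likelihoods multiply.
        likelihood = prod(feature_counts[label][f] / n for f in observed_features)
        scores[label] = prior * likelihood
    return scores

scores = naive_bayes_scores(["free", "unknown_sender"])
evidence = sum(scores.values())
posteriors = {label: s / evidence for label, s in scores.items()}
prediction = max(scores, key=scores.get)

print(scores)       # unnormalized scores per class
print(posteriors)   # scores rescaled to sum to 1
print(prediction)
```

Dividing each score by their sum turns the scores into probabilities without changing which class wins.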

Knowledge Check: Naïve Bayes

Question: Why is the Naïve Bayes classifier called "naïve"?

A) It only works with small datasets

B) It requires extensive feature engineering

C) It assumes all features are independent of each other

D) It cannot handle categorical variables

Naïve Bayes: When to Use It

Strengths

Fast training and prediction

Works well with small datasets

Handles high-dimensional data effectively

Provides probability estimates

Simple to implement and interpret

Limitations

Assumes feature independence (rarely true)

Sensitive to correlated features (their evidence is counted more than once)

Requires sufficient data per class

Cannot learn feature interactions

Zero probability problem with unseen features

Best Use Cases: Text classification (spam detection, sentiment analysis), medical diagnosis with independent symptoms, real-time prediction systems where speed is critical
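The zero probability problem listed under Limitations is commonly handled with additive (Laplace) smoothing: a small constant is added to every count so that an unseen feature does not force the whole product to zero. The sketch below assumes a binary present/absent feature; the function name and the default `alpha=1.0` are illustrative.

```python
def smoothed_probability(feature_count, class_count, alpha=1.0, n_values=2):
    """Estimate P(feature|class) with additive (Laplace) smoothing.

    n_values is the number of values the feature can take
    (2 for a present/absent feature).
    """
    return (feature_count + alpha) / (class_count + alpha * n_values)

# Hypothetical word that never appeared in the 300 spam training emails.
unsmoothed = 0 / 300
smoothed = smoothed_probability(0, 300)

# A frequently seen feature is barely affected by smoothing.
free_smoothed = smoothed_probability(240, 300)

print(unsmoothed, smoothed, free_smoothed)
```

Without smoothing, a single unseen word would set the spam score to zero regardless of all other evidence.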

Support Vector Machines (SVM)

What is SVM?

A classification method that finds the optimal boundary (hyperplane) between classes with the maximum margin of separation.

Business Analogy: Imagine you're a retail analyst separating customers into "likely to buy" vs "unlikely to buy":

You want the clearest possible boundary between the two groups

The boundary should have the widest "safety margin" to minimize errors

The customers closest to the boundary (support vectors) define where this line should be drawn

                    Key Concept: SVM doesn't just find any boundary—it finds the boundary with the maximum margin, making it more robust to new data.
                

Finding the Optimal Boundary

Customer Segmentation: Spending vs Visit Frequency

Support Vectors: The data points closest to the decision boundary, which determine where the boundary is placed. Removing any of the other points would leave the boundary unchanged, but removing a support vector would move it.
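To make the margin idea concrete, the sketch below measures how far each point lies from a fixed linear boundary and picks out the closest ones. All data are hypothetical, and the boundary (`w`, `b`) is chosen by hand as the perpendicular bisector of the two closest opposite-class points; in practice an SVM solver would find it.

```python
from math import hypot

# Hypothetical boundary w . x + b = 0 in (spending, visits) space.
w = (1.0, 1.0)
b = -6.0

# Hypothetical customers: ((spending, visits), label), +1 = buyer, -1 = non-buyer.
customers = [
    ((1.0, 2.0), -1),
    ((2.0, 2.0), -1),
    ((2.0, 1.0), -1),
    ((4.0, 4.0), +1),
    ((5.0, 4.0), +1),
    ((5.0, 6.0), +1),
]

def signed_distance(point):
    """Signed geometric distance from a point to the boundary."""
    return (w[0] * point[0] + w[1] * point[1] + b) / hypot(*w)

def margin_of(points):
    """Smallest distance from any correctly classified point to the boundary."""
    return min(label * signed_distance(p) for p, label in points)

margin = margin_of(customers)
closest = [p for p, label in customers
           if abs(label * signed_distance(p) - margin) < 1e-9]

# Dropping a point far from the boundary leaves the margin unchanged.
without_far_point = [c for c in customers if c[0] != (5.0, 6.0)]
margin_without_far = margin_of(without_far_point)

print(margin, closest, margin_without_far)
```

The two points in `closest` play the role of support vectors here: they alone fix the width of the margin.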
                

The Kernel Trick: Handling Complex Patterns

When Simple Boundaries Don't Work

Sometimes data cannot be separated by a straight line in its original form.

The Problem

Real-world data often has complex, non-linear patterns

A straight line cannot separate circular or curved patterns

Example: Customer clusters based on multiple behaviors

The Solution

Transform data into higher dimensions

Find linear separation in new space

Project decision boundary back to original space

Business Analogy: Imagine trying to separate customers using only "age" and "income". Adding a third dimension derived from those two, such as their product or their squares, might make the separation much clearer. The kernel trick achieves the effect of such derived dimensions mathematically, without explicitly creating the new features.

Kernel Transformation in Action

Data that cannot be separated linearly in its original space can become separable in higher dimensions.

                    Common Kernels: Linear (straight line), Polynomial (curves), RBF/Gaussian (complex patterns), Sigmoid (S-shaped boundaries)
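The sketch below illustrates the kernel trick with a degree-2 polynomial kernel, under the assumption of 2-D inputs. It shows that the kernel value computed in the original space equals a dot product in a 3-D feature space, and that a circular pattern becomes linearly separable there. The points and the helper names (`phi`, `linear_rule`) are hypothetical.

```python
from math import sqrt

def phi(x):
    """Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # dot product computed in the 3-D feature space
implicit = poly_kernel(x, y)     # same value, computed from the 2-D points only

# Hypothetical circular pattern: inner points vs. outer points.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

def linear_rule(z):
    """A linear function of the mapped features: z1 + z3 - 1."""
    return z[0] + z[2] - 1.0

inner_side = [linear_rule(phi(p)) < 0 for p in inner]
outer_side = [linear_rule(phi(p)) > 0 for p in outer]

print(explicit, implicit, inner_side, outer_side)
```

No straight line separates `inner` from `outer` in the original plane, yet a single linear rule does so after the mapping, and the kernel obtains the needed dot products without ever building `phi(x)`.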
                

Knowledge Check: Support Vector Machines

Question: What is the primary objective of Support Vector Machines?

A) Minimize the number of support vectors

B) Find the decision boundary with the maximum margin between classes

C) Classify data points as quickly as possible

D) Reduce the dimensionality of the feature space

SVM Business Applications

Real-World Use Cases

Medical Diagnosis: Classifying patients as high-risk or low-risk based on multiple health indicators where clear separation is crucial for treatment decisions.

Customer Segmentation: Identifying premium customers from regular customers using purchase patterns, demographics, and engagement metrics when boundaries are complex.

Fraud Detection: Separating legitimate from fraudulent transactions using transaction features where the cost of misclassification is high.

                    When to Choose SVM: Use when you need high accuracy with complex non-linear boundaries, have moderate-sized datasets, and can afford longer training times for better performance.
                

SVM: When to Use It

Strengths

Effective in high-dimensional spaces

Works well with clear margin of separation

Handles non-linear boundaries via kernels

Memory efficient (only uses support vectors)

Robust against overfitting in high dimensions

Limitations

Slow training time on large datasets

Requires careful parameter tuning

No probability estimates by default

Sensitive to feature scaling

Difficult to interpret with complex kernels

Best Use Cases: Binary classification with clear but complex boundaries, high-dimensional data (text, genomics), problems where accuracy is more important than speed

Gradient Boosting

What is Gradient Boosting?

An ensemble method that builds multiple weak learners sequentially, where each new model focuses on correcting the errors of previous models.

Business Analogy: Think of a team of analysts predicting employee attrition:

Analyst 1 makes initial predictions (70% accuracy)

Analyst 2 focuses only on the cases Analyst 1 got wrong

Analyst 3 corrects remaining errors from both previous analysts

Final prediction combines all analysts' insights, achieving 95% accuracy

                    Key Concept: Each new model is trained to predict the errors (residuals) of the combined previous models, gradually improving overall accuracy.
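A minimal from-scratch sketch of this idea follows. It assumes a regression setting with squared error, one input feature, and one-split trees (stumps) as weak learners; classification boosters follow the same loop but fit gradients of a classification loss instead of plain residuals. The data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump minimizing squared error on the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Return (predict function, training MSE after each stage)."""
    base = sum(ys) / len(ys)              # stage 0: predict the mean
    stumps = []
    predictions = [base] * len(xs)
    history = []
    for _ in range(n_estimators):
        # Each new stump is trained on the errors of the current ensemble.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

# Hypothetical data: years of tenure vs. a satisfaction score.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9, 5.1, 4.8]
predict, history = gradient_boost(xs, ys)
print(history[0], history[-1])
```

The training error in `history` shrinks stage by stage, which is the sequential error correction described above.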
                

Sequential Error Correction

Building Models Step by Step

Boosting vs Bagging: Key Differences

Aspect	Random Forest (Bagging)	Gradient Boosting
Training	Parallel (independent trees)	Sequential (each tree learns from previous errors)
Focus	Reduce variance through averaging	Reduce bias by correcting errors
Speed	Fast (can parallelize)	Slower (sequential process)
Accuracy	Good	Typically higher
Overfitting Risk	Lower	Higher (needs careful tuning)
Interpretability	Moderate	Lower

Knowledge Check: Gradient Boosting

Question: How does Gradient Boosting differ from Random Forest in building models?

A) It uses more trees than Random Forest

B) It builds trees sequentially, with each tree correcting previous errors

C) It only works with numerical features

D) It trains all trees simultaneously in parallel

Key Parameters in Gradient Boosting

Number of Trees (n_estimators): More trees generally improve accuracy but increase training time and risk overfitting

Learning Rate: Controls how much each tree contributes to the final prediction. Lower rates need more trees but often perform better

Max Depth: Maximum depth of each tree. Deeper trees can capture complex patterns but risk overfitting

Subsample: Fraction of samples used for training each tree. Values < 1.0 introduce randomness and help reduce overfitting

                    Practical Tip: Start with learning_rate=0.1, n_estimators=100, max_depth=3. Monitor validation performance and adjust gradually. Lower learning rates with more trees often yield best results.
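A rough way to see why lower learning rates need more trees: assume, as an idealization, that each tree removes exactly a `learning_rate` fraction of whatever error is left. Real trees do not fit the residuals perfectly, so the counts below are illustrative only, and the function name is hypothetical.

```python
from math import ceil, log

def stages_needed(learning_rate, remaining_fraction=0.01):
    """Stages needed to shrink the residual to remaining_fraction.

    Idealized assumption: every stage removes exactly learning_rate of the
    residual that is left, so the residual after n stages is (1 - lr) ** n.
    Requires 0 < learning_rate < 1.
    """
    return ceil(log(remaining_fraction) / log(1 - learning_rate))

for lr in (0.3, 0.1, 0.01):
    print(lr, stages_needed(lr))
```

Under this simplification, cutting the learning rate by a factor of ten raises the number of trees needed roughly tenfold, which is why `learning_rate` and `n_estimators` are usually tuned together.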
                

Gradient Boosting Business Applications

Employee Attrition Prediction: Combining multiple factors (salary, performance, tenure, department) to predict which employees are likely to leave, achieving high accuracy for proactive retention strategies.

Credit Risk Assessment: Evaluating loan applicants using financial history, employment data, and behavioral patterns where prediction accuracy directly impacts financial risk.

Customer Lifetime Value: Predicting long-term customer value based on purchase patterns, engagement metrics, and demographics for targeted marketing strategies.

                    Industry Standard: Gradient Boosting (especially XGBoost and LightGBM implementations) frequently wins machine learning competitions and is widely used in industry for high-stakes predictions.
                

Gradient Boosting: When to Use It

Strengths

Often provides highest accuracy

Handles mixed data types well

Automatically captures feature interactions

Handles missing values natively in implementations such as XGBoost and LightGBM

Provides feature importance rankings

Robust to outliers in input features

Limitations

Longer training time

Requires careful hyperparameter tuning

Risk of overfitting with too many trees

Less interpretable than single trees

Trees must be built one after another (implementations such as XGBoost and LightGBM parallelize the work within each tree)

Sensitive to noisy labels, since later trees keep trying to fit mislabeled cases

Best Use Cases: High-stakes predictions where accuracy is paramount, structured/tabular data, Kaggle competitions, situations where you have time for proper hyperparameter tuning

Method Comparison Overview

Criteria	Naïve Bayes	SVM	Gradient Boosting
Training Speed	Very Fast	Slow	Moderate
Prediction Speed	Very Fast	Fast	Moderate
Typical Accuracy	Good	Very Good	Excellent
Interpretability	High	Low	Moderate
Handles Non-linearity	No	Yes (with kernels)	Yes
Dataset Size	Small to Medium	Small to Medium	Medium to Large

Choosing the Right Method

General Guidelines:

Need fast results with limited data? → Naïve Bayes

Complex boundaries with moderate data? → SVM

Maximum accuracy on structured data? → Gradient Boosting

Need interpretability? → Naïve Bayes or Random Forest

Real-time predictions at scale? → Naïve Bayes or pre-trained SVM

Knowledge Check: Method Selection

Question: You need to build a real-time spam filter that processes millions of emails per day. Which method would be most appropriate?

A) Naïve Bayes - fast training and prediction with good accuracy for text

B) SVM - highest accuracy regardless of speed

C) Gradient Boosting - best overall performance

D) All methods would work equally well

Evaluating Classification Performance

Key Metrics for Model Evaluation

Accuracy: Overall correct predictions / total predictions

Precision: True positives / (True positives + False positives)

Recall: True positives / (True positives + False negatives)

F1-Score: Harmonic mean of precision and recall

When to use Accuracy: Balanced datasets

When to use Precision: False positives are costly

When to use Recall: False negatives are costly

When to use F1: Balance both concerns

Business Example: In fraud detection, missing actual fraud (false negative) is worse than flagging legitimate transactions (false positive). Therefore, prioritize Recall over Precision.

Understanding the Confusion Matrix

Example counts: TP = 85, TN = 10, FP = 15, FN = 10 (120 predictions in total)

Metric	Formula	Example Value
Accuracy	(TP + TN) / Total	(85 + 10) / 120 = 79.2%
Precision	TP / (TP + FP)	85 / (85 + 15) = 85.0%
Recall	TP / (TP + FN)	85 / (85 + 10) = 89.5%
F1-Score	2 × (Precision × Recall) / (Precision + Recall)	87.2%
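The four formulas can be checked with a short function. The counts used here (TP = 85, TN = 10, FP = 15, FN = 10) are the ones implied by the example values in the table; the function name is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = classification_metrics(tp=85, tn=10, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Note that accuracy (79.2%) is the weakest of the four numbers here because only 10 of the 25 actual negatives were classified correctly, something precision and recall alone do not reveal.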

Best Practices for Implementation

Before Building Models

Understand your business problem and cost of errors

Examine class distribution (balanced vs imbalanced)

Perform exploratory data analysis

Handle missing values appropriately

Scale/normalize features (especially for SVM)
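Feature scaling, the last item above, matters for SVM because distances are dominated by whichever feature has the largest numeric range. The sketch below standardizes each column to mean 0 and standard deviation 1; the income and age values are hypothetical. In a real project the mean and standard deviation would be computed on the training split only and then reused for the test split.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a list of numbers to mean 0 and (population) std 1."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]   # constant column carries no information
    return [(v - mu) / sigma for v in column]

# Hypothetical raw features on very different scales.
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
ages = [22, 30, 38, 46, 54]

scaled_incomes = standardize(incomes)
scaled_ages = standardize(ages)

print([round(v, 3) for v in scaled_incomes])
print([round(v, 3) for v in scaled_ages])
```

After scaling, both columns span the same range, so neither dominates the distance calculations.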

During Model Development

Always use train/test split or cross-validation

Start simple, then increase complexity

Monitor for overfitting vs underfitting

Compare multiple algorithms

Tune hyperparameters systematically
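Cross-validation, mentioned in the first item of this list, can be sketched as follows: the data are shuffled once and cut into k folds, and each fold serves as the test set exactly once. This is a plain (unstratified) version with hypothetical names; for imbalanced classes a stratified split is usually preferable.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the k scores averaged.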

                    Remember: The best algorithm is the one that solves your specific business problem effectively, not necessarily the most complex one.
                

Knowledge Check: Comprehensive Review

Question: A healthcare company needs to predict patient readmission risk. They have a medium-sized dataset with complex feature interactions, and false negatives (missing high-risk patients) are very costly. Which method would be most appropriate?

A) Naïve Bayes - simple and fast

B) SVM with linear kernel - good for moderate datasets

C) Gradient Boosting - handles complex interactions, high accuracy, can optimize for recall

D) Random Forest - faster than Gradient Boosting

Case Study: Employee Attrition Prediction

Business Context

A company wants to predict which employees are likely to leave to implement proactive retention strategies.

Method	Accuracy	Precision	Recall	Training Time
Naïve Bayes	76%	71%	68%	2 seconds
SVM (RBF kernel)	84%	82%	79%	45 seconds
Gradient Boosting	89%	87%	86%	28 seconds

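Decision: Gradient Boosting was selected because it has the highest recall (86%, versus 79% for SVM and 68% for Naïve Bayes), so it catches the most at-risk employees. It also trains faster than the SVM (28 vs 45 seconds), and the extra 26 seconds over Naïve Bayes is a small price for an 18-point gain in recall.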
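The comparison can be made explicit with a few lines of Python over the table's figures. Selecting by recall reflects the business goal of catching at-risk employees; the dictionary layout and variable names are illustrative.

```python
# Figures copied from the case-study table.
results = {
    "Naive Bayes": {"accuracy": 0.76, "precision": 0.71, "recall": 0.68, "train_seconds": 2},
    "SVM (RBF kernel)": {"accuracy": 0.84, "precision": 0.82, "recall": 0.79, "train_seconds": 45},
    "Gradient Boosting": {"accuracy": 0.89, "precision": 0.87, "recall": 0.86, "train_seconds": 28},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

best = max(results, key=lambda name: results[name]["recall"])

# Recall advantage of the best model, in percentage points.
recall_gain_points = {
    name: round((results[best]["recall"] - m["recall"]) * 100)
    for name, m in results.items() if name != best
}
f1_scores = {name: f1(m["precision"], m["recall"]) for name, m in results.items()}

print(best, recall_gain_points)
print({name: round(score, 3) for name, score in f1_scores.items()})
```

Ranking by F1 gives the same ordering as ranking by recall for these three models.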
                

Today's Workshop Activities

Hands-On Practice

Activity 1: Naïve Bayes Email Classification

Build a spam filter using Naïve Bayes

Calculate probabilities manually

Implement in Orange Data Mining

Activity 2: SVM Customer Segmentation

Classify customer purchase likelihood

Experiment with different kernels

Visualize decision boundaries

Activity 3: Model Comparison Challenge

Apply all three methods to employee attrition dataset

Compare performance metrics

Recommend best approach with justification

Key Takeaways

                    Naïve Bayes
                    Use when: You need fast, simple classification with independent features (text classification, real-time systems)
                    
Remember: Based on probability, assumes feature independence
                

                    Support Vector Machines
                    Use when: You have complex non-linear boundaries and can afford training time
                    
Remember: Finds maximum margin boundary, uses kernel trick for complex patterns
                

                    Gradient Boosting
                    Use when: You need highest accuracy on structured data and have time for tuning
                    
Remember: Sequential error correction, requires careful parameter tuning
                

Next Steps: Practice with real datasets, experiment with different parameters, and always validate your results with appropriate metrics!