Decision Trees & Random Forests

Supervised Learning for Classification and Regression

Quick Poll: Have you made a decision tree today?

Yes, consciously
Yes, unconsciously
No
Not sure

Understand

How decision trees make predictions through hierarchical rule-based decisions

Build

Random forests to improve prediction accuracy and reduce overfitting

Apply

Tree-based methods to real-world classification and regression problems

Real-World Decision Making

How do you decide which movie to watch?

Netflix Recommendations

Personalized content filtering

Uses viewing history, ratings, and similar user behavior to predict preferences through sophisticated decision tree ensembles.

Loan Approvals

Financial risk assessment

Banks use income, credit score, employment history, and debt-to-income ratio to determine creditworthiness automatically.

Medical Diagnosis

Clinical decision support

Symptoms, test results, and patient history are processed through diagnostic trees to suggest potential conditions and treatments.

Key Insight: All these systems use the same fundamental approach - asking a series of structured questions to reach a decision, just like human reasoning but at scale.

The Classification Challenge

Can we teach a computer to classify customer responses?

Retail Campaign Dataset Sample

Age Income (K) Email Open Rate Loyalty Member Campaign Response
34 65.2 0.45 Yes Responded
22 28.7 0.12 No No Response
45 89.3 0.67 Yes Responded

Which features do you think matter most for predicting campaign response?

Age
Income
Email Open Rate
Loyalty Membership

Today's Journey

From Simple Trees to Powerful Forests

Decision
Trees
Information
Theory
Random
Forests
Hands-on
Practice

Prediction: Which will be more accurate?

Single Decision Tree
Random Forest (Multiple Trees)

Timeline: 90 minutes to master tree-based machine learning

What is a Decision Tree?

A series of yes/no questions leading to an answer

Should I bring an umbrella?
Is it cloudy?
Yes → Bring it
No → Leave it
Check forecast?
Rain → Bring it
Sun → Leave it

Key Insight: Decision trees mirror human decision-making processes. We naturally think in terms of sequential questions and logical rules.

In machine learning, we formalize this intuitive process using mathematical criteria to determine the optimal sequence of questions.

Tree Anatomy

Understanding the structure

Income ≥ $50K?

Yes

Age ≥ 30?

Yes

Responded

No

No Response

No

Loyalty Member?

Yes

Responded

No

No Response

Tree Components

  • Root Node: First decision point (Income ≥ $50K?)
  • Internal Nodes: Intermediate decisions (Age ≥ 30?, Loyalty Member?)
  • Leaf Nodes: Final predictions (Responded, No Response)
  • Branches: Possible outcomes of each decision

Sample Data Leading to Tree

IncomeAgeLoyaltyResponse
$75K35YesYes
$30K25YesYes
$45K22NoNo

Building Your First Tree (Demo)

Step-by-step tree construction

Bank Loan Dataset

Income Credit Score Age Loan Approved
$80K75035Yes
$35K62028No
$65K71042Yes
$25K58025No
$90K78038Yes
$40K65031No

Step 1: Choose Best Split

Calculate information gain for each feature

Step 2: Split Dataset

Divide data based on chosen feature

Step 3: Repeat Process

Apply same logic to each subset

Step 4: Stop Condition

When nodes are pure or depth limit reached

Resulting Decision Tree:

Best first split: Credit Score ≥ 700

Left branch: Income ≥ $60K → Final decision

Right branch: All approved (pure node)

Information Gain Concept

How do we choose the best questions?

Before Split: Mixed Results

Customer TypeCountResponse Rate
All Customers10050% (Mixed)

High Uncertainty: Equal mix of responses

After Split: Reduced "Messiness"

Customer TypeCountResponse Rate
High Income4080% (Clear)
Low Income6030% (Clear)

Low Uncertainty: Clear patterns in each group

Information Gain = Entropy(Parent) - Weighted Average(Entropy(Children))

Goal: Maximize information gain by choosing splits that create the most homogeneous subgroups

Intuition: Good splits transform confusion into clarity. We want each resulting group to be as "pure" as possible in terms of the target outcome.

Entropy Explained Simply

Entropy = measure of disorder/uncertainty

High Entropy (Mixed)

Entropy ≈ 1.58

Highly unpredictable

Low Entropy (Sorted)

Entropy = 0

Completely predictable

Real Example: Customer Campaign Response

All Customers

50 Responded, 50 No Response

Entropy = 1.0

High Income Group

80 Responded, 20 No Response

Entropy = 0.72

Loyalty Members

95 Responded, 5 No Response

Entropy = 0.29

Entropy = -Σ p(i) × log₂(p(i))

where p(i) is the proportion of samples belonging to class i

Gini Index Alternative

Another way to measure purity

What is Purity in Decision Trees?

Purity means how "unmixed" or "homogeneous" a group is. Think of it like:

  • Pure milk: Only milk, no water mixed in
  • Pure customer group: All customers behave the same way (all respond or all don't respond)
  • Impure mixture: 50% responders, 50% non-responders = very mixed/impure
  • High purity: 95% responders, 5% non-responders = almost pure

Goal: Split data to create the purest possible groups - where customers in each group behave as similarly as possible.

Entropy

  • Range: 0 to log₂(classes)
  • Based on information theory
  • More computationally expensive
  • Emphasizes balanced splits
  • Formula: -Σ p(i) × log₂(p(i))

Gini Index

  • Range: 0 to 0.5 (binary classification)
  • Based on probability theory
  • Faster to compute
  • Emphasizes largest class
  • Formula: 1 - Σ p(i)²

Interactive Calculation: 2-Class Example

Dataset: 60 customers responded, 40 did not respond

Calculate Gini Index:

p(Responded) = 60/100 = 0.6

p(No Response) = 40/100 = 0.4

Gini = 1 - (0.6² + 0.4²) = 1 - (0.36 + 0.16) = 0.48

Complete Solution:

Gini Index = 1 - (0.6² + 0.4²) = 1 - (0.36 + 0.16) = 1 - 0.52 = 0.48

Interpretation: Gini = 0.48 indicates moderate impurity. Lower values (closer to 0) represent more homogeneous groups, while higher values indicate more mixed groups.

Hands-On Activity 1

Calculate the best split for campaign response prediction

Customer Dataset (12 records)

Age Income Email Rate Response
2535K0.2No
3565K0.6Yes
4585K0.8Yes
2840K0.3No
5295K0.7Yes
3155K0.4Yes
4275K0.9Yes
2945K0.1No
3870K0.5Yes
3350K0.3No
4790K0.8Yes
2638K0.2No

Key Formulas

Entropy = -Σ p(i) × log₂(p(i))
Information Gain = Entropy(Parent) - Weighted Average(Entropy(Children))

where p(i) is the proportion of samples in class i

Your Task: Find Best Split

1. Current entropy of all data:

7 Yes, 5 No → Entropy =

2. Test split: Email Rate ≥ 0.5

Left branch (≥0.5): Yes, No

Right branch (<0.5): Yes, No

3. Calculate entropies for each branch:

Entropy(Left) =

Entropy(Right) =

4. Calculate weighted average entropy:

Weighted Avg = (/12) × + (/12) ×

=

5. Calculate information gain:

Information Gain = - =

Complete Solution:

1. Initial entropy: H(S) = -7/12×log₂(7/12) - 5/12×log₂(5/12) = 0.98

2. Email Rate ≥ 0.5: 6 Yes, 0 No | Email Rate < 0.5: 1 Yes, 5 No

3. Entropy(Left) = 0, Entropy(Right) = 0.65

4. Weighted Avg = (6/12)×0 + (6/12)×0.65 = 0.325

5. Information Gain = 0.98 - 0.325 = 0.655

Conclusion: This is an excellent split with high information gain!

Decision Tree Advantages & Limitations

Understanding when to use single trees

Advantages

  • Interpretability: Easy to understand and explain to stakeholders
  • No preprocessing: Handles both numerical and categorical data
  • Feature selection: Automatically identifies important variables
  • Non-linear relationships: Captures complex interactions
  • Fast training: Efficient algorithm for large datasets
  • Rule extraction: Can extract explicit if-then rules

Limitations

  • Overfitting: Can create overly complex trees
  • Instability: Small data changes cause large tree changes
  • Bias: Favors features with more levels
  • Linear relationships: Poor at capturing linear patterns
  • Greedy algorithm: May not find globally optimal tree
  • Predictive accuracy: Often outperformed by ensemble methods

Key Insight: Single decision trees are excellent for interpretation and quick insights, but their limitations motivate ensemble methods like Random Forests for improved predictive performance.

Best Use Cases

Exploratory analysis, rule generation, and situations requiring interpretability

Alternative Solutions

Random Forests for accuracy, pruning for overfitting, gradient boosting for performance

The Wisdom of Crowds

Why ask one expert when you can ask many?

Guess the Number of Jellybeans in the Jar

⋮ ⋮ ⋮

Glass Jar with Colorful Jellybeans

150-200
250-300
350-400
450-500

Individual Expert Prediction

  • Single point of view
  • Personal biases affect judgment
  • Can be significantly wrong
  • High variance in predictions
  • Example: One person guesses 275 jellybeans

Crowd Average Prediction

  • Multiple diverse perspectives
  • Individual biases cancel out
  • Consistently more accurate
  • Lower variance, higher reliability
  • Example: 100 people average to 287 jellybeans

Francis Galton's Discovery (1907): The average guess of 787 people at a county fair was 1,207 pounds for an ox weight. The actual weight was 1,198 pounds - remarkably close!

This principle forms the foundation of ensemble methods in machine learning.

Random Forest Overview

Multiple trees → majority vote → better predictions

Tree 1

Prediction: Responded

Based on Income & Age

Tree 2

Prediction: No Response

Based on Email Rate & Loyalty

Tree 3

Prediction: Responded

Based on Age & Transaction History

Majority Vote Process

2

Responded

vs

1

No Response

Final Prediction: Responded

Sample Customer: Jane, Age 35, Income $65K, Email Rate 0.6, Loyalty Member

TreeFeatures UsedPredictionConfidence
1Income, AgeResponded0.8
2Email Rate, LoyaltyNo Response0.6
3Age, Transaction HistoryResponded0.9

Bootstrapping Explained

Creating different training sets from same data

Original Dataset (N=12 customers)

IDAgeIncomeResponse
C12535KNo
C23565KYes
C34585KYes
C42840KNo
C55295KYes
C63155KYes
C74275KYes
C82945KNo
C93870KYes
C103350KNo
C114790KYes
C122638KNo

Bootstrap Samples (with replacement)

Bootstrap Sample 1

Selected: C1, C3, C3, C5, C7, C2, C9, C11, C4, C6, C8, C1

Note: C3 and C1 appear twice

Bootstrap Sample 2

Selected: C2, C4, C6, C8, C10, C12, C5, C7, C9, C2, C11, C6

Note: C2 and C6 appear twice

Bootstrap Sample 3

Selected: C5, C1, C9, C11, C3, C7, C4, C8, C10, C12, C5, C9

Note: C5 and C9 appear twice

Key Principle: Each bootstrap sample is the same size as the original dataset but contains different combinations of records. This creates diverse training sets for each tree.

Physical Simulation Process:
  1. Place 12 customer cards in a bag
  2. Draw one card, record it, put it back
  3. Repeat 12 times for one bootstrap sample
  4. Create new bag for next bootstrap sample

Feature Randomness

Each tree considers only some features

Available Features in Dataset

  • 1. Age
  • 2. Income
  • 3. Email Open Rate
  • 4. Loyalty Member
  • 5. Previous Purchases
  • 6. Days Since Last Purchase
  • 7. Average Transaction Value
  • 8. Preferred Channel

Total Features: 8
Typical Selection: √8 ≈ 3 features per split

Feature Subsets for Different Trees

Tree 1 Features

Age, Income, Email Rate

Tree 2 Features

Loyalty, Previous Purchases, Channel

Tree 3 Features

Income, Days Since Purchase, Transaction Value

Without Feature Randomness

  • Strong features dominate all trees
  • Trees become very similar
  • Limited diversity in predictions
  • Higher correlation between trees
  • Reduced ensemble benefit

With Feature Randomness

  • Each tree focuses on different aspects
  • Diverse decision-making strategies
  • Captures various feature interactions
  • Lower correlation between trees
  • Improved ensemble performance

Benefit: Feature randomness prevents any single strong predictor from dominating the forest, ensuring each tree contributes unique insights to the final prediction.

Random Forest Algorithm

4 simple steps with clear process

1. Bootstrap Sampling

Create B bootstrap samples from original dataset with replacement

Example: For 1000 customers, create 100 different samples of 1000 customers each

2. Random Features

For each split, randomly select √p features from p total features

Example: From 8 features, randomly choose 3 for each decision node

3. Build Trees

Train decision trees on bootstrap samples using random feature subsets

Example: Grow 100 trees, each trained on different data and features

4. Aggregate Predictions

Combine predictions through majority vote (classification) or averaging (regression)

Example: If 70 trees predict "Responded", final prediction is "Responded"

Implementation Example

Original Data: 1000 customers, 8 features

Bootstrap Samples: 100 samples of 1000 customers each

Feature Selection: √8 ≈ 3 features per split

Trees: 100 decision trees trained independently

Prediction: Majority vote from all 100 trees

Sample 1
Features: Age, Income, Email

Sample 2
Features: Loyalty, Purchases, Channel

96 more trees...

Key Insight: The algorithm's simplicity belies its power. Random sampling in both data and features creates diversity, while aggregation harnesses the collective intelligence of the forest.

Model Evaluation

Understanding performance through confusion matrix and metrics

Confusion Matrix

Predicted: No
Predicted: Yes
Actual: No
85
True Negative
15
False Positive
Actual: Yes
10
False Negative
90
True Positive
Sample Dataset Results (200 customers)

True Negatives: 85 correctly predicted non-responders

True Positives: 90 correctly predicted responders

False Positives: 15 incorrectly predicted responders

False Negatives: 10 missed responders

Performance Metrics

Accuracy

Formula: (TP + TN) / Total

Result: (90 + 85) / 200 = 87.5%

Overall correctness of predictions

Precision

Formula: TP / (TP + FP)

Result: 90 / (90 + 15) = 85.7%

Of predicted responders, how many actually responded?

Recall (Sensitivity)

Formula: TP / (TP + FN)

Result: 90 / (90 + 10) = 90.0%

Of actual responders, how many did we find?

F1-Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)

Result: 2 × (0.857 × 0.90) / (0.857 + 0.90) = 87.8%

Harmonic mean of precision and recall

Interpretation: This model shows strong performance with high accuracy and balanced precision/recall. The 87.5% accuracy means we correctly identify customer response 7 out of 8 times.

Random Forest vs Single Tree

Accuracy, interpretability, and speed trade-offs

Single Decision Tree

  • Accuracy: Moderate (70-80% typical)
  • Interpretability: Excellent - clear rules
  • Training Speed: Very fast
  • Prediction Speed: Very fast
  • Overfitting: High risk
  • Stability: Low - sensitive to data changes
  • Feature Importance: Limited perspective
  • Memory Usage: Very low

Random Forest

  • Accuracy: High (85-95% typical)
  • Interpretability: Moderate - feature importance
  • Training Speed: Slower (parallel processing helps)
  • Prediction Speed: Moderate
  • Overfitting: Low risk - built-in regularization
  • Stability: High - robust to data variations
  • Feature Importance: Comprehensive ranking
  • Memory Usage: Higher (stores multiple trees)

Scenario Decision Guide

Categorize these scenarios for Single Tree vs Random Forest:

Scenario A: Medical Diagnosis

Doctor needs to understand exactly why the system made a diagnosis recommendation.

Single Tree
Random Forest
Scenario B: Credit Card Fraud

Bank needs highest possible accuracy to minimize false positives and false negatives.

Single Tree
Random Forest
Scenario C: Mobile App

Real-time recommendations with millisecond response requirements.

Single Tree
Random Forest
Scenario D: Marketing Campaign

Optimize customer targeting with large dataset and complex patterns.

Single Tree
Random Forest

Evaluation Quiz

Test your understanding: Calculate model performance

Scenario: Email Marketing Campaign Prediction

Your Random Forest model predicted customer responses for 500 customers. Here are the results:

Predicted: No Response
Predicted: Responded
Actual: No Response
220
30
Actual: Responded
40
210

Total Customers: 500

True Negatives (TN): 220

False Positives (FP): 30

False Negatives (FN): 40

True Positives (TP): 210

Calculate Performance Metrics

1. Accuracy

Formula: (TP + TN) / Total

Calculation: ( + ) /

Result: or %

2. Precision

Formula: TP / (TP + FP)

Calculation: / ( + )

Result: or %

3. Recall (Sensitivity)

Formula: TP / (TP + FN)

Calculation: / ( + )

Result: or %

4. F1-Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)

Using your calculated Precision and Recall values:

F1-Score: or %

Business Interpretation

After calculating the metrics above, write a brief interpretation:

Summary and Next Steps

From trees to forests: A journey in ensemble learning

What We Learned

Decision trees make predictions through hierarchical splits, Random forests combine multiple trees for better accuracy, Ensemble methods leverage the wisdom of crowds

Key Takeaways

Single trees are interpretable but prone to overfitting, Random forests trade interpretability for robustness, Bootstrap sampling and feature randomness create diversity

Practical Applications

Customer segmentation and campaign optimization, Risk assessment and fraud detection, Recommendation systems and personalization

Decision
Trees ✓
Random
Forests ✓
Next:
Boosting
Advanced:
Deep Learning

Immediate Next Steps

  • Practice with the retail campaign dataset
  • Implement Random Forest in Python/R
  • Compare performance with single trees
  • Analyze feature importance rankings
  • Experiment with hyperparameter tuning

Advanced Topics to Explore

  • Gradient Boosting (XGBoost, LightGBM)
  • Out-of-bag error estimation
  • Handling imbalanced datasets
  • Feature engineering for tree-based models
  • Model interpretability techniques (SHAP, LIME)

Remember: Machine learning is about finding patterns in data to make predictions. Tree-based methods excel at capturing non-linear relationships while remaining relatively interpretable - making them invaluable tools in the data scientist's toolkit.

The Problem of Overfitting

When your model memorizes instead of learning

Training Data: Customer Campaign Response

CustomerAgeIncomeEmail RateResponse
A2535K0.2No
B3565K0.6Yes
C4585K0.8Yes
D2840K0.3No
E5295K0.7Yes
F3155K0.4Yes

Training Accuracy: 100% ✓

Overfitted Decision Tree

Age ≥ 26?

Yes

Income ≥ 36K?

Yes

Email ≥ 0.35?

Yes

Yes

No

No

No

No

No

No

New Customer Test Data

CustomerAgeIncomeEmail RateActualPredicted
G3060K0.5YesNo
H4070K0.6YesYes
I3345K0.2NoYes
J2750K0.3NoNo

Test Accuracy: 50% ✗

Signs of Overfitting

  • Perfect training accuracy: 100% correct on training data
  • Poor test performance: Low accuracy on new data
  • Complex tree structure: Too many splits and branches
  • Memorization: Learns specific examples, not patterns
  • High variance: Small data changes cause big model changes

Why Overfitting Happens

  • Too much complexity: Tree grows too deep with too many rules
  • Limited training data: Not enough examples to learn true patterns
  • Noise in data: Model learns from random variations
  • No stopping criteria: Tree keeps splitting until pure leaves
  • Feature abundance: Too many irrelevant features confuse the model

Business Impact: An overfitted marketing model might achieve 100% accuracy on historical campaigns but fail completely on new customers, leading to wasted marketing budget and poor targeting decisions.

The Problem of Underfitting

When your model is too simple to capture patterns

Training Data: Customer Campaign Response

CustomerAgeIncomeEmail RateLoyaltyResponse
A2535K0.2NoNo
B3565K0.6YesYes
C4585K0.8YesYes
D2840K0.3NoNo
E5295K0.7YesYes
F3155K0.4YesYes
G2945K0.1NoNo
H3870K0.5YesYes

Underfitted Decision Tree (Too Simple)

Income ≥ $50K?

Yes

Responded

No

No Response

Only considers income, ignores other important features

Training Performance

Accuracy: 75% (6/8 correct)

Misses important patterns in the data

Test Performance

Accuracy: 73% (similar to training)

Consistently mediocre performance

Signs of Underfitting

  • Poor training accuracy: Low performance on training data
  • Poor test performance: Similar low accuracy on new data
  • Overly simple model: Too few splits or rules
  • High bias: Makes systematic errors consistently
  • Ignores important features: Misses relevant information

Why Underfitting Happens

  • Too simple model: Tree depth limited too severely
  • Insufficient features: Important variables not included
  • Poor feature selection: Wrong variables chosen for splits
  • Early stopping: Tree building stopped too soon
  • Over-regularization: Too many constraints on model complexity

Fixing Overfitting

Solutions: Pruning, cross-validation, ensemble methods, more training data

Sweet Spot

Goal: Balance complexity and generalization for optimal performance

Fixing Underfitting

Solutions: Deeper trees, more features, feature engineering, reduce regularization

The Goldilocks Principle: Your model should be complex enough to capture important patterns but simple enough to generalize to new data. Random Forests help achieve this balance through ensemble averaging.