How decision trees make predictions through hierarchical rule-based decisions
How random forests improve prediction accuracy and reduce overfitting
How to apply tree-based methods to real-world classification and regression problems
Personalized content filtering
Financial risk assessment
Clinical decision support
Key Insight: All these systems use the same fundamental approach - asking a series of structured questions to reach a decision, just like human reasoning but at scale.
| Age | Income (K) | Email Open Rate | Loyalty Member | Campaign Response |
|---|---|---|---|---|
| 34 | 65.2 | 0.45 | Yes | Responded |
| 22 | 28.7 | 0.12 | No | No Response |
| 45 | 89.3 | 0.67 | Yes | Responded |
Timeline: 90 minutes to master tree-based machine learning
Key Insight: Decision trees mirror human decision-making processes. We naturally think in terms of sequential questions and logical rules.
In machine learning, we formalize this intuitive process using mathematical criteria to determine the optimal sequence of questions.
| Income | Age | Loyalty | Response |
|---|---|---|---|
| $75K | 35 | Yes | Yes |
| $30K | 25 | Yes | Yes |
| $45K | 22 | No | No |
| Income | Credit Score | Age | Loan Approved |
|---|---|---|---|
| $80K | 750 | 35 | Yes |
| $35K | 620 | 28 | No |
| $65K | 710 | 42 | Yes |
| $25K | 580 | 25 | No |
| $90K | 780 | 38 | Yes |
| $40K | 650 | 31 | No |
1. Calculate information gain for each feature
2. Divide the data based on the chosen feature
3. Apply the same logic to each subset
4. Stop when nodes are pure or the depth limit is reached
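Best first split: Credit Score ≥ 700
Credit Score ≥ 700 branch: all three applicants approved (pure node)
Credit Score < 700 branch: all three applicants denied (pure node), so a further split such as Income ≥ $60K is not needed on this sample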
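The four steps above can be sketched as a small recursive function. This is a minimal illustration in plain Python on the six-row loan table, not a production implementation. The helper names (`best_split`, `build_tree`, `predict`) are hypothetical, candidate thresholds are taken as midpoints between neighbouring values (an assumption), and ties in information gain go to the first feature listed. On this tiny sample Income and Age can also separate the classes perfectly, so listing `credit_score` first is what makes the sketch agree with the split shown above.

```python
from collections import Counter
from math import log2

# Loan table from above: (income in $K, credit score, age, approved)
ROWS = [
    (80, 750, 35, "Yes"), (35, 620, 28, "No"), (65, 710, 42, "Yes"),
    (25, 580, 25, "No"), (90, 780, 38, "Yes"), (40, 650, 31, "No"),
]
# Feature name -> column index. Listing credit_score first is an assumption:
# ties in information gain go to the earliest feature.
FEATURES = {"credit_score": 1, "income": 0, "age": 2}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def best_split(rows):
    """Step 1: find the (gain, feature, threshold) with the highest information gain."""
    best = (0.0, None, None)
    for name, col in FEATURES.items():
        values = sorted({r[col] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold midway between neighbouring values
            ge = [r for r in rows if r[col] >= t]
            lt = [r for r in rows if r[col] < t]
            weighted = (len(ge) * entropy(ge) + len(lt) * entropy(lt)) / len(rows)
            gain = entropy(rows) - weighted
            if gain > best[0]:
                best = (gain, name, t)
    return best

def build_tree(rows, depth=0, max_depth=3):
    labels = Counter(r[-1] for r in rows)
    gain, name, t = best_split(rows)
    # Step 4: stop when the node is pure, no split helps, or the depth limit is reached
    if len(labels) == 1 or name is None or depth == max_depth:
        return labels.most_common(1)[0][0]
    col = FEATURES[name]
    # Steps 2 and 3: divide the data, then apply the same logic to each subset
    return {
        "feature": name,
        "threshold": t,
        "yes": build_tree([r for r in rows if r[col] >= t], depth + 1, max_depth),
        "no": build_tree([r for r in rows if r[col] < t], depth + 1, max_depth),
    }

def predict(tree, row):
    while isinstance(tree, dict):
        col = FEATURES[tree["feature"]]
        tree = tree["yes"] if row[col] >= tree["threshold"] else tree["no"]
    return tree

tree = build_tree(ROWS)
print(tree)
print(predict(tree, (50, 720, 30)))  # hypothetical new applicant
```

The learned threshold is 680 (the midpoint between 650 and 710), which divides these six rows exactly as Credit Score ≥ 700 does.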
| Customer Type | Count | Response Rate |
|---|---|---|
| All Customers | 100 | 50% (Mixed) |
High Uncertainty: Equal mix of responses
| Customer Type | Count | Response Rate |
|---|---|---|
| High Income | 40 | 80% (Clear) |
| Low Income | 60 | 30% (Clear) |
Low Uncertainty: Clear patterns in each group
Goal: Maximize information gain by choosing splits that create the most homogeneous subgroups
Intuition: Good splits transform confusion into clarity. We want each resulting group to be as "pure" as possible in terms of the target outcome.
Entropy ≈ 1.58
Highly unpredictable
Entropy = 0
Completely predictable
50 Responded, 50 No Response
Entropy = 1.0
80 Responded, 20 No Response
Entropy = 0.72
95 Responded, 5 No Response
Entropy = 0.29
Entropy: H(S) = -Σ p(i) × log₂ p(i), summed over all classes i, where p(i) is the proportion of samples belonging to class i
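The entropy values quoted above can be reproduced with a few lines of Python. This is a minimal sketch assuming the standard Shannon entropy in bits (base-2 logarithm). The function name `entropy` and the choice to pass class counts rather than raw labels are illustrative. The three-class case `[1, 1, 1]` is included on the assumption that the 1.58 figure refers to three equally likely classes.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits for a list of class counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing to the sum
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

for counts in ([50, 50], [80, 20], [95, 5], [100, 0], [1, 1, 1]):
    print(counts, round(entropy(counts), 2))
```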
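Purity measures how "unmixed" or homogeneous a group is: a group in which every customer responded is perfectly pure, while a 50/50 mix of responders and non-responders is as impure as possible.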
Goal: Split data to create the purest possible groups - where customers in each group behave as similarly as possible.
Dataset: 60 customers responded, 40 did not respond
Calculate Gini Index:
p(Responded) = 60/100 = 0.6
p(No Response) = 40/100 = 0.4
Gini = 1 - (0.6² + 0.4²) = 1 - (0.36 + 0.16) = 1 - 0.52 = 0.48
Interpretation: For two classes the Gini index ranges from 0 (perfectly pure) to 0.5 (an even 50/50 mix), so Gini = 0.48 indicates a highly mixed group. Lower values (closer to 0) represent more homogeneous groups.
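The same calculation as a short Python sketch, useful for checking other class mixes. The function name `gini` is illustrative. For a two-class problem the index runs from 0 for a pure group to 0.5 for an even split.

```python
def gini(counts):
    """Gini index for a list of class counts: 1 minus the sum of squared proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([60, 40]), 2))   # the 60 / 40 dataset above
print(gini([50, 50]))             # evenly mixed two-class group
print(gini([100, 0]))             # pure group
```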
| Age | Income | Email Rate | Response |
|---|---|---|---|
| 25 | 35K | 0.2 | No |
| 35 | 65K | 0.6 | Yes |
| 45 | 85K | 0.8 | Yes |
| 28 | 40K | 0.3 | No |
| 52 | 95K | 0.7 | Yes |
| 31 | 55K | 0.4 | Yes |
| 42 | 75K | 0.9 | Yes |
| 29 | 45K | 0.1 | No |
| 38 | 70K | 0.5 | Yes |
| 33 | 50K | 0.3 | No |
| 47 | 90K | 0.8 | Yes |
| 26 | 38K | 0.2 | No |
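Information Gain = H(S) - Σ (|Sₖ| / |S|) × H(Sₖ), summed over the branches Sₖ of a split, with H(S) = -Σ p(i) × log₂ p(i), where p(i) is the proportion of samples in class i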
1. Current entropy of all data:
7 Yes, 5 No → Entropy = ___
2. Test split: Email Rate ≥ 0.5
Left branch (≥0.5): ___ Yes, ___ No
Right branch (<0.5): ___ Yes, ___ No
3. Calculate entropies for each branch:
Entropy(Left) = ___
Entropy(Right) = ___
4. Calculate weighted average entropy:
Weighted Avg = (___/12) × ___ + (___/12) × ___
= ___
5. Calculate information gain:
Information Gain = ___ - ___ = ___
1. Initial entropy: H(S) = -7/12×log₂(7/12) - 5/12×log₂(5/12) = 0.98
2. Email Rate ≥ 0.5: 6 Yes, 0 No | Email Rate < 0.5: 1 Yes, 5 No
3. Entropy(Left) = 0, Entropy(Right) = 0.65
4. Weighted Avg = (6/12)×0 + (6/12)×0.65 = 0.325
5. Information Gain = 0.98 - 0.325 = 0.655
Conclusion: This is an excellent split with high information gain!
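The worked solution can be checked in code. This is a small sketch on the 12-customer table, keeping only the Email Rate and Response columns. The function names are illustrative, and "left" follows the convention above of the branch where the condition (≥ 0.5) holds.

```python
from math import log2

# (email open rate, response) for the 12 customers in the table above
CUSTOMERS = [
    (0.2, "No"), (0.6, "Yes"), (0.8, "Yes"), (0.3, "No"), (0.7, "Yes"), (0.4, "Yes"),
    (0.9, "Yes"), (0.1, "No"), (0.5, "Yes"), (0.3, "No"), (0.8, "Yes"), (0.2, "No"),
]

def entropy(labels):
    n = len(labels)
    props = [labels.count(v) / n for v in set(labels)]
    return sum(-p * log2(p) for p in props)

def information_gain(rows, threshold):
    parent = [y for _, y in rows]
    left = [y for x, y in rows if x >= threshold]   # condition holds
    right = [y for x, y in rows if x < threshold]   # condition fails
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
    return entropy(parent) - weighted

gain = information_gain(CUSTOMERS, 0.5)
print(round(gain, 3))
```

Calling `information_gain` with other thresholds (for example 0.35) shows how the gain changes as the cut point moves.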
Key Insight: Single decision trees are excellent for interpretation and quick insights, but their limitations motivate ensemble methods like Random Forests for improved predictive performance.
Best suited to: exploratory analysis, rule generation, and situations requiring interpretability
Ways to address the limitations: Random Forests for accuracy, pruning for overfitting, gradient boosting for performance
[Image: glass jar with colorful jellybeans]
Francis Galton's Discovery (1907): At a livestock fair, 787 people guessed the weight of an ox. The median guess was 1,207 pounds; the actual weight was 1,198 pounds - remarkably close!
This principle forms the foundation of ensemble methods in machine learning.
Prediction: Responded
Based on Income & Age
Prediction: No Response
Based on Email Rate & Loyalty
Prediction: Responded
Based on Age & Transaction History
Vote: Responded (2 trees) vs No Response (1 tree), so the forest predicts Responded
| Tree | Features Used | Prediction | Confidence |
|---|---|---|---|
| 1 | Income, Age | Responded | 0.8 |
| 2 | Email Rate, Loyalty | No Response | 0.6 |
| 3 | Age, Transaction History | Responded | 0.9 |
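The vote in the table can be tallied in a few lines. This is a minimal sketch of simple majority voting. The confidence values are carried along but deliberately ignored, which is an assumption about how the forest combines its trees.

```python
from collections import Counter

# (prediction, confidence) for the three trees in the table above
tree_outputs = [("Responded", 0.8), ("No Response", 0.6), ("Responded", 0.9)]

# Count one vote per tree and take the most common prediction
votes = Counter(prediction for prediction, _ in tree_outputs)
winner, n_votes = votes.most_common(1)[0]
print(winner, n_votes)
```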
| ID | Age | Income | Response |
|---|---|---|---|
| C1 | 25 | 35K | No |
| C2 | 35 | 65K | Yes |
| C3 | 45 | 85K | Yes |
| C4 | 28 | 40K | No |
| C5 | 52 | 95K | Yes |
| C6 | 31 | 55K | Yes |
| C7 | 42 | 75K | Yes |
| C8 | 29 | 45K | No |
| C9 | 38 | 70K | Yes |
| C10 | 33 | 50K | No |
| C11 | 47 | 90K | Yes |
| C12 | 26 | 38K | No |
Selected: C1, C3, C3, C5, C7, C2, C9, C11, C4, C6, C8, C1
Note: C3 and C1 appear twice
Selected: C2, C4, C6, C8, C10, C12, C5, C7, C9, C2, C11, C6
Note: C2 and C6 appear twice
Selected: C5, C1, C9, C11, C3, C7, C4, C8, C10, C12, C5, C9
Note: C5 and C9 appear twice
Key Principle: Each bootstrap sample is the same size as the original dataset but contains different combinations of records. This creates diverse training sets for each tree.
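Bootstrap sampling is simply drawing with replacement. The sketch below uses Python's `random.choices` on the twelve customer IDs. The seed value is arbitrary, and the "out-of-bag" list (records that were never drawn) is an addition for illustration that the slides do not discuss.

```python
import random

customer_ids = [f"C{i}" for i in range(1, 13)]  # C1 .. C12

def bootstrap_sample(items, rng):
    """Draw len(items) records with replacement."""
    return rng.choices(items, k=len(items))

rng = random.Random(42)  # fixed, arbitrary seed so the draw is repeatable
sample = bootstrap_sample(customer_ids, rng)
# Records never selected in this draw
out_of_bag = sorted(set(customer_ids) - set(sample), key=lambda c: int(c[1:]))
print(sample)
print(out_of_bag)
```

Each call with the same `rng` produces a different sample, which is what gives every tree its own training set.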
Total Features: 8
Typical Selection: √8 ≈ 3 features per split
Age, Income, Email Rate
Loyalty, Previous Purchases, Channel
Income, Days Since Purchase, Transaction Value
Benefit: Feature randomness prevents any single strong predictor from dominating the forest, ensuring each tree contributes unique insights to the final prediction.
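Feature randomness can be sketched the same way. The eight feature names below are the ones that appear across the three example subsets above. Treating them as the full feature list, and rounding √8 to the nearest integer, are assumptions made for illustration.

```python
import random
from math import sqrt

FEATURES = ["Age", "Income", "Email Rate", "Loyalty", "Previous Purchases",
            "Channel", "Days Since Purchase", "Transaction Value"]

def random_feature_subset(features, rng):
    k = round(sqrt(len(features)))  # sqrt(8) is about 2.83, rounded to 3
    return rng.sample(features, k)  # sampled without replacement

rng = random.Random(0)  # arbitrary seed
for _ in range(3):
    print(random_feature_subset(FEATURES, rng))
```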
1. Create B bootstrap samples from the original dataset, drawing with replacement
2. For each split, randomly select √p features from the p total features
3. Train decision trees on the bootstrap samples using the random feature subsets
4. Combine predictions through majority vote (classification) or averaging (regression)
Original Data: 1000 customers, 8 features
Bootstrap Samples: 100 samples of 1000 customers each
Feature Selection: √8 ≈ 3 features per split
Trees: 100 decision trees trained independently
Prediction: Majority vote from all 100 trees
Sample 1
Features: Age, Income, Email
Sample 2
Features: Loyalty, Purchases, Channel
⋮
96 more trees...
Key Insight: The algorithm's simplicity belies its power. Random sampling in both data and features creates diversity, while aggregation harnesses the collective intelligence of the forest.
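To make the algorithm concrete, here is a deliberately tiny forest in plain Python. It is a sketch under several stated simplifications, not how a library implements Random Forests: each "tree" is a one-split stump (so features per split and features per tree coincide), splits are scored with the Gini index, the data is the 12-customer table from the information gain example, and the forest has 25 trees rather than 100. All function names are hypothetical.

```python
import random
from collections import Counter
from math import sqrt

# ((age, income in $K, email open rate), response) from the 12-customer table
DATA = [
    ((25, 35, 0.2), "No"), ((35, 65, 0.6), "Yes"), ((45, 85, 0.8), "Yes"),
    ((28, 40, 0.3), "No"), ((52, 95, 0.7), "Yes"), ((31, 55, 0.4), "Yes"),
    ((42, 75, 0.9), "Yes"), ((29, 45, 0.1), "No"), ((38, 70, 0.5), "Yes"),
    ((33, 50, 0.3), "No"), ((47, 90, 0.8), "Yes"), ((26, 38, 0.2), "No"),
]

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(sample, feature_ids):
    """One-split tree: (feature, threshold, label if >=, label if <)."""
    labels = [y for _, y in sample]
    best_score = gini(labels)
    stump = (None, None, majority(labels), majority(labels))  # fallback: constant leaf
    for f in feature_ids:
        values = sorted({x[f] for x, _ in sample})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            ge = [y for x, y in sample if x[f] >= t]
            lt = [y for x, y in sample if x[f] < t]
            score = (len(ge) * gini(ge) + len(lt) * gini(lt)) / len(sample)
            if score < best_score:
                best_score, stump = score, (f, t, majority(ge), majority(lt))
    return stump

def fit_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(data[0][0])
    k = max(1, round(sqrt(n_features)))
    forest = []
    for _ in range(n_trees):
        sample = rng.choices(data, k=len(data))         # 1. bootstrap sample
        feature_ids = rng.sample(range(n_features), k)  # 2. random feature subset
        forest.append(fit_stump(sample, feature_ids))   # 3. train one tree
    return forest

def predict(forest, x):
    votes = [ge if f is None or x[f] >= t else lt for f, t, ge, lt in forest]
    return majority(votes)                              # 4. majority vote

forest = fit_forest(DATA)
print(predict(forest, (50, 90, 0.8)))  # hypothetical new customer
```

Replacing `fit_stump` with a deeper tree learner would move the sketch closer to the full algorithm, with the feature subset redrawn at every split.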
True Negatives: 85 correctly predicted non-responders
True Positives: 90 correctly predicted responders
False Positives: 15 incorrectly predicted responders
False Negatives: 10 missed responders
Formula: (TP + TN) / Total
Result: (90 + 85) / 200 = 87.5%
Overall correctness of predictions
Formula: TP / (TP + FP)
Result: 90 / (90 + 15) = 85.7%
Of predicted responders, how many actually responded?
Formula: TP / (TP + FN)
Result: 90 / (90 + 10) = 90.0%
Of actual responders, how many did we find?
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Result: 2 × (0.857 × 0.90) / (0.857 + 0.90) = 87.8%
Harmonic mean of precision and recall
Interpretation: This model shows strong performance with high accuracy and balanced precision/recall. The 87.5% accuracy means we correctly identify customer response 7 out of 8 times.
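The four metrics follow directly from the confusion-matrix counts, so they are easy to script. This is a minimal sketch with an illustrative function name. It assumes no denominator is zero, which holds for the counts above.

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted responders, how many responded
    recall = tp / (tp + fn)      # of actual responders, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The same function can be reused to check hand calculations for any other confusion matrix.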
Categorize these scenarios for Single Tree vs Random Forest:
Doctor needs to understand exactly why the system made a diagnosis recommendation.
Bank needs highest possible accuracy to minimize false positives and false negatives.
Real-time recommendations with millisecond response requirements.
Optimize customer targeting with large dataset and complex patterns.
Your Random Forest model predicted customer responses for 500 customers. Here are the results:
Total Customers: 500
True Negatives (TN): 220
False Positives (FP): 30
False Negatives (FN): 40
True Positives (TP): 210
Formula: (TP + TN) / Total
Calculation: (___ + ___) / ___
Result: ___ or ___%
Formula: TP / (TP + FP)
Calculation: ___ / (___ + ___)
Result: ___ or ___%
Formula: TP / (TP + FN)
Calculation: ___ / (___ + ___)
Result: ___ or ___%
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Using your calculated Precision and Recall values:
F1-Score: ___ or ___%
After calculating the metrics above, write a brief interpretation:
Decision trees make predictions through hierarchical splits; random forests combine multiple trees for better accuracy; ensemble methods leverage the wisdom of crowds.
Single trees are interpretable but prone to overfitting; random forests trade interpretability for robustness; bootstrap sampling and feature randomness create diversity.
Customer segmentation and campaign optimization; risk assessment and fraud detection; recommendation systems and personalization.
Remember: Machine learning is about finding patterns in data to make predictions. Tree-based methods excel at capturing non-linear relationships while remaining relatively interpretable - making them invaluable tools in the data scientist's toolkit.
| Customer | Age | Income | Email Rate | Response |
|---|---|---|---|---|
| A | 25 | 35K | 0.2 | No |
| B | 35 | 65K | 0.6 | Yes |
| C | 45 | 85K | 0.8 | Yes |
| D | 28 | 40K | 0.3 | No |
| E | 52 | 95K | 0.7 | Yes |
| F | 31 | 55K | 0.4 | Yes |
Training Accuracy: 100% ✓
| Customer | Age | Income | Email Rate | Actual | Predicted |
|---|---|---|---|---|---|
| G | 30 | 60K | 0.5 | Yes | No |
| H | 40 | 70K | 0.6 | Yes | Yes |
| I | 33 | 45K | 0.2 | No | Yes |
| J | 27 | 50K | 0.3 | No | No |
Test Accuracy: 50% ✗
Business Impact: An overfitted marketing model might achieve 100% accuracy on historical campaigns but fail completely on new customers, leading to wasted marketing budget and poor targeting decisions.
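The gap between training and test accuracy is the practical signal of overfitting, and it can be computed directly from the two tables. This sketch uses only the Actual and Predicted columns. The tree that produced the predictions is not reproduced here, and the 100% training accuracy is represented by predictions identical to the training labels.

```python
def accuracy(actual, predicted):
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Training customers A to F: the overfitted tree reproduces every label
train_actual = ["No", "Yes", "Yes", "No", "Yes", "Yes"]
train_predicted = list(train_actual)

# Test customers G to J, from the Actual and Predicted columns
test_actual = ["Yes", "Yes", "No", "No"]
test_predicted = ["No", "Yes", "Yes", "No"]

train_acc = accuracy(train_actual, train_predicted)
test_acc = accuracy(test_actual, test_predicted)
print(f"train {train_acc:.0%}, test {test_acc:.0%}, gap {train_acc - test_acc:.0%}")
```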
| Customer | Age | Income | Email Rate | Loyalty | Response |
|---|---|---|---|---|---|
| A | 25 | 35K | 0.2 | No | No |
| B | 35 | 65K | 0.6 | Yes | Yes |
| C | 45 | 85K | 0.8 | Yes | Yes |
| D | 28 | 40K | 0.3 | No | No |
| E | 52 | 95K | 0.7 | Yes | Yes |
| F | 31 | 55K | 0.4 | Yes | Yes |
| G | 29 | 45K | 0.1 | No | No |
| H | 38 | 70K | 0.5 | Yes | Yes |
Underfitted tree: a single Yes/No split that only considers income and ignores other important features
Training accuracy: 75% (6/8 correct)
Misses important patterns in the data
Test accuracy: 73% (similar to training)
Consistently mediocre performance
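An overly simple model can be illustrated with a single income rule. The threshold used by the tree on the slide is not stated, so the $70K cut-off below is a hypothetical value, chosen only because it reproduces the 6/8 training accuracy on customers A to H.

```python
# (income in $K, response) for customers A to H in the table above
CUSTOMERS = [
    (35, "No"), (65, "Yes"), (85, "Yes"), (40, "No"),
    (95, "Yes"), (55, "Yes"), (45, "No"), (70, "Yes"),
]

def stump_accuracy(rows, threshold):
    """Accuracy of the one-rule model: predict Yes when income >= threshold."""
    correct = sum(("Yes" if income >= threshold else "No") == label
                  for income, label in rows)
    return correct / len(rows)

# Hypothetical threshold: customers B (65K) and F (55K) are misclassified
print(stump_accuracy(CUSTOMERS, 70))
```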
Solutions for overfitting: pruning, cross-validation, ensemble methods, more training data
Goal: Balance complexity and generalization for optimal performance
Solutions for underfitting: deeper trees, more features, feature engineering, reduced regularization
The Goldilocks Principle: Your model should be complex enough to capture important patterns but simple enough to generalize to new data. Random Forests help achieve this balance through ensemble averaging.