Today's journey: We'll start where we left off — a single yes/no rule — and build all the way up to a Random Forest that can handle thousands of messy real-world features.
Roadmap: Binary Split → OneR Model → Decision Tree → Bagging → Random Forest
James Cook University · College of Science and Engineering
Section 1
Why Random Forests?
Motivation, strengths, and where they fit in your ML toolkit
Section 1 · Why RF?
What Is a Random Forest?
Introduced by Leo Breiman in 1999, Random Forests became the go-to method for tabular data throughout the 2000s.
Core idea
Build many decision trees independently
Each tree sees a random subset of rows (bootstrapping) and features
Combine predictions by averaging (regression) or majority vote (classification)
Why it works
Individual trees have high variance but low bias
Averaging cancels out each tree's random errors
Result: a model with lower variance and strong accuracy
Fun fact: Jeremy Howard's early Kaggle wins earned him the nickname "Mr. Random Forests" — the method is that reliable.
Section 1 · Why RF?
Random Forests vs. Logistic Regression
Logistic regression is often called the "simple baseline" — but it's surprisingly fragile in practice.
Random Forests handle non-linearities and interactions automatically
Aspect | Logistic Regression | Random Forest
Outliers | Can collapse the model | Largely ignored; trees split on thresholds
Missing values | Must impute carefully | Forgiving; median fill usually sufficient
Failure rate | High; easy to misuse | Rare in practice
Key insight: Logistic regression is only simple if you do everything right. One slip in preprocessing and the whole model collapses. Random Forests are resilient — they learn the complexity instead of requiring you to engineer it in manually.
Section 2
Data Preparation
Getting the Titanic dataset ready — and why trees need less preprocessing than you think
Section 2 · Data Prep
FastAI Setup & Titanic Preprocessing
One import brings in NumPy, pandas, and matplotlib automatically:
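from fastai.imports import *
# Assumes the Kaggle Titanic files (train.csv, test.csv) have already been
# downloaded into ./titanic; adjust the path to wherever you saved them.
path = Path('titanic')
df = pd.read_csv(path/'train.csv')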
Essential preprocessing steps
Fill missing values with the median (numeric) or mode (categorical) so the trees never see NaN
Log-transform Fare → log_fare = np.log1p(Fare) for a nicer distribution in plots
Convert Sex and Embarked to category dtype
Separate features into categorical and continuous lists
Remember: Trees split on thresholds, so only the ordering of values matters, not their scale. You don't need to normalise or standardise continuous features.
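The preprocessing steps above might look like the following sketch. The five rows are made up for illustration (they are not real passenger records), and the column names assume the Kaggle Titanic schema; the names cats and conts are just illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the Titanic columns; not real passenger data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Age": [22.0, 38.0, np.nan, 35.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 0.0, 21.08],
    "Pclass": [3, 1, 3, 1, 3],
})

# Fill gaps: median for the numeric column, most common value for the categorical one.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# log1p copes with zero fares because log1p(0) == 0.
df["log_fare"] = np.log1p(df["Fare"])

# Nominal columns become pandas categories; Pclass stays numeric.
cats = ["Sex", "Embarked"]
conts = ["Age", "log_fare", "Pclass"]
for col in cats:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Note that nothing here rescales Age or log_fare; the log transform is only for nicer plots, as the slide says.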
Section 2 · Data Prep
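Pandas category dtype & cat.codes
Converting a string column to category is a critical step: pandas stores each distinct label once and exposes an integer code for every row through .cat.codes.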
Leave Pclass (1st, 2nd, 3rd) as a numeric column. Trees can learn thresholds like "Pclass < 2.5" directly — converting it forces the model to treat each class as unordered.
Rule of thumb: Use category for nominal variables (Sex, Embarked). Keep ordinal or naturally numeric variables as integers or floats.
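A minimal sketch of what the category dtype stores, using a made-up four-row column. The variable name sex is illustrative only.

```python
import pandas as pd

# A tiny, made-up column of string labels.
sex = pd.Series(["male", "female", "female", "male"]).astype("category")

# The vocabulary of labels, stored once and sorted.
print(list(sex.cat.categories))

# The integer code for each row; this is what a tree actually splits on.
print(list(sex.cat.codes))
```

Because the categories are sorted, "female" gets code 0 and "male" gets code 1, so a split such as "code <= 0.5" separates the two groups.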
Section 3
Binary Splits & The OneR Model
The foundation of every tree — one rule at a time
Section 3 · Binary Splits
What Is a Binary Split?
A binary split partitions all rows into exactly two groups based on a single rule:
e.g., Sex == "male" → Group Left | Group Right
The Titanic Sex split — a very strong rule
Group | Survival Rate | Count
Female | ≈ 75% | ~314
Male | ≈ 20% | ~577
With just one yes/no question, we can predict survival with reasonable accuracy. This is the entire basis of decision trees — we just keep asking more questions.
Ask one yes/no question → Split rows into two groups → Predict the group's mean
Section 3 · Binary Splits
Scoring a Split — The Variance Method
How do we know if a split is good? We measure how "pure" each resulting group is using its standard deviation:
score_side = σ(y_side) × |y_side|
split_score = (score_left + score_right) / |y|
Interpretation
σ (std dev) → how mixed the labels are in that group
× group size → weigh larger groups more heavily
Lower score = better split (tighter, more uniform groups)
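The scoring rule above can be sketched in a few lines of NumPy. The function names and the toy arrays are assumptions made for illustration; giving a side with one row or fewer a score of zero is a convenience choice, not part of the formula.

```python
import numpy as np

def side_score(mask, y):
    """Std dev of the targets on one side, multiplied by that side's row count."""
    n = mask.sum()
    if n <= 1:  # a side with 0 or 1 rows has no spread to measure
        return 0.0
    return y[mask].std() * n

def split_score(col, y, threshold):
    """Size-weighted average std dev of the two groups made by col <= threshold."""
    left = col <= threshold
    return (side_score(left, y) + side_score(~left, y)) / len(y)

# Toy data (made up): x separates y perfectly, noise does not.
y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([0, 0, 0, 1, 1, 1])
noise = np.array([0, 1, 0, 1, 0, 1])

clean = split_score(x, y, 0.5)
mixed = split_score(noise, y, 0.5)
print(clean, mixed)
```

The perfect split scores 0 because both groups are uniform, while the uninformative column leaves both groups mixed and scores higher.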
Two splits compared
Split | MAE
Sex == female | 0.215 ✓ better
log_fare > 2.7 | 0.333
Picking thresholds by eye is unreliable — scoring finds the best one automatically.
Section 3 · Binary Splits
The OneR Model
OneR = "One Rule" — find the single best binary split across all features and stop there.
Result on Titanic: "Predict survived = 1 if Sex == female" → MAE ≈ 0.215 (21.5% error)
Why start so simple?
A 1993 study by Robert Holte found OneR was often only slightly less accurate than much more complex models
It creates a baseline to beat — if your complex model doesn't clearly improve on this, something's wrong
It's transparent and explainable — you can show a business stakeholder exactly what the model does
Always build a simple baseline first. If your deep model barely beats a single yes/no rule, that's important information about your data.
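A rough sketch of OneR as described on this slide: score every candidate threshold for every feature and keep the single best one. The helper names, the eight-row frame, and the sex_code and age columns are all made up for illustration.

```python
import numpy as np
import pandas as pd

def split_score(col, y, threshold):
    """Size-weighted std dev of the two groups made by col <= threshold."""
    left = col <= threshold
    total = 0.0
    for mask in (left, ~left):
        if mask.sum() > 1:
            total += y[mask].std() * mask.sum()
    return total / len(y)

def best_split(col, y):
    """Try every distinct value (except the largest) as a threshold."""
    candidates = np.unique(col)[:-1]  # the largest value would leave one side empty
    scores = [split_score(col, y, t) for t in candidates]
    i = int(np.argmin(scores))
    return candidates[i], scores[i]

def one_r(X, y):
    """Return (feature, threshold, score) of the single best binary split."""
    results = {name: best_split(X[name].to_numpy(), y) for name in X.columns}
    feature = min(results, key=lambda name: results[name][1])
    threshold, score = results[feature]
    return feature, threshold, score

# Hypothetical toy data: sex_code is 0 for female, 1 for male.
X = pd.DataFrame({
    "sex_code": [0, 0, 0, 0, 1, 1, 1, 1],
    "age": [30, 5, 40, 22, 35, 8, 50, 27],
})
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

feature, threshold, score = one_r(X, y)
print(feature, threshold, round(score, 3))
```

On this toy frame the rule lands on sex_code, mirroring the Titanic result, but only because the made-up labels were chosen that way.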
Knowledge Checkpoint · Quiz 1 of 5
When scoring a binary split, a lower split score is better because it indicates…
A Larger group sizes on each side of the split
B A higher mean value of the target variable
C Each group has tightly clustered, more uniform target values
D The split uses fewer features
Section 4
Decision Trees
Recursively stacking binary splits until we have a powerful (but interpretable) model
Section 4 · Decision Trees
From One Split to a Full Tree
A decision tree is nothing more than recursively applied binary splits. After the first split (Sex), we split each group again:
Female branch
Best next split: Pclass == 1
Female & 1st class → ≈ 97% survive
Female & other class → ≈ 39% survive
Male branch
Best next split: Age < 6
Male & young child → ≈ 46% survive
Male & adult → ≈ 16% survive
After two levels of splitting we have 4 leaf nodes — a proper decision tree.
Leaf node: A terminal group where we stop splitting. The prediction for any row landing in that leaf is the group's mean (or majority class).
Section 4 · Decision Trees
Building a Tree with scikit-learn
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Cap the tree at four leaves so the plot stays readable
tree = DecisionTreeClassifier(
    max_leaf_nodes=4,
    random_state=42
).fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(12, 5))
# list(...) because some scikit-learn versions reject a pandas Index here
plot_tree(tree, feature_names=list(X_train.columns),
          class_names=['Died', 'Survived'], filled=True, ax=ax)
plt.show()
Key parameter: max_leaf_nodes
Caps the number of leaves, and with it the tree's capacity: the main knob for the bias-variance tradeoff
Start small (4–8 leaves) to understand the model, then increase
Alternative: min_samples_leaf=50 — each leaf must have ≥ 50 rows
Section 4 · Decision Trees
Interpreting the 4-Leaf Tree
Leaf | Rule | Survived / Total | Insight
L1 | Female & 1st class | 116 / 120 | "Rich women lived"
L2 | Female & not 1st class | 73 / 186 | Mixed outcomes
L3 | Male & Age < 6 | 24 / 52 | "Young boys had a chance"
L4 | Male & Age ≥ 6 | 68 / 418 | "Adult men mostly perished"
This four-leaf tree is fully human-readable. You can explain every prediction. This interpretability is a major advantage of tree-based models.
Section 4 · Decision Trees
Gini Impurity — scikit-learn's Split Criterion
scikit-learn uses Gini impurity to score splits (instead of our variance method). For a binary target:
Gini = 2 × p × (1 − p)
where p is the proportion of class-1 in the node.
Key values
Pure node (all same class): Gini = 0
50/50 mix: Gini = 0.5 (maximum)
How the tree grows
At each node, try every possible split for every feature. Choose whichever split minimises the weighted average Gini of the two resulting children.
Gini vs variance scoring: Both measure purity; Gini is scikit-learn's default criterion for classification. For regression trees, mean squared error (MSE) is used instead.
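As a worked example of the formula, the sketch below computes the Gini impurity of the female branch before and after its split, using the leaf counts from the four-leaf table earlier in this section (116 / 120 and 73 / 186). The function names are illustrative.

```python
def gini(p):
    """Gini impurity of a binary node where p is the share of class 1."""
    return 2 * p * (1 - p)

def weighted_gini(left_pos, left_n, right_pos, right_n):
    """Average child impurity, weighted by the number of rows in each child."""
    total = left_n + right_n
    left = left_n * gini(left_pos / left_n)
    right = right_n * gini(right_pos / right_n)
    return (left + right) / total

# Female branch: 116 of 120 first-class and 73 of 186 other-class passengers survived.
before = gini((116 + 73) / (120 + 186))
after = weighted_gini(116, 120, 73, 186)
print(f"before split: {before:.3f}, after split: {after:.3f}")
```

The weighted impurity after the split is lower than before it, which is why the tree accepts this split.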
Section 4 · Decision Trees
How Well Does a Single Tree Perform?
Model | MAE (validation) | Notes
OneR (Sex split) | 0.215 | Our baseline
Decision tree (4 leaves) | 0.224 | Slightly worse on small data
Decision tree (min_samples_leaf=50) | 0.183 | Best so far
Why can one rule sometimes beat a tree?
With only ~890 training rows, deeper splits create tiny, noisy leaf nodes
Small leaves → unstable predictions → higher variance
This is the core problem that Random Forests solve
Note: On larger real-world datasets, a deeper tree with min_samples_leaf=50 typically outperforms OneR convincingly.
Knowledge Checkpoint · Quiz 2 of 5
A decision tree node has Gini impurity = 0. What does this tell you?
A The node has equal numbers of each class — a 50/50 split
B All samples in that node belong to the same class — a pure leaf
Tiny leaves give unstable, high-variance predictions
Accuracy plateaus; adding more depth eventually overfits
Leo Breiman's insight — Bagging
What if we built many trees and averaged them? Errors that are random will cancel out; only the signal remains.
Three conditions for bagging to work:
Each model should be reasonably good individually (low bias)
Errors between models should be uncorrelated (different data/features)
Average over many models — positive and negative errors cancel
Bagging = Bootstrap Aggregating
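The cancelling of uncorrelated errors can be checked with a small simulation. This is an idealised sketch: each "model" is simply the true value plus independent zero-mean noise, which real trees only approximate, and all the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.4
n_trials = 10_000

def spread_of_average(n_models):
    """Std dev of the averaged prediction across many repeated trials."""
    # Each column is one model: the truth plus its own independent error.
    preds = true_value + rng.normal(0.0, 0.2, size=(n_trials, n_models))
    return preds.mean(axis=1).std()

single = spread_of_average(1)
bagged = spread_of_average(100)
print(f"one model: {single:.3f}, average of 100: {bagged:.3f}")
```

With independent errors the spread shrinks roughly with the square root of the number of models; correlated errors would shrink far less, which is why the second condition matters.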
Section 5 · Random Forests
Building a Random Forest — Step by Step
A Random Forest is bagged decision trees with an extra twist: random feature subsets at each split.
Step 1 — Bootstrap: Sample k% of training rows with replacement → creates a unique sub-dataset for this tree
Step 2 — Random features: At each split, only consider a random subset of columns (typically √p features)
Step 3 — Grow tree: Build a deep, largely unpruned decision tree on this sample (don't prune aggressively; averaging handles the variance)
Step 4 — Repeat: Build N trees (e.g., 100) — each sees different rows and different features
Final prediction: Average (regression) or majority vote (classification) across all N trees
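The four steps can be sketched by hand with scikit-learn's DecisionTreeClassifier. This is an illustrative toy, not how RandomForestClassifier is implemented internally; the synthetic dataset, the 75% sample fraction, and the helper names fit_forest and predict_forest are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, sample_frac=0.75, seed=0):
    """Steps 1-4: bootstrap rows, restrict features per split, grow, repeat."""
    rng = np.random.default_rng(seed)
    n_rows = len(X)
    trees = []
    for _ in range(n_trees):
        # Step 1: sample rows with replacement.
        idx = rng.choice(n_rows, size=int(n_rows * sample_frac), replace=True)
        # Step 2: max_features="sqrt" considers a random feature subset at each split.
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            min_samples_leaf=5,
            random_state=int(rng.integers(1_000_000)),
        )
        # Step 3: grow the tree on this bootstrap sample.
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Final prediction: average the per-tree probabilities, then threshold."""
    probs = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
    return (probs > 0.5).astype(int)

# Synthetic stand-in data, split by position into train and validation parts.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, y_train, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

trees = fit_forest(X_train, y_train)
accuracy = (predict_forest(trees, X_val) == y_val).mean()
print(f"validation accuracy: {accuracy:.3f}")
```

Averaging probabilities rather than counting hard votes is a design choice here; both are forms of the "combine" step.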
Section 5 · Random Forests
Building a Random Forest in scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees
    min_samples_leaf=5,   # minimum rows per leaf
    max_features=0.5,     # random subset of features per split
    oob_score=True,       # built-in validation (explained next)
    n_jobs=-1,            # use all CPU cores
    random_state=42
).fit(X_train, y_train)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
print(f"Validation MAE: {mean_absolute_error(y_val, rf.predict(X_val)):.3f}")
Results on Titanic
Model | MAE | Comment
Single tree (min_leaf=50) | 0.183 | Previous best
Random forest (100 trees) | ≈ 0.185 | Similar; Titanic is tiny
Don't be fooled by the Titanic result. On datasets with thousands of rows and hundreds of features, Random Forests typically win convincingly. Titanic (~890 rows) is too small to show the full benefit.
Section 5 · Random Forests
How Many Trees Do You Need?
More trees rarely hurt: error generally falls and then levels off as you add trees, apart from small random fluctuations.
Practical guidance
50–100 trees usually reach the plateau of diminishing returns
Beyond that, you're paying in inference time for tiny gains
Start with 100, check your OOB error curve, then decide
Can it overfit?
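No: adding more trees does not make a forest overfit; test error levels off rather than climbing
Risk comes mainly from very deep trees averaged over too few estimators: raise n_estimators, or raise min_samples_leaf to make each tree smoother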
RFs are remarkably forgiving of hyperparameter choices
RF resilience: Thousands of irrelevant features? Fine — trees ignore them. Non-linear interactions? Learned automatically. Outliers? Thresholds make them irrelevant. Normalisation? Not needed.
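One way to "check your OOB error curve" is to refit the forest at a few sizes and record the out-of-bag error each time. The sketch below uses a synthetic dataset as a stand-in, and the list of tree counts is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; swap in your own X and y.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

oob_error = {}
for n_trees in [25, 50, 100, 200]:
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        min_samples_leaf=5,
        oob_score=True,      # score each row using only trees that never saw it
        random_state=42,
    ).fit(X, y)
    oob_error[n_trees] = 1 - rf.oob_score_

for n_trees, err in oob_error.items():
    print(f"{n_trees:>4} trees: OOB error {err:.3f}")
```

If the error has stopped moving between the last two sizes, adding more trees mostly costs inference time.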
Knowledge Checkpoint · Quiz 3 of 5
Random Forests reduce the variance of single decision trees primarily through…
A Pruning branches that don't improve accuracy by more than 1%
B Averaging predictions from many trees trained on different random subsets of data and features
C Using Gini impurity instead of variance to score splits
D Training on the full dataset with regularisation applied to each tree
Section 6
Interpretability
Feature importance, OOB validation, partial dependence, and explaining individual predictions
Section 6 · Interpretability
Out-of-Bag (OOB) Error — Free Validation
Because each tree sees only a bootstrap sample, roughly 37% (about 1/e) of rows are left out of any given tree when the sample is as large as the training set. These are the "out-of-bag" rows.
1. For each row, collect predictions from all trees that did not use it in training
2. Average those predictions → OOB prediction for that row
Why this matters: OOB error is a built-in, approximately unbiased estimate of generalisation error, with no separate validation set required. Enable it with RandomForestClassifier(oob_score=True).
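The share of rows left out of a bootstrap sample can be checked directly. This sketch assumes each tree draws as many rows as the training set has (891 here, roughly the size of the Titanic training file) and simply counts which row indices never get drawn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 891    # roughly the size of the Titanic training set
n_trees = 200

left_out = []
for _ in range(n_trees):
    # One bootstrap sample: n_rows draws with replacement.
    sample = rng.integers(0, n_rows, size=n_rows)
    left_out.append(1 - len(np.unique(sample)) / n_rows)

simulated = float(np.mean(left_out))
theoretical = (1 - 1 / n_rows) ** n_rows  # chance that a given row is never drawn
print(f"simulated: {simulated:.3f}, theoretical: {theoretical:.3f}")
```

Both numbers sit close to 1/e, so each row is out-of-bag for a bit more than a third of the trees.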
Section 6 · Interpretability
Feature Importance
For every split in every tree, track how much the Gini impurity improved × how many rows were in that node. Sum across all trees → feature importance score.
Real-world power: A credit scoring team had 7,000 raw database columns. A Random Forest reduced this to ~30 key variables in under 2 hours — matching a multi-million-dollar consulting study.
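In scikit-learn the impurity-based scores described above are exposed as the fitted attribute feature_importances_. The sketch below uses synthetic data in which only one made-up column (f2) carries signal, to show that it rises to the top.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Five made-up features; only f2 determines the label.
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"f{i}" for i in range(5)])
y = (X["f2"] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=5, random_state=42
).fit(X, y)

# Importances are normalised to sum to 1 across features.
importance = pd.Series(rf.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Sorting and keeping only the top few features is the basic move behind the column-reduction story on this slide.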
Section 6 · Interpretability
Partial Dependence Plots (PDPs)
Feature importance tells you what matters. PDPs tell you how a feature affects the prediction, all else held equal.
How to compute a PDP for one feature
Take the validation set
Set the feature of interest to a fixed value (e.g., Age = 5) for all rows
Run the forest and average all predictions → one point on the curve
Repeat for every candidate value across the feature's range → full curve
Key insight: Because every other feature keeps its observed values while only this one is changed, the curve shows the model's average response to that one variable, largely separated from correlated features like passenger class and fare.
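The four steps can be written as a short loop. This is a hand-rolled sketch on made-up data (the age and fare columns and the labelling rule are invented); scikit-learn also ships its own partial dependence tools in sklearn.inspection.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Made-up data: younger rows are more likely to be class 1; fare is irrelevant.
X = pd.DataFrame({
    "age": rng.uniform(0, 80, 600),
    "fare": rng.uniform(0, 100, 600),
})
y = (X["age"] + rng.normal(0, 10, 600) < 30).astype(int)

rf = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=5, random_state=42
).fit(X, y)

def partial_dependence_curve(model, data, feature, grid):
    """For each grid value: overwrite the feature everywhere, predict, average."""
    curve = []
    for value in grid:
        modified = data.copy()
        modified[feature] = value  # every row now has the same value
        curve.append(model.predict_proba(modified)[:, 1].mean())
    return np.array(curve)

grid = np.linspace(0, 80, 9)
curve = partial_dependence_curve(rf, X, "age", grid)
for value, prob in zip(grid, curve):
    print(f"age = {value:4.0f} → mean predicted probability {prob:.2f}")
```

In practice you would pass a held-out validation frame as data; the training frame is reused here only to keep the sketch short.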
Feature importance explains the model overall. But sometimes you need to explain why this specific passenger was predicted to die.
Tree-path contribution method
For one row, trace its path through every tree in the forest
At each split, record: which feature was used, and how much it changed the prediction
Sum contributions across all trees → per-feature Δ prediction for that row
Result: A waterfall chart showing e.g. "Sex ↑ +0.42, Pclass ↓ −0.15, Age ↑ +0.07…" — intuitive enough to explain to a non-technical stakeholder why a loan was declined.
The SHAP library provides a principled, model-agnostic implementation of this idea: pip install shap. SHAP values are the gold standard for single-prediction explanations.
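The tree-path method can be sketched directly on scikit-learn's fitted trees. To keep node values simple this example assumes a RandomForestRegressor trained on a 0/1 target, so every node stores the mean target of its rows; the data and the function name path_contributions are made up. It is not SHAP, which weights contributions differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Made-up data: the label depends on features 0 and 1; feature 2 is noise.
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

rf = RandomForestRegressor(
    n_estimators=50, min_samples_leaf=5, random_state=42
).fit(X, y)

def path_contributions(forest, row):
    """Return (bias, per-feature contributions) for one row, averaged over trees."""
    row = row.reshape(1, -1)
    bias = 0.0
    contrib = np.zeros(row.shape[1])
    for est in forest.estimators_:
        t = est.tree_
        values = t.value[:, 0, 0]               # mean target stored in each node
        path = est.decision_path(row).indices   # node ids from root to leaf
        bias += values[path[0]]                 # the root value is the starting point
        for parent, child in zip(path[:-1], path[1:]):
            # Credit the change in node mean to the feature the parent split on.
            contrib[t.feature[parent]] += values[child] - values[parent]
    n_trees = len(forest.estimators_)
    return bias / n_trees, contrib / n_trees

bias, contrib = path_contributions(rf, X[0])
prediction = rf.predict(X[:1])[0]
print(f"bias {bias:.3f} + contributions {np.round(contrib, 3)} = {prediction:.3f}")
```

The bias plus the contributions adds up to the forest's prediction for that row, which is what makes a waterfall chart possible.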
Knowledge Checkpoint · Quiz 4 of 5
What is the key advantage of Out-of-Bag (OOB) error over using a separate validation set?
A OOB error is always lower (more optimistic) than hold-out validation error
B OOB error can only be computed on classification tasks, not regression
C It gives an unbiased generalisation estimate without removing any rows from training
D OOB error measures how well the forest memorised the training data
Section 7
Practical Tips & Bigger Picture
When to use RF, how it compares to boosting, and the Kaggle mindset
Section 7 · Practical
Bagging vs. Boosting
Two fundamentally different ensemble strategies — know when to use each.
Aspect | Bagging (Random Forest) | Boosting (XGBoost / LightGBM)
Core idea | Train trees on random subsets simultaneously, then average | Train tiny trees sequentially, each correcting the previous
Tuning effort | Minimal; hard to break | Many hyperparams, can overfit
Typical accuracy | Strong baseline | Often higher with careful tuning
When to use | First model: fast, robust, interpretable | Second pass, once baseline is established
Recommended workflow: Start with a Random Forest to get a strong baseline and understand your features. Then try gradient boosting if you need to squeeze out more accuracy.
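The recommended workflow might be sketched as below. To stay within scikit-learn, GradientBoostingClassifier stands in for XGBoost or LightGBM (a substitution made for this example), and the dataset is synthetic, so the scores say nothing about which method wins on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; swap in your own X and y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    # Bagging: independent trees, averaged.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Boosting: small trees added one after another, each fitting the remaining error.
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=2, random_state=42
    ),
}

scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean CV accuracy {score:.3f}")
```

Using the same cross-validation folds for both models keeps the comparison fair before any tuning of the boosted model begins.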
Section 7 · Practical
Breiman's "Two Cultures" — A Final Thought
Leo Breiman (2001) argued there are two cultures in statistical modelling:
Data modelling culture
Assume a parametric form for the data (e.g., linear regression). Fit that form, explain coefficients. Prioritises interpretability of the assumed model.
Algorithmic culture
Let the algorithm find the structure. Prioritise predictive accuracy first — then explain the accurate model. Breiman championed this approach decades before "data science" existed.
Practical implication: A model that predicts poorly also explains poorly. Always get a good-fitting model first. A 60% accurate model that you can explain is worse than a 95% accurate model that you can also explain with SHAP.
Section 7 · Practical
The Kaggle Iteration Mindset
Jeremy Howard's approach to competition — applicable to every real project.
Reliable validation: Your local validation must mirror the hidden leaderboard. Without it, you're flying blind.
Rapid iteration: Models that train in seconds (RF on Titanic) let you try 50+ ideas. Slow models let you try 5.
# Predict on the Kaggle test set and write the submission file
test_pred = rf.predict(X_test)
sub = pd.DataFrame({
    'PassengerId': test_df.PassengerId,
    'Survived': test_pred
})
sub.to_csv('submission_rf.csv', index=False)
# → Upload to Kaggle. Repeat. Improve.
Knowledge Checkpoint · Quiz 5 of 5
You need to explain to a bank manager why a specific customer's loan application was rejected by your Random Forest model. Which tool is most appropriate?
A Feature importance plot — it shows the most influential features across all predictions
B A partial dependence plot for income — it shows how income affects predictions on average
C SHAP values for that specific customer — they quantify each feature's contribution to this individual prediction
D OOB error — it measures whether the model is generalising well overall
Week 7 · Summary
What We Covered Today
Theory
Binary splits → OneR → Decision Trees → Random Forests
Variance scoring & Gini impurity for split quality
Bagging: why averaging uncorrelated trees works
Bagging vs. Boosting trade-offs
Implementation
Pandas category dtype and cat.codes
DecisionTreeClassifier and RandomForestClassifier
OOB score for built-in validation
Feature importance & SHAP for explanations
Next Steps
Practical workshop: build and tune a Random Forest on Titanic in Colab
Submit to Kaggle — aim to beat your single-tree MAE
Try adding oob_score=True and inspecting feature importances
Key takeaway: Random Forests are the best "first serious model" for any tabular dataset. They're forgiving, interpretable, and surprisingly hard to break.
Workshop goal: Achieve MAE < 0.18 using a tuned Random Forest with feature importance analysis.