CP3501 / CP5701 · Week 7 · FastAI Lesson 6

Random Forests

From a single binary split to a powerful ensemble
Today's journey: We'll start where we left off — a single yes/no rule — and build all the way up to a Random Forest that can handle thousands of messy real-world features.
Roadmap
Binary Split
OneR Model
Decision Tree
Bagging
Random Forest
James Cook University · College of Science and Engineering
Section 1

Why Random Forests?

Motivation, strengths, and where they fit in your ML toolkit

Section 1 · Why RF?

What Is a Random Forest?

Introduced by Leo Breiman in 1999, Random Forests became the go-to method for tabular data throughout the 2000s.

Core idea

  • Build many decision trees independently
  • Each tree sees a random subset of rows (bootstrapping) and features
  • Combine predictions by averaging (regression) or majority vote (classification)

Why it works

  • Individual trees have high variance but low bias
  • Averaging cancels out each tree's random errors
  • Result: a model with lower variance and strong accuracy
Fun fact: Jeremy Howard's early Kaggle wins earned him the nickname "Mr. Random Forests" — the method is that reliable.

Random Forests vs. Logistic Regression

Logistic regression is often called the "simple baseline" — but it's surprisingly fragile in practice.

Aspect | Logistic Regression | Random Forest
Feature engineering | Requires careful transforms, interaction terms, outlier handling | Handles non-linearities and interactions automatically
Outliers | Can collapse the model | Largely ignored, since trees split on thresholds
Missing values | Must impute carefully | Forgiving; median fill usually sufficient
Failure rate | High (easy to misuse) | Rare in practice
Key insight: Logistic regression is only simple if you do everything right. One slip in preprocessing and the whole model collapses. Random Forests are resilient — they learn the complexity instead of requiring you to engineer it in manually.
Section 2

Data Preparation

Getting the Titanic dataset ready — and why trees need less preprocessing than you think

Section 2 · Data Prep

FastAI Setup & Titanic Preprocessing

One import brings in NumPy, pandas, and matplotlib automatically:

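from fastai.imports import *

# Load the Titanic training data.
# Assumes train.csv and test.csv from the Kaggle Titanic competition
# have already been downloaded into a local folder named 'titanic'.
path = Path('titanic')
df = pd.read_csv(path/'train.csv')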

Essential preprocessing steps

Remember: Trees split on thresholds, not values — so you don't need to normalise or standardise continuous features.
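
To see why scaling is unnecessary, the sketch below fits the same small tree on a raw feature and on its log transform and checks that the training predictions agree. The data is synthetic (not Titanic), and the names fare, raw_tree and log_tree are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a skewed feature such as a fare: distinct positive integers.
fare = rng.permutation(np.arange(1, 201)).astype(float)
# The label depends on a threshold, with about 10% of labels flipped as noise.
y = (fare > 80).astype(int)
flip = rng.random(fare.size) < 0.10
y[flip] = 1 - y[flip]

X_raw = fare.reshape(-1, 1)
X_log = np.log1p(fare).reshape(-1, 1)   # monotonic transform keeps the row ordering

raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y)
log_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_log, y)

# Both trees partition the training rows the same way, so their predictions agree.
same = np.array_equal(raw_tree.predict(X_raw), log_tree.predict(X_log))
print("Identical training predictions:", same)
```

Only the ordering of values matters to a split, which is why any order-preserving rescaling leaves the learned partition unchanged.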

Pandas category dtype & cat.codes

Converting a string column to category is a critical step:

df['Sex'] = df['Sex'].astype('category')

# Inspect what pandas stores internally:
df['Sex'].cat.categories   # → Index(['female', 'male'])
df['Sex'].cat.codes        # → 0 = female, 1 = male

What this gives you

  • Human-readable labels are preserved
  • Internally stored as compact integers
  • No one-hot encoding needed for tree models
  • Faster computation on large datasets

When NOT to use category

Leave Pclass (1st, 2nd, 3rd) as a numeric column. Trees can learn thresholds like "Pclass < 2.5" directly — converting it forces the model to treat each class as unordered.

Rule of thumb: Use category for nominal variables (Sex, Embarked). Keep ordinal or naturally numeric variables as integers or floats.
Section 3

Binary Splits & The OneR Model

The foundation of every tree — one rule at a time

Section 3 · Binary Splits

What Is a Binary Split?

A binary split partitions all rows into exactly two groups based on a single rule:

e.g.,    Sex == "male"    →   Group Left   |   Group Right

The Titanic Sex split — a very strong rule

Group | Survival Rate | Count
Female | ≈ 75% | ~314
Male | ≈ 20% | ~577

With just one yes/no question, we can predict survival with reasonable accuracy. This is the entire basis of decision trees — we just keep asking more questions.

Ask one yes/no question
Split rows into two groups
Predict the group's mean

Scoring a Split — The Variance Method

How do we know if a split is good? We measure how "pure" each resulting group is using its standard deviation:

score_side = σ(y_side) × |y_side|

split_score = (score_left + score_right) / |y|

Interpretation

  • σ (std dev) → how mixed the labels are in that group
  • × group size → weigh larger groups more heavily
  • Lower score = better split (tighter, more uniform groups)
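
A minimal sketch of the scoring formula above, assuming NumPy arrays for the feature column and the labels. The function names side_score and split_score and the toy arrays are illustrative, not part of any library.

```python
import numpy as np

def side_score(mask, y):
    """Std dev of the labels on one side, weighted by that side's row count."""
    n = mask.sum()
    if n <= 1:            # an empty or single-row side has no spread to measure
        return 0.0
    return y[mask].std() * n

def split_score(col, y, threshold):
    """Size-weighted average spread of the two groups produced by col <= threshold."""
    left = col <= threshold
    return (side_score(left, y) + side_score(~left, y)) / len(y)

# Toy data (hypothetical): the feature takes the values 0 and 1.
col = np.array([0, 0, 0, 1, 1, 1])
y_pure = np.array([0, 0, 0, 1, 1, 1])    # labels follow the feature exactly
y_mixed = np.array([0, 1, 0, 1, 0, 1])   # labels unrelated to the feature

print("pure split: ", split_score(col, y_pure, 0.5))
print("mixed split:", split_score(col, y_mixed, 0.5))
```

The pure split scores zero because both groups are uniform; the mixed split scores higher because each side still contains both labels.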

Two splits compared

Split | MAE
Sex == female | 0.215 ✓ better
log_fare > 2.7 | 0.333

Picking thresholds by eye is unreliable — scoring finds the best one automatically.


The OneR Model

OneR = "One Rule" — find the single best binary split across all features and stop there.

Result on Titanic: "Predict survived = 1 if Sex == female" → MAE ≈ 0.215 (21.5% error)
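
One way to sketch the OneR search, assuming every feature has already been converted to numbers: try every observed value of every column as a threshold and keep the lowest-scoring split. The helper one_r and the toy table are hypothetical and only stand in for the real Titanic data.

```python
import numpy as np
import pandas as pd

def split_score(col, y, threshold):
    """Size-weighted std dev of the two groups produced by col <= threshold."""
    left = col <= threshold
    total = 0.0
    for side in (left, ~left):
        if side.sum() > 1:
            total += y[side].std() * side.sum()
    return total / len(y)

def one_r(df, target):
    """Return (score, feature, threshold) for the best single binary split."""
    y = df[target].to_numpy()
    best = None
    for name in df.columns.drop(target):
        col = df[name].to_numpy()
        # Every observed value except the largest is a candidate threshold.
        for threshold in np.unique(col)[:-1]:
            candidate = (split_score(col, y, threshold), name, threshold)
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best

# Hypothetical toy table; is_female is an integer code for a Sex-like column.
toy = pd.DataFrame({
    "is_female": [1, 1, 1, 1, 0, 0, 0, 0],
    "pclass":    [1, 3, 2, 3, 1, 3, 2, 3],
    "survived":  [1, 1, 1, 1, 0, 0, 0, 0],
})
score, feature, threshold = one_r(toy, "survived")
print(feature, threshold, round(score, 3))
```

On this toy table the search picks is_female, because that column separates the labels perfectly while every pclass threshold leaves both groups mixed.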

Why start so simple?

Always build a simple baseline first. If your deep model barely beats a single yes/no rule, that's important information about your data.
Knowledge Checkpoint · Quiz 1 of 5
When scoring a binary split, a lower split score is better because it indicates…
Section 4

Decision Trees

Recursively stacking binary splits until we have a powerful (but interpretable) model

Section 4 · Decision Trees

From One Split to a Full Tree

A decision tree is nothing more than recursively applied binary splits. After the first split (Sex), we split each group again:

Female branch

Best next split: Pclass == 1

  • Female & 1st class → ≈ 97% survive
  • Female & other class → ≈ 39% survive

Male branch

Best next split: Age < 6

  • Male & young child → ≈ 46% survive
  • Male & adult → ≈ 16% survive

After two levels of splitting we have 4 leaf nodes — a proper decision tree.

Leaf node: A terminal group where we stop splitting. The prediction for any row landing in that leaf is the group's mean (or majority class).
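
Written out by hand, the two-level tree above is just nested if statements. In this sketch the function name predict_survival is hypothetical, and the returned values are the approximate survival rates quoted for each leaf.

```python
def predict_survival(sex, pclass, age):
    """Two-level tree: each return value is the survival rate quoted for that leaf."""
    if sex == "female":
        if pclass == 1:
            return 0.97   # female, 1st class
        return 0.39       # female, other classes
    if age < 6:
        return 0.46       # male, young child
    return 0.16           # male, age 6 or older

print(predict_survival("female", 1, 29))
print(predict_survival("male", 3, 40))
```

Each row follows exactly one path, so every prediction can be read off as a short chain of yes/no answers.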

Building a Tree with scikit-learn

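from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# X_train (DataFrame of features) and y_train (0/1 survival labels)
# are assumed to come from the preprocessing steps above.
tree = DecisionTreeClassifier(
    max_leaf_nodes=4,   # stop growing once the tree has 4 leaves
    random_state=42
).fit(X_train, y_train)

# Draw the fitted tree; list(...) passes feature_names as a plain list,
# which every scikit-learn version accepts.
fig, ax = plt.subplots(figsize=(12, 5))
plot_tree(tree, feature_names=list(X_train.columns),
          class_names=['Died','Survived'], filled=True, ax=ax)
plt.show()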

Key parameter: max_leaf_nodes caps the number of leaves, so max_leaf_nodes=4 yields at most four terminal groups.


Interpreting the 4-Leaf Tree

Leaf | Rule | Survived / Total | Insight
L1 | Female & 1st class | 116 / 120 | "Rich women lived"
L2 | Female & not 1st class | 73 / 186 | Mixed outcomes
L3 | Male & Age < 6 | 24 / 52 | "Young boys had a chance"
L4 | Male & Age ≥ 6 | 68 / 418 | "Adult men mostly perished"
This four-leaf tree is fully human-readable. You can explain every prediction. This interpretability is a major advantage of tree-based models.

Gini Impurity — scikit-learn's Split Criterion

scikit-learn uses Gini impurity to score splits (instead of our variance method). For a binary target:

Gini = 2 × p × (1 − p)

where p is the proportion of class-1 in the node.

Key values

  • Pure node (all same class): Gini = 0
  • 50/50 mix: Gini = 0.5 (maximum)

How the tree grows

At each node, try every possible split for every feature. Choose whichever split minimises the weighted average Gini of the two resulting children.
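
The Gini formula and the weighted average over two children can be sketched in a few lines. The function names and the row counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def gini(p):
    """Binary Gini impurity for a node whose class-1 proportion is p."""
    return 2 * p * (1 - p)

def weighted_gini(pos_left, n_left, pos_right, n_right):
    """Average Gini of two child nodes, weighted by their row counts."""
    n = n_left + n_right
    return (n_left * gini(pos_left / n_left)
            + n_right * gini(pos_right / n_right)) / n

print(gini(0.0), gini(0.5), gini(1.0))

# Hypothetical counts: a parent with 50 positives out of 100 rows,
# split into children holding 40 of 50 and 10 of 50 positives.
parent = gini(50 / 100)
children = weighted_gini(40, 50, 10, 50)
print("parent:", parent, "children:", children)
```

A split is worth making when the weighted impurity of the children is lower than the impurity of the parent, as it is here.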

Gini vs variance scoring: Both measure purity; Gini is preferred in scikit-learn for classification. For regression trees, mean squared error (MSE) is used instead.

How Well Does a Single Tree Perform?

Model | MAE (validation) | Notes
OneR (Sex split) | 0.215 | Our baseline
Decision tree (4 leaves) | 0.224 | Slightly worse on small data
Decision tree (min_samples_leaf=50) | 0.183 | Best so far

Why can one rule sometimes beat a tree?

Note: On larger real-world datasets, a deeper tree with min_samples_leaf=50 typically outperforms OneR convincingly.
Knowledge Checkpoint · Quiz 2 of 5
A decision tree node has Gini impurity = 0. What does this tell you?
Section 5

Random Forests

Bagging many trees to get the best of all worlds

Section 5 · Random Forests

The Problem with a Single Tree

Leo Breiman's insight — Bagging

What if we built many trees and averaged them? Errors that are random will cancel out; only the signal remains.

Three conditions for bagging to work:
  1. Each model should be reasonably good individually (low bias)
  2. Errors between models should be uncorrelated (different data/features)
  3. Average over many models — positive and negative errors cancel
Bagging = Bootstrap Aggregating
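
The second and third conditions can be illustrated with a small simulation, offered as a sketch rather than a proof. Every number here (the true value, the noise level, the 0.9/0.1 mix) is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_models = 10_000, 100
truth = 0.3   # hypothetical true value every model is trying to predict

# Independent errors: each model adds its own unbiased noise.
independent = truth + rng.normal(0.0, 0.2, size=(n_trials, n_models))

# Correlated errors: all models in a trial share most of their noise.
shared = rng.normal(0.0, 0.2, size=(n_trials, 1))
own = rng.normal(0.0, 0.2, size=(n_trials, n_models))
correlated = truth + 0.9 * shared + 0.1 * own

print("std of a single model:          ", independent[:, 0].std())
print("std of the mean, independent:   ", independent.mean(axis=1).std())
print("std of the mean, correlated:    ", correlated.mean(axis=1).std())
```

Averaging 100 independent models shrinks the spread by roughly a factor of ten, while averaging models that share their errors barely helps. This is the reason each tree is given different rows and different features.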

Building a Random Forest — Step by Step

A Random Forest is bagged decision trees with an extra twist: random feature subsets at each split.

Step 1 — Bootstrap: Sample k% of training rows with replacement → creates a unique sub-dataset for this tree
Step 2 — Random features: At each split, only consider a random subset of columns (typically √p features)
Step 3 — Grow tree: Build a deep decision tree on this subset (don't prune aggressively)
Step 4 — Repeat: Build N trees (e.g., 100) — each sees different rows and different features
Final prediction: Average (regression) or majority vote (classification) across all N trees
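
The steps above can be sketched by hand with scikit-learn's single-tree class. This is an illustration on synthetic data, not how RandomForestClassifier is implemented internally; the dataset sizes, the tree count and min_samples_leaf=5 are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real table.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 100
trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample of row indices, drawn with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2 and 3: max_features="sqrt" makes the tree consider a random
    # subset of about sqrt(p) columns at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", min_samples_leaf=5,
                                  random_state=i)
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Final prediction: average the per-tree probabilities, then threshold at 0.5.
probs = np.mean([t.predict_proba(X_val)[:, 1] for t in trees], axis=0)
forest_acc = np.mean((probs > 0.5) == y_val)
single_acc = np.mean(trees[0].predict(X_val) == y_val)
print(f"single tree: {single_acc:.3f}   forest of {n_trees}: {forest_acc:.3f}")
```

Averaging probabilities is a soft form of majority voting and is also what scikit-learn's own forest does for classification.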

Building a Random Forest in scikit-learn

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

# X_train / y_train and X_val / y_val are assumed to be the
# preprocessed Titanic training and validation splits.
rf = RandomForestClassifier(
    n_estimators=100,       # number of trees
    min_samples_leaf=5,     # minimum rows per leaf
    max_features=0.5,       # random subset of features per split
    oob_score=True,         # built-in validation (explained next)
    n_jobs=-1,              # use all CPU cores
    random_state=42
).fit(X_train, y_train)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
print(f"Validation MAE: {mean_absolute_error(y_val, rf.predict(X_val)):.3f}")

Results on Titanic

Model | MAE | Comment
Single tree (min_samples_leaf=50) | 0.183 | Previous best
Random forest (100 trees) | ≈ 0.185 | Similar; Titanic is tiny
Don't be fooled by the Titanic result. On datasets with thousands of rows and hundreds of features, Random Forests typically win convincingly. Titanic (~890 rows) is too small to show the full benefit.

How Many Trees Do You Need?

More trees almost never hurt: error falls quickly at first and then flattens out as trees are added, apart from small random fluctuations.

Practical guidance

  • 50–100 trees usually reach the plateau of diminishing returns
  • Beyond that, you're paying in inference time for tiny gains
  • Start with 100, check your OOB error curve, then decide
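
One way to inspect the OOB error curve is to grow a single forest incrementally with warm_start=True, which keeps the existing trees and only adds new ones on each fit. The synthetic dataset and the tree counts tried below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the tree counts to try are arbitrary.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# warm_start=True reuses the trees already grown each time
# n_estimators is raised and fit() is called again.
rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            min_samples_leaf=5, random_state=0)

oob_error = {}
for n in (25, 50, 100, 200, 400):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_error[n] = 1 - rf.oob_score_   # oob_score_ is accuracy for classifiers

for n, err in oob_error.items():
    print(f"{n:4d} trees  OOB error = {err:.3f}")
```

If the printed errors stop changing in the second or third decimal place, extra trees are buying very little.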

Can it overfit?

  • Not by adding trees: more trees do not make a forest overfit, and test error levels off rather than climbing
  • Risk comes mainly from very deep trees combined with too few estimators; fix by raising n_estimators (or min_samples_leaf)
  • RFs are remarkably forgiving of hyperparameter choices
RF resilience: Thousands of irrelevant features? Fine — trees ignore them. Non-linear interactions? Learned automatically. Outliers? Thresholds make them irrelevant. Normalisation? Not needed.
Knowledge Checkpoint · Quiz 3 of 5
Random Forests reduce the variance of single decision trees primarily through…
Section 6

Interpretability

Feature importance, OOB validation, partial dependence, and explaining individual predictions

Section 6 · Interpretability

Out-of-Bag (OOB) Error — Free Validation

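Because each tree only sees a bootstrapped subset of data, approximately 25–37% of rows are left out for any given tree, depending on the sampling scheme (about 37%, or 1/e, for a standard bootstrap of n rows drawn with replacement). These are the "out-of-bag" rows.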

1. For each row, collect predictions from all trees that did not use it in training
2. Average those predictions → OOB prediction for that row
3. Compare OOB predictions to true labels → OOB error score
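
The share of out-of-bag rows can be checked with a short simulation. This sketch assumes a standard bootstrap (n rows drawn with replacement from n); the row and tree counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 1000, 200

oob_fractions = []
for _ in range(n_trees):
    # One bootstrap sample: n_rows indices drawn with replacement.
    in_bag = rng.integers(0, n_rows, size=n_rows)
    # Rows whose index never came up are out-of-bag for this tree.
    n_oob = n_rows - np.unique(in_bag).size
    oob_fractions.append(n_oob / n_rows)

print(f"mean OOB fraction: {np.mean(oob_fractions):.3f}")
print(f"(1 - 1/n)^n      : {(1 - 1 / n_rows) ** n_rows:.3f}")
```

Both numbers sit close to 1/e, so with 100 trees each row is scored by roughly 37 trees that never saw it.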
Why this matters: OOB error is a built-in estimate of generalisation error that is close to unbiased (if anything, slightly pessimistic), so a separate validation set is often unnecessary. Enable with RandomForestClassifier(oob_score=True).

Feature Importance

For every split in every tree, track how much the Gini impurity improved × how many rows were in that node. Sum across all trees → feature importance score.

import matplotlib.pyplot as plt

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(figsize=(8, 5), color='#C0272D')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()

Titanic result

Sex ≫ Pclass ≈ Fare ≫ Age ≫ Embarked

Real-world power: A credit scoring team had 7,000 raw database columns. A Random Forest reduced this to ~30 key variables in under 2 hours — matching a multi-million-dollar consulting study.

Partial Dependence Plots (PDPs)

Feature importance tells you what matters. PDPs tell you how a feature affects the prediction, all else held equal.

How to compute a PDP for one feature

  1. Take the validation set
  2. Set the feature of interest to a fixed value (e.g., Age = 5) for all rows
  3. Run the forest and average all predictions → one point on the curve
  4. Repeat for every candidate value across the feature's range → full curve
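Key insight: Because we're averaging over all other feature values, the curve isolates how the model responds to the one variable rather than mixing it with correlated features like passenger class and fare. One caveat: when features are strongly correlated, some of the forced combinations are unrealistic, so read the curve with care.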
scikit-learn shortcut: from sklearn.inspection import PartialDependenceDisplay — just call .from_estimator(rf, X_val, features=[...])
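
The four steps can also be written out by hand, which makes clear what the library call computes. This is a minimal sketch on synthetic data; for brevity it reuses the training rows where the slide uses the validation set, and partial_dependence_curve is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: the class-1 probability rises with feature 0 only.
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                            random_state=0).fit(X, y)

def partial_dependence_curve(model, X, feature, grid):
    """Average predicted class-1 probability with one feature forced to each grid value."""
    curve = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature] = value          # step 2: same value for every row
        curve.append(model.predict_proba(X_fixed)[:, 1].mean())   # step 3
    return np.array(curve)

grid = np.linspace(-2, 2, 9)                 # step 4: sweep the feature's range
curve = partial_dependence_curve(rf, X, feature=0, grid=grid)
for value, avg in zip(grid, curve):
    print(f"x0 = {value:+.1f}  mean P(class 1) = {avg:.2f}")
```

Plotting curve against grid gives the partial dependence plot for that feature.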

Explaining a Single Prediction

Feature importance explains the model overall. But sometimes you need to explain why this specific passenger was predicted to die.

Tree-path contribution method

  1. For one row, trace its path through every tree in the forest
  2. At each split, record: which feature was used, and how much it changed the prediction
  3. Sum contributions across all trees → per-feature Δ prediction for that row
Result: A waterfall chart showing e.g. "Sex ↑ +0.42, Pclass ↓ −0.15, Age ↑ +0.07…" — intuitive enough to explain to a non-technical stakeholder why a loan was declined.
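
A minimal sketch of the tree-path method, assuming a binary scikit-learn forest and synthetic data. The helpers node_prob and explain_row are hypothetical names; the code reads the fitted trees' public tree_ arrays (children_left, children_right, feature, threshold, value).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5,
                            random_state=0).fit(X, y)

def node_prob(tree, node):
    """Class-1 proportion at a node (normalised, so counts or fractions both work)."""
    v = tree.value[node, 0]
    return v[1] / v.sum()

def explain_row(forest, x):
    """Return (baseline, per-feature contributions) averaged over all trees."""
    x = np.asarray(x, dtype=np.float32)      # sklearn trees compare float32 inputs
    contrib = np.zeros(x.size)
    baseline = 0.0
    for est in forest.estimators_:
        t = est.tree_
        node = 0
        baseline += node_prob(t, node)       # prediction at the root, before any split
        while t.children_left[node] != -1:   # -1 marks a leaf
            f = t.feature[node]
            child = (t.children_left[node] if x[f] <= t.threshold[node]
                     else t.children_right[node])
            # Credit the change in predicted probability to the feature split on.
            contrib[f] += node_prob(t, child) - node_prob(t, node)
            node = child
    n = len(forest.estimators_)
    return baseline / n, contrib / n

baseline, contrib = explain_row(rf, X[0])
print("baseline:", round(baseline, 3))
print("contributions:", np.round(contrib, 3))
print("baseline + contributions:", round(baseline + contrib.sum(), 3))
print("forest probability:      ", round(rf.predict_proba(X[:1])[0, 1], 3))
```

The baseline plus the per-feature contributions reproduces the forest's predicted probability for that row, which is what makes the waterfall chart add up.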
The SHAP library (pip install shap) provides a principled implementation of a closely related idea based on Shapley values, including a fast explainer for tree ensembles. SHAP values are widely used for single-prediction explanations.
Knowledge Checkpoint · Quiz 4 of 5
What is the key advantage of Out-of-Bag (OOB) error over using a separate validation set?
Section 7

Practical Tips & Bigger Picture

When to use RF, how it compares to boosting, and the Kaggle mindset

Section 7 · Practical

Bagging vs. Boosting

Two fundamentally different ensemble strategies — know when to use each.

Aspect | Bagging (Random Forest) | Boosting (XGBoost / LightGBM)
Core idea | Train trees on random subsets simultaneously, then average | Train tiny trees sequentially, each correcting the previous
Tuning effort | Minimal; hard to break | Many hyperparams, can overfit
Typical accuracy | Strong baseline | Often higher with careful tuning
When to use | First model: fast, robust, interpretable | Second pass, once baseline is established
Recommended workflow: Start with a Random Forest to get a strong baseline and understand your features. Then try gradient boosting if you need to squeeze out more accuracy.

Breiman's "Two Cultures" — A Final Thought

Leo Breiman (2001) argued there are two cultures in statistical modelling:

Data modelling culture

Assume a parametric form for the data (e.g., linear regression). Fit that form, explain coefficients. Prioritises interpretability of the assumed model.

Algorithmic culture

Let the algorithm find the structure. Prioritise predictive accuracy first, then explain the accurate model. Breiman championed this approach well before "data science" became a mainstream term.

Practical implication: A model that predicts poorly also explains poorly. Always get a good-fitting model first. A 60% accurate model that you can explain is worse than a 95% accurate model that you can also explain with SHAP.

The Kaggle Iteration Mindset

Jeremy Howard's approach to competition — applicable to every real project.

Reliable validation
Your local validation must mirror the hidden leaderboard. Without it, you're flying blind.
Rapid iteration
Models that train in seconds (RF on Titanic) let you try 50+ ideas. Slow models let you try 5.
Clean code
Avoid "Untitled (25).ipynb" chaos. Small, reproducible notebooks are essential.

Submission workflow

test_pred = rf.predict(X_test)
sub = pd.DataFrame({
    'PassengerId': test_df.PassengerId,
    'Survived': test_pred
})
sub.to_csv('submission_rf.csv', index=False)
# → Upload to Kaggle. Repeat. Improve.
Knowledge Checkpoint · Quiz 5 of 5
You need to explain to a bank manager why a specific customer's loan application was rejected by your Random Forest model. Which tool is most appropriate?
Week 7 · Summary

What We Covered Today

Theory

  • Binary splits → OneR → Decision Trees → Random Forests
  • Variance scoring & Gini impurity for split quality
  • Bagging: why averaging uncorrelated trees works
  • Bagging vs. Boosting trade-offs

Implementation

  • Pandas category dtype and cat.codes
  • DecisionTreeClassifier and RandomForestClassifier
  • OOB score for built-in validation
  • Feature importance & SHAP for explanations

Next Steps

  • Practical workshop: build and tune a Random Forest on Titanic in Colab
  • Submit to Kaggle — aim to beat your single-tree MAE
  • Try adding oob_score=True and inspecting feature importances
Key takeaway: Random Forests are the best "first serious model" for any tabular dataset. They're forgiving, interpretable, and surprisingly hard to break.
Workshop goal: Achieve MAE < 0.18 using a tuned Random Forest with feature importance analysis.