COIT20059 · Week 7

Random Forests

From a single binary split to a powerful ensemble

Today's journey: We'll start where we left off — a single yes/no rule — and build all the way up to a Random Forest that can handle thousands of messy real-world features.

Roadmap

Binary Split → OneR Model → Decision Tree → Bagging → Random Forest

James Cook University · College of Science and Engineering

Section 1

Why Random Forests?

Motivation, strengths, and where they fit in your ML toolkit

01

Section 1 · Why RF?

What Is a Random Forest?

Introduced by Leo Breiman in 2001, Random Forests became a go-to method for tabular data throughout the 2000s.

Core idea

Build many decision trees independently
Each tree sees a random subset of rows (bootstrapping) and features
Combine predictions by averaging (regression) or majority vote (classification)

Why it works

Individual trees have high variance but low bias
Averaging cancels out each tree's random errors
Result: a model with lower variance and strong accuracy

Fun fact: Jeremy Howard's early Kaggle wins earned him the nickname "Mr. Random Forests" — the method is that reliable.

Section 1 · Why RF?

Random Forests vs. Logistic Regression

Logistic regression is often called the "simple baseline" — but it's surprisingly fragile in practice.

	Logistic Regression	Random Forest
Feature engineering	Requires careful transforms, interaction terms, outlier handling	Handles non-linearities and interactions automatically
Outliers	Can collapse the model	Largely ignored — trees split on thresholds
Missing values	Must impute carefully	Forgiving; median fill usually sufficient
Failure rate	High — easy to misuse	Rare in practice

Key insight: Logistic regression is only simple if you do everything right. One slip in preprocessing and the whole model collapses. Random Forests are resilient — they learn the complexity instead of requiring you to engineer it in manually.

Section 2

Data Preparation

Getting the Titanic dataset ready — and why trees need less preprocessing than you think

02

Section 2 · Data Prep

FastAI Setup & Titanic Preprocessing

One import brings in NumPy, pandas, and matplotlib automatically:

from fastai.imports import *

# Assumes the Kaggle Titanic CSVs have already been downloaded into ./titanic
path = Path('titanic')
df = pd.read_csv(path/'train.csv')

Essential preprocessing steps

Fill missing values with the column median or mode, since scikit-learn's tree models have traditionally rejected NaN inputs
Log-transform Fare → log_fare = np.log1p(Fare) for a nicer distribution in plots
Convert Sex and Embarked to category dtype
Separate features into categorical and continuous lists

Remember: Trees split on the ordering of values, not their scale, so you don't need to normalise or standardise continuous features.
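The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the course's reference code: the helper name `prepare` and the toy rows are assumptions made up for the example, and only the column names follow the Kaggle Titanic file.

```python
import numpy as np
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the preprocessing steps listed above."""
    df = df.copy()
    # Fill each column's missing values with that column's mode.
    modes = df.mode().iloc[0]
    df = df.fillna(modes)
    # log1p compresses the long right tail of Fare and is safe at Fare = 0.
    df["log_fare"] = np.log1p(df["Fare"])
    # Nominal string columns become pandas categoricals.
    for col in ["Sex", "Embarked"]:
        df[col] = df[col].astype("category")
    return df

# Toy rows invented purely for illustration (not real Titanic records).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", None, "S", "C"],
    "Age": [22.0, np.nan, 30.0, 30.0],
    "Fare": [7.25, 71.28, 0.0, 8.05],
    "Pclass": [3, 1, 2, 3],
})
clean = prepare(toy)

# Separate feature names into categorical and continuous lists.
cats = ["Sex", "Embarked"]
conts = ["Age", "log_fare", "Pclass"]
```

Filling with the mode is one simple choice; the median works equally well for numeric columns such as Age.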

Section 2 · Data Prep

Pandas `category` dtype & `cat.codes`

Converting a string column to category is a critical step:

df['Sex'] = df['Sex'].astype('category')

# Inspect what pandas stores internally:
df['Sex'].cat.categories   # → Index(['female', 'male'])
df['Sex'].cat.codes        # → 0 = female, 1 = male

What this gives you

Human-readable labels are preserved
Internally stored as compact integers
No one-hot encoding needed for tree models
Faster computation on large datasets

When NOT to use category

Leave Pclass (1st, 2nd, 3rd) as a numeric column. Trees can learn thresholds like "Pclass < 2.5" directly — converting it forces the model to treat each class as unordered.

Rule of thumb: Use category for nominal variables (Sex, Embarked). Keep ordinal or naturally numeric variables as integers or floats.

Section 3

Binary Splits & The OneR Model

The foundation of every tree — one rule at a time

03

Section 3 · Binary Splits

What Is a Binary Split?

A binary split partitions all rows into exactly two groups based on a single rule:

e.g., Sex == "male" → Group Left | Group Right

The Titanic Sex split — a very strong rule

Group	Survival Rate	Count
Female	≈ 75%	~314
Male	≈ 20%	~577

With just one yes/no question, we can predict survival with reasonable accuracy. This is the entire basis of decision trees — we just keep asking more questions.

Ask one yes/no question → Split rows into two groups → Predict the group's mean
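The three steps (ask, split, predict the group mean) can be sketched on a tiny hand-made frame. The rows below are invented for illustration and are not Titanic records; the variable names are assumptions.

```python
import pandas as pd

# Toy data invented for illustration; 1 = survived, 0 = died.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# The yes/no question: is the passenger female?
is_female = toy["Sex"] == "female"

# Each side of the split predicts its own mean survival rate.
rate_yes = toy.loc[is_female, "Survived"].mean()
rate_no = toy.loc[~is_female, "Survived"].mean()

# Round each group mean to get a hard 0/1 prediction for every row.
pred = is_female.map({True: round(rate_yes), False: round(rate_no)})
mae = (pred - toy["Survived"]).abs().mean()
```

With 0/1 labels and 0/1 predictions, the MAE is simply the fraction of rows predicted wrongly.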

Section 3 · Binary Splits

Scoring a Split — The Variance Method

How do we know if a split is good? We measure how "pure" each resulting group is using its standard deviation:

score_side = σ(y_side) × |y_side|

split_score = (score_left + score_right) / |y|

Interpretation

σ (std dev) → how mixed the labels are in that group
× group size → weigh larger groups more heavily
Lower score = better split (tighter, more uniform groups)
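A minimal sketch of the scoring formula above, assuming NumPy arrays and a numeric threshold rule of the form x <= threshold. The function names and toy arrays are illustrative choices, not part of the original slides.

```python
import numpy as np

def side_score(mask, y):
    """Std dev of the targets on one side, weighted by that side's size."""
    side = y[mask]
    if len(side) <= 1:          # a group of 0 or 1 rows has no spread
        return 0.0
    return side.std() * len(side)

def split_score(x, y, threshold):
    """Size-weighted average std dev of the two groups made by x <= threshold."""
    left = x <= threshold
    return (side_score(left, y) + side_score(~left, y)) / len(y)

# Toy data invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

clean_split = split_score(x, y, 3.5)   # separates the two classes perfectly
mixed_split = split_score(x, y, 1.5)   # leaves a mixed right-hand group
```

The perfect split scores 0 because both groups have zero spread; the second split scores higher because its right-hand group still mixes 0s and 1s.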

Two splits compared

Split	MAE
Sex == female	0.215 ✓ better
log_fare > 2.7	0.333

Picking thresholds by eye is unreliable — scoring finds the best one automatically.

Section 3 · Binary Splits

The OneR Model

OneR = "One Rule" — find the single best binary split across all features and stop there.

Result on Titanic: "Predict survived = 1 if Sex == female" → MAE ≈ 0.215 (21.5% error)
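One way to sketch the OneR search is to score every candidate threshold in every column and keep the best. The helper `one_r` and the toy frame are assumptions for illustration; categorical columns are assumed to be integer-coded already.

```python
import numpy as np
import pandas as pd

def split_score(x, y, threshold):
    """Size-weighted std dev of y on each side of x <= threshold (lower is better)."""
    left = x <= threshold
    total = 0.0
    for mask in (left, ~left):
        if mask.sum() > 1:
            total += y[mask].std() * mask.sum()
    return total / len(y)

def one_r(df, target):
    """Return (score, column, threshold) of the best single split."""
    y = df[target].to_numpy()
    best = None
    for col in df.columns.drop(target):
        x = df[col].to_numpy()
        # Skip the largest value: splitting there leaves the right side empty.
        for t in np.unique(x)[:-1]:
            candidate = (split_score(x, y, t), col, t)
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best

# Toy data invented for illustration; Sex is integer-coded (0 = female).
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass":   [1, 3, 2, 3, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 1, 0, 0, 0, 0],
})
score, col, threshold = one_r(toy, "Survived")
```

On this toy frame the search picks the Sex column, since it separates the labels perfectly.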

Why start so simple?

A 1993 study by Robert Holte found OneR was often only slightly less accurate than far more complex models
It creates a baseline to beat — if your complex model doesn't clearly improve on this, something's wrong
It's transparent and explainable — you can show a business stakeholder exactly what the model does

Always build a simple baseline first. If your deep model barely beats a single yes/no rule, that's important information about your data.

Knowledge Checkpoint · Quiz 1 of 5

When scoring a binary split, a lower split score is better because it indicates…

A Larger group sizes on each side of the split
B A higher mean value of the target variable
C Each group has tightly clustered, more uniform target values
D The split uses fewer features

Section 4

Decision Trees

Recursively stacking binary splits until we have a powerful (but interpretable) model

04

Section 4 · Decision Trees

From One Split to a Full Tree

A decision tree is nothing more than recursively applied binary splits. After the first split (Sex), we split each group again:

Female branch

Best next split: Pclass == 1

Female & 1st class → ≈ 97% survive
Female & other class → ≈ 39% survive

Male branch

Best next split: Age < 6

Male & young child → ≈ 46% survive
Male & adult → ≈ 16% survive

After two levels of splitting we have 4 leaf nodes — a proper decision tree.

Leaf node: A terminal group where we stop splitting. The prediction for any row landing in that leaf is the group's mean (or majority class).

Section 4 · Decision Trees

Building a Tree with scikit-learn

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

tree = DecisionTreeClassifier(
    max_leaf_nodes=4,
    random_state=42
).fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(12, 5))
# Pass feature names as a plain list; some scikit-learn versions reject a pandas Index
plot_tree(tree, feature_names=list(X_train.columns),
          class_names=['Died','Survived'], filled=True, ax=ax)
plt.show()

Key parameter: `max_leaf_nodes`

Caps the number of leaves and hence the tree's capacity: the main knob for the bias-variance tradeoff
Start small (4–8 leaves) to understand the model, then increase
Alternative: min_samples_leaf=50 — each leaf must have ≥ 50 rows

Section 4 · Decision Trees

Interpreting the 4-Leaf Tree

Leaf	Rule	Survived / Total	Insight
L1	Female & 1st class	116 / 120	"Rich women lived"
L2	Female & not 1st class	73 / 186	Mixed outcomes
L3	Male & Age < 6	24 / 52	"Young boys had a chance"
L4	Male & Age ≥ 6	68 / 418	"Adult men mostly perished"

This four-leaf tree is fully human-readable. You can explain every prediction. This interpretability is a major advantage of tree-based models.

Section 4 · Decision Trees

Gini Impurity — scikit-learn's Split Criterion

scikit-learn uses Gini impurity to score splits (instead of our variance method). For a binary target:

Gini = 2 × p × (1 − p)

where p is the proportion of class-1 in the node.

Key values

Pure node (all same class): Gini = 0
50/50 mix: Gini = 0.5 (maximum)

How the tree grows

At each node, try every possible split for every feature. Choose whichever split minimises the weighted average Gini of the two resulting children.
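The growth rule can be sketched for a single numeric feature and a binary target as below. This is an illustrative toy, not scikit-learn's implementation; the function names and data are assumptions.

```python
import numpy as np

def gini(y):
    """Binary Gini impurity 2p(1-p), where p is the share of class 1."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2 * p * (1 - p)

def weighted_gini(x, y, threshold):
    """Average child impurity, weighted by each child's share of the rows."""
    left = x <= threshold
    n = len(y)
    return (left.sum() / n) * gini(y[left]) + ((~left).sum() / n) * gini(y[~left])

def best_threshold(x, y):
    """Try midpoints between sorted unique values; keep the lowest weighted Gini."""
    values = np.unique(x)
    midpoints = (values[:-1] + values[1:]) / 2
    scores = [weighted_gini(x, y, t) for t in midpoints]
    return midpoints[int(np.argmin(scores))], min(scores)

# Toy data invented for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
y = np.array([0, 0, 0, 1, 0, 1, 1])
threshold, score = best_threshold(x, y)
```

A full tree repeats this search over every feature at every node and then recurses into the two children.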

Gini vs variance scoring: Both measure purity; Gini is preferred in scikit-learn for classification. For regression trees, mean squared error (MSE) is used instead.

Section 4 · Decision Trees

How Well Does a Single Tree Perform?

Model	MAE (validation)	Notes
OneR — Sex split	0.215	Our baseline
Decision tree — 4 leaves	0.224	Slightly worse on small data
Decision tree — `min_samples_leaf=50`	0.183	Best so far

Why can one rule sometimes beat a tree?

With only ~890 training rows, deeper splits create tiny, noisy leaf nodes
Small leaves → unstable predictions → higher variance
This is the core problem that Random Forests solve

Note: On larger real-world datasets, a deeper tree with min_samples_leaf=50 typically outperforms OneR convincingly.

Knowledge Checkpoint · Quiz 2 of 5

A decision tree node has Gini impurity = 0. What does this tell you?

A The node has equal numbers of each class — a 50/50 split
B All samples in that node belong to the same class — a pure leaf
C The node has not yet been split
D The split used the maximum number of features

Section 5

Random Forests

Bagging many trees to get the best of all worlds

05

Section 5 · Random Forests

The Problem with a Single Tree

Deeper splits → smaller leaf nodes → fewer rows per leaf
Tiny leaves give unstable, high-variance predictions
Accuracy plateaus; adding more depth eventually overfits

Leo Breiman's insight — Bagging

What if we built many trees and averaged them? Errors that are random will cancel out; only the signal remains.

Three conditions for bagging to work:

Each model should be reasonably good individually (low bias)
Errors between models should be uncorrelated (different data/features)
Average over many models — positive and negative errors cancel

Bagging = Bootstrap Aggregating
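The role of uncorrelated errors can be illustrated with a small simulation. It is an idealised sketch: each "model" is just the true value plus Gaussian noise, which is an assumption made for illustration rather than a trained tree.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = 0.3                      # the quantity every model is trying to predict
n_models, n_trials = 100, 2000

# Each model = truth + its own independent, zero-mean error.
independent = truth + rng.normal(0.0, 0.2, size=(n_trials, n_models))

# Correlated case: every model also shares one common error per trial.
shared = rng.normal(0.0, 0.2, size=(n_trials, 1))
correlated = truth + shared + rng.normal(0.0, 0.2, size=(n_trials, n_models))

single_err = np.abs(independent[:, 0] - truth).mean()
bagged_err = np.abs(independent.mean(axis=1) - truth).mean()
correlated_err = np.abs(correlated.mean(axis=1) - truth).mean()
```

Averaging independent errors shrinks the error sharply, while the shared error in the correlated case survives averaging. This is why the second condition matters.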

Section 5 · Random Forests

Building a Random Forest — Step by Step

A Random Forest is bagged decision trees with an extra twist: random feature subsets at each split.

Step 1 — Bootstrap: Sample k% of training rows with replacement → creates a unique sub-dataset for this tree

Step 2 — Random features: At each split, only consider a random subset of columns (typically √p features)

Step 3 — Grow tree: Build a deep decision tree on this subset, with little or no pruning (averaging, not pruning, controls the variance)

Step 4 — Repeat: Build N trees (e.g., 100) — each sees different rows and different features

Final prediction: Average (regression) or majority vote (classification) across all N trees
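The four steps can be sketched by hand with scikit-learn's DecisionTreeClassifier. The synthetic data from make_classification stands in for Titanic (an assumption), and the settings are illustrative rather than tuned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the slides use Titanic).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
X_train, y_train, X_val, y_val = X[:450], y[:450], X[450:], y[450:]

rng = np.random.default_rng(0)
n_trees = 50
trees = []
for i in range(n_trees):
    # Step 1: bootstrap, i.e. draw row indices with replacement.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2-3: max_features="sqrt" makes the tree consider a random subset
    # of columns at every split; the tree itself is only lightly constrained.
    tree = DecisionTreeClassifier(max_features="sqrt", min_samples_leaf=5,
                                  random_state=i)
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Final prediction: average the trees' class-1 probabilities, then threshold.
probs = np.mean([t.predict_proba(X_val)[:, 1] for t in trees], axis=0)
forest_pred = (probs > 0.5).astype(int)
forest_acc = (forest_pred == y_val).mean()
single_acc = (trees[0].predict(X_val) == y_val).mean()
```

Averaging probabilities and thresholding is a soft form of majority voting; RandomForestClassifier combines its trees in the same way.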

Section 5 · Random Forests

Building a Random Forest in scikit-learn

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

rf = RandomForestClassifier(
    n_estimators=100,       # number of trees
    min_samples_leaf=5,     # minimum rows per leaf
    max_features=0.5,       # random subset of features per split
    oob_score=True,         # built-in validation (explained next)
    n_jobs=-1,              # use all CPU cores
    random_state=42
).fit(X_train, y_train)

print(f"OOB accuracy: {rf.oob_score_:.3f}")
print(f"Validation MAE: {mean_absolute_error(y_val, rf.predict(X_val)):.3f}")

Results on Titanic

Model	MAE	Comment
Single tree (min_leaf=50)	0.183	Previous best
Random forest (100 trees)	≈ 0.185	Similar — Titanic is tiny

Don't be fooled by the Titanic result. On datasets with thousands of rows and hundreds of features, Random Forests typically win convincingly. Titanic (~890 rows) is too small to show the full benefit.

Section 5 · Random Forests

How Many Trees Do You Need?

More trees rarely hurt: as you add trees, error tends to fall and then flatten out, with only small random fluctuations.

Practical guidance

50–100 trees usually reach the plateau of diminishing returns
Beyond that, you're paying in inference time for tiny gains
Start with 100, check your OOB error curve, then decide
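One way to check the OOB error curve is to grow the same forest in stages with warm_start=True and record the OOB error at each size. The synthetic data and the list of forest sizes are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; the slides use Titanic).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# warm_start=True keeps the trees already built and only adds new ones on refit.
rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            min_samples_leaf=5, random_state=0)

oob_error = {}
for n in [25, 50, 100, 200]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_error[n] = 1 - rf.oob_score_   # OOB error at this forest size
```

Plotting oob_error against the number of trees shows where the curve flattens, which is a reasonable place to stop adding trees.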

Can it overfit?

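Not by adding trees: extra trees stabilise the average rather than making the model more complex
The forest can still overfit noisy data if individual trees are too deep; raise min_samples_leaf in that case, and use enough estimators for a stable average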
RFs are remarkably forgiving of hyperparameter choices

RF resilience: Thousands of irrelevant features? Fine — trees ignore them. Non-linear interactions? Learned automatically. Outliers? Thresholds make them irrelevant. Normalisation? Not needed.

Knowledge Checkpoint · Quiz 3 of 5

Random Forests reduce the variance of single decision trees primarily through…

A Pruning branches that don't improve accuracy by more than 1%
B Averaging predictions from many trees trained on different random subsets of data and features
C Using Gini impurity instead of variance to score splits
D Training on the full dataset with regularisation applied to each tree

Section 6

Interpretability

Feature importance, OOB validation, partial dependence, and explaining individual predictions

06

Section 6 · Interpretability

Out-of-Bag (OOB) Error — Free Validation

Because each tree sees only a bootstrap sample of the data, roughly 37% of rows (about 1/e, when the sample is as large as the training set) are left out of any given tree. These are the "out-of-bag" rows.

1. For each row, collect predictions from all trees that did not use it in training

2. Average those predictions → OOB prediction for that row

3. Compare OOB predictions to true labels → OOB error score
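The share of rows a tree never sees can be checked with a quick simulation. This sketch assumes each bootstrap sample is as large as the training set; the row and tree counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_trees = 1000, 200

oob_fractions = []
for _ in range(n_trees):
    # One bootstrap sample: n_rows draws with replacement.
    sample = rng.integers(0, n_rows, size=n_rows)
    in_bag = np.zeros(n_rows, dtype=bool)
    in_bag[sample] = True
    oob_fractions.append(1 - in_bag.mean())   # share of rows never drawn

mean_oob = float(np.mean(oob_fractions))
theory = (1 - 1 / n_rows) ** n_rows   # tends to 1/e, about 0.368, as n_rows grows
```

Each row is missed by a single draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n.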

Why this matters: OOB error is a built-in, approximately unbiased (if anything slightly pessimistic) estimate of generalisation error, so no separate validation set is strictly required. Enable with RandomForestClassifier(oob_score=True).

Section 6 · Interpretability

Feature Importance

For every split in every tree, track how much the Gini impurity improved × how many rows were in that node. Sum across all trees → feature importance score.

import matplotlib.pyplot as plt

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(figsize=(8, 5), color='#C0272D')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()

Titanic result

Sex ≫ Pclass ≈ Fare ≫ Age ≫ Embarked

Real-world power: A credit scoring team had 7,000 raw database columns. A Random Forest reduced this to ~30 key variables in under 2 hours — matching a multi-million-dollar consulting study.

Section 6 · Interpretability

Partial Dependence Plots (PDPs)

Feature importance tells you what matters. PDPs tell you how a feature affects the prediction, all else held equal.

How to compute a PDP for one feature

Take the validation set
Set the feature of interest to a fixed value (e.g., Age = 5) for all rows
Run the forest and average all predictions → one point on the curve
Repeat for every candidate value across the feature's range → full curve
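The recipe above can be written out by hand in a few lines. The synthetic data, the regression forest, and the helper name `manual_pdp` are assumptions for illustration; in practice the scikit-learn helper does the same job.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: y rises with column 0 (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5,
                           random_state=0).fit(X, y)

def manual_pdp(model, X, feature, grid):
    """Average prediction with column `feature` forced to each grid value."""
    curve = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature] = value      # overwrite one column for every row
        curve.append(model.predict(X_fixed).mean())
    return np.array(curve)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
curve = manual_pdp(rf, X, feature=0, grid=grid)
```

Each entry of curve is one point on the partial dependence plot for column 0; plotting curve against grid gives the full line.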

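Key insight: Because we're averaging over all other feature values, the curve approximates the effect of the one variable on its own. Treat it with care when features are strongly correlated (such as passenger class and fare), since forcing one value onto every row can create unrealistic combinations.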

scikit-learn shortcut: from sklearn.inspection import PartialDependenceDisplay — just call .from_estimator(rf, X_val, features=[...])

Section 6 · Interpretability

Explaining a Single Prediction

Feature importance explains the model overall. But sometimes you need to explain why this specific passenger was predicted to die.

Tree-path contribution method

For one row, trace its path through every tree in the forest
At each split, record: which feature was used, and how much it changed the prediction
Sum contributions across all trees → per-feature Δ prediction for that row
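A rough sketch of the tree-path idea for a regression forest, using the fitted trees' internal arrays (tree_.feature, tree_.threshold, tree_.value). The synthetic data and helper names are assumptions, and because the forest averages its trees, the contributions here are averaged across trees rather than summed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
rf = RandomForestRegressor(n_estimators=30, min_samples_leaf=5,
                           random_state=0).fit(X, y)

def tree_contributions(tree, x):
    """Walk one row down one fitted tree; return (root mean, per-feature deltas)."""
    t = tree.tree_
    x32 = x.astype(np.float32)           # scikit-learn compares float32 inputs
    contrib = np.zeros(len(x))
    node = 0
    while t.children_left[node] != -1:   # -1 marks a leaf
        feat = t.feature[node]
        if x32[feat] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # Credit the change in the node mean to the feature used for the split.
        contrib[feat] += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return t.value[0, 0, 0], contrib

def forest_contributions(forest, x):
    """Average the per-tree baselines and contributions across the forest."""
    parts = [tree_contributions(est, x) for est in forest.estimators_]
    bias = np.mean([b for b, _ in parts])
    contrib = np.mean([c for _, c in parts], axis=0)
    return bias, contrib

row = X[0]
bias, contrib = forest_contributions(rf, row)
```

The baseline plus the per-feature contributions adds up to the forest's prediction for that row, which is what makes a waterfall chart possible.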

Result: A waterfall chart showing e.g. "Sex ↑ +0.42, Pclass ↓ −0.15, Age ↑ +0.07…" — intuitive enough to explain to a non-technical stakeholder why a loan was declined.

The SHAP library provides a principled implementation of this idea, including a fast TreeExplainer for tree ensembles: pip install shap. SHAP values are widely used for single-prediction explanations.

Knowledge Checkpoint · Quiz 4 of 5

What is the key advantage of Out-of-Bag (OOB) error over using a separate validation set?

A OOB error is always lower (more optimistic) than hold-out validation error
B OOB error can only be computed on classification tasks, not regression
C It gives an unbiased generalisation estimate without removing any rows from training
D OOB error measures how well the forest memorised the training data

Section 7

Practical Tips & Bigger Picture

When to use RF, how it compares to boosting, and the Kaggle mindset

07

Section 7 · Practical

Bagging vs. Boosting

Two fundamentally different ensemble strategies — know when to use each.

Aspect	Bagging (Random Forest)	Boosting (XGBoost / LightGBM)
Core idea	Train trees on random subsets simultaneously, then average	Train tiny trees sequentially, each correcting the previous
Tuning effort	Minimal — hard to break	Many hyperparams, can overfit
Typical accuracy	Strong baseline	Often higher with careful tuning
When to use	First model — fast, robust, interpretable	Second pass, once baseline is established

Recommended workflow: Start with a Random Forest to get a strong baseline and understand your features. Then try gradient boosting if you need to squeeze out more accuracy.

Section 7 · Practical

Breiman's "Two Cultures" — A Final Thought

Leo Breiman (2001) argued there are two cultures in statistical modelling:

Data modelling culture

Assume a parametric form for the data (e.g., linear regression). Fit that form, explain coefficients. Prioritises interpretability of the assumed model.

Algorithmic culture

Let the algorithm find the structure. Prioritise predictive accuracy first — then explain the accurate model. Breiman championed this approach decades before "data science" existed.

Practical implication: A model that predicts poorly also explains poorly. Always get a good-fitting model first. A 60% accurate model that you can explain is worse than a 95% accurate model that you can also explain with SHAP.

Section 7 · Practical

The Kaggle Iteration Mindset

Jeremy Howard's approach to competition — applicable to every real project.

Reliable validation
Your local validation must mirror the hidden leaderboard. Without it, you're flying blind.

Rapid iteration
Models that train in seconds (RF on Titanic) let you try 50+ ideas. Slow models let you try 5.

Clean code
Avoid "Untitled (25).ipynb" chaos. Small, reproducible notebooks are essential.

Submission workflow

test_pred = rf.predict(X_test)
sub = pd.DataFrame({
    'PassengerId': test_df.PassengerId,
    'Survived': test_pred
})
sub.to_csv('submission_rf.csv', index=False)
# → Upload to Kaggle. Repeat. Improve.

Knowledge Checkpoint · Quiz 5 of 5

You need to explain to a bank manager why a specific customer's loan application was rejected by your Random Forest model. Which tool is most appropriate?

A Feature importance plot — it shows the most influential features across all predictions
B A partial dependence plot for income — it shows how income affects predictions on average
C SHAP values for that specific customer — they quantify each feature's contribution to this individual prediction
D OOB error — it measures whether the model is generalising well overall

Week 7 · Summary

What We Covered Today

Theory

Binary splits → OneR → Decision Trees → Random Forests
Variance scoring & Gini impurity for split quality
Bagging: why averaging uncorrelated trees works
Bagging vs. Boosting trade-offs

Implementation

Pandas category dtype and cat.codes
DecisionTreeClassifier and RandomForestClassifier
OOB score for built-in validation
Feature importance & SHAP for explanations

Next Steps

Practical workshop: build and tune a Random Forest on Titanic in Colab
Submit to Kaggle — aim to beat your single-tree MAE
Try adding oob_score=True and inspecting feature importances

Key takeaway: Random Forests are the best "first serious model" for any tabular dataset. They're forgiving, interpretable, and surprisingly hard to break.

Workshop goal: Achieve MAE < 0.18 using a tuned Random Forest with feature importance analysis.
