CP3501 · Deep Learning · Week 6

From Data to Neural Networks

Building and understanding tabular deep learning models — from scratch to framework

Data Preprocessing · Linear Model · Gradient Descent · Neural Network · FastAI Framework

Subject: CP3501 / CP5701
Dataset: Titanic (Kaggle)
Tools: PyTorch · FastAI · Pandas
Outcome: SLO1 · SLO2 · SLO3

Lesson Overview

What We Are Building Today

By the end of this lesson, you will have implemented a complete binary classifier — twice: once manually, once with FastAI.

1. Preprocess Titanic Data
2. Linear Model in PyTorch
3. Gradient Descent Loop
4. Add Neural Network Layers
5. Redo in FastAI

What you'll build

A model that predicts whether a Titanic passenger survived — achieving 82%+ accuracy

Key concept

Build from scratch first so you understand what frameworks do automatically

Why Titanic?

Real data, mixed types, missing values, categorical features — everything you'll face in practice

Part 1 of 4

Data Preprocessing

Before any model can learn, data must be clean, numerical, and scaled. This is where most real-world time is spent.

Part 1 · Preprocessing

The Titanic Dataset

Each row is one passenger. The goal is to predict the Survived column (0 = died, 1 = survived).

Column	Type	Example
Survived	Binary target	0 or 1
Pclass	Categorical (numeric)	1, 2, 3
Sex	Categorical (text)	male / female
Age	Continuous	24.0
Fare	Continuous, skewed	7.25 to 512
Embarked	Categorical (text)	C, Q, S
Cabin	Mostly missing	B96 B98

Four preprocessing challenges:

① Missing values in Age, Cabin, Embarked
② Skewed distribution in Fare
③ Text categories can't be multiplied by coefficients
④ Columns at different scales

Load the data:

import pandas as pd
df = pd.read_csv('titanic.csv')
df.head()

Step 1 — Handle Missing Values

Finding missing values

# Boolean mask: True where value is missing
df.isna().sum()

# Result:
# Age      177  ← some missing
# Cabin    687  ← mostly empty
# Embarked   2  ← nearly complete

Key insight: In pandas/numpy, True = 1 and False = 0, so .sum() counts the missing values per column.

Mode imputation — the simple fix

# Replace missing with most common value
mode_vals = df.mode().iloc[0]
df.fillna(mode_vals, inplace=True)

# What gets filled in:
# Age      → 24
# Cabin    → B96 B98
# Embarked → S

Philosophy: Start with the simplest method that always works. Mode imputation works for both numeric and categorical columns. Build complexity later.
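
To see mode imputation on something small enough to check by hand, here is a minimal sketch. The toy DataFrame and its values are made up for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Toy frame (hypothetical values) with one gap in a numeric and one in a text column
toy = pd.DataFrame({
    "Age": [24.0, None, 24.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

# mode() returns a DataFrame because ties are possible;
# row 0 holds the first mode of every column
mode_vals = toy.mode().iloc[0]

# fillna with a Series fills each column using the value under the same label
filled = toy.fillna(mode_vals)
print(filled)
```

The missing Age becomes 24.0 and the missing Embarked becomes "S", the most common value in each column.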

Step 2 — Fix Skewed Distributions

The Fare problem

Most passengers paid <£50, but a few paid up to £512. This long tail causes problems in linear models — the model spends too much effort fitting a handful of extreme values.

Log transformation — the fix

import numpy as np

# +1 so that a fare of £0 maps to log(1) = 0 instead of -inf
df['LogFare'] = np.log(df['Fare'] + 1)

# Equivalent shorthand:
df['LogFare'] = np.log1p(df['Fare'])

Rule of thumb: Money, population, and any data that spans several orders of magnitude usually needs a log transform.
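
A small sketch of how much the transform compresses the tail. The four fares are illustrative values chosen to span the range described above, not rows from the dataset.

```python
import numpy as np

fares = np.array([0.0, 7.25, 50.0, 512.0])  # illustrative fares in pounds
log_fares = np.log1p(fares)                  # same as np.log(fares + 1)

# Ratio between the largest fare and a typical low fare, before and after
raw_ratio = fares.max() / fares[1]
log_ratio = log_fares.max() / log_fares[1]
print(raw_ratio, log_ratio)
```

Before the transform the largest fare is about 70 times the low fare; afterwards the ratio is about 3, so the extreme values no longer dominate.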

Step 3 — Convert Categories to Numbers

A neural network multiplies values by coefficients. You cannot multiply "male" by a number — so we convert categories to binary (0/1) columns.

The problem

Sex	Embarked
male	S
female	C
male	Q

These cannot be used in a linear equation as-is.

One-hot encoding (dummy variables)

Sex_male	Sex_female	Emb_S	Emb_C	Emb_Q
1	0	1	0	0
0	1	0	1	0
1	0	0	0	1

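# pandas does this in one line. List the columns explicitly:
# Pclass is stored as an integer, so it is not encoded by default.
df_encoded = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

# Creates: Sex_female, Sex_male, Pclass_1, Pclass_2, Pclass_3, Embarked_C, Embarked_Q, Embarked_S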

Strategy used here: Keep all levels (n encoding, no dropped column). Each row has exactly one True per category group. No constant term needed.
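
The three-row example above can be reproduced directly. This is a minimal sketch on a toy frame; the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                    "Embarked": ["S", "C", "Q"]})

# Keep every level (no drop_first), as in the strategy described above
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(list(encoded.columns))

# Within one category group, each row has exactly one active column
sex_cols = [c for c in encoded.columns if c.startswith("Sex_")]
row_sums = encoded[sex_cols].sum(axis=1)
print(row_sums.tolist())
```

Because the dummy columns of a group always sum to 1 per row, they already play the role a constant term would play in a linear model.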

Knowledge Checkpoint Quiz

Check Your Understanding — Preprocessing

After one-hot encoding the Sex column (values: male, female), how many new columns are created using full n-encoding?

1 column — Sex_male only (drop one level to avoid multicollinearity)

2 columns — Sex_male and Sex_female (keep all levels)

3 columns — one for each category plus an "other" column

The Sex column is dropped entirely — it causes bias

You have a Fare column with values ranging from £0 to £512. Why should you apply a log transformation before training?

Log makes all values exactly equal, so training is fair

It converts Fare from continuous to categorical

It compresses the long tail so extreme values don't dominate the model

Log transformation removes missing values automatically

Part 2 of 4

Building a Linear Model from Scratch

We'll implement every step manually in PyTorch — tensors, normalization, predictions, loss, and gradient descent. No shortcuts yet.

Part 2 · Linear Model

Tensors — The Foundation

Why PyTorch instead of NumPy?

PyTorch tensors do everything NumPy arrays do, plus:

Automatic differentiation — calculates gradients for you
GPU acceleration — same code runs on GPU
One library instead of two

Creating tensors from data

import torch

# Target variable (what we're predicting)
y = torch.tensor(df['Survived'].values)

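# Input features; feat_cols is the list of numeric and dummy column names.
# Cast to float first: mixing bool dummies with float columns yields an object array.
X = torch.tensor(df_encoded[feat_cols].values.astype(float),
                  dtype=torch.float)
# X.shape → (891, 12)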

Tensor ranks explained
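
The rank of a tensor is its number of dimensions. A minimal sketch, using shapes that mirror the ones in this lesson (the values themselves are placeholders):

```python
import torch

scalar = torch.tensor(3.0)               # rank 0: a single number, e.g. a loss
vector = torch.tensor([1.0, 2.0, 3.0])   # rank 1: e.g. one coefficient per feature
matrix = torch.zeros(891, 12)            # rank 2: rows are passengers, columns are features

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, "rank:", t.ndim, "shape:", tuple(t.shape))
```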

Step 4 — Normalise Features

Raw features live on very different scales. If Age ranges 0–80 and Sex_male is 0 or 1, the coefficients are not comparable.

The scaling problem

Max normalisation — divides by max value

# Max value per column (across rows)
max_vals = X.max(dim=0).values

# Divide each column by its max
X_norm = X / max_vals
# Now all features range from 0.0 to 1.0

Other options: Z-score (X - mean) / std or Min-Max scaling. For most tabular problems, the choice matters less than ensuring you do normalise.
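
The three options mentioned above can be compared side by side. This is a sketch on a made-up two-column tensor (a hypothetical Age and Sex_male); it is not the lesson's feature matrix.

```python
import torch

X = torch.tensor([[22.0, 1.0],
                  [38.0, 0.0],
                  [80.0, 1.0]])  # hypothetical Age, Sex_male

# Max normalisation: divide each column by its largest value
max_norm = X / X.max(dim=0).values

# Min-max scaling: shift each column to start at 0, then divide by its range
col_min, col_max = X.min(dim=0).values, X.max(dim=0).values
minmax = (X - col_min) / (col_max - col_min)

# Z-score: subtract the column mean, divide by the column standard deviation
zscore = (X - X.mean(dim=0)) / X.std(dim=0)

print(max_norm, minmax, zscore, sep="\n")
```

Max normalisation only lands in the 0 to 1 range because every feature here is non-negative; min-max scaling gives that range regardless of sign.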

Making Predictions — Matrix Multiplication

The linear model equation

# Random initial coefficients in [-0.5, 0.5), one per feature
torch.manual_seed(42)
n_coef = X_norm.shape[1]
coeffs = torch.rand(n_coef) - 0.5

# Make predictions
preds = (X_norm * coeffs).sum(dim=1)

# Equivalent, using matrix multiply:
preds = X_norm @ coeffs

* is element-wise multiply (here followed by a sum across each row)
@ is matrix multiply
Both give the same result here, but @ is cleaner and scales to neural networks.

Broadcasting visualised
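
A small numeric sketch of what broadcasting does in the prediction step. The 2 × 3 matrix and the coefficients are made-up values chosen so the arithmetic is easy to follow.

```python
import torch

X = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])      # 2 rows, 3 features
coeffs = torch.tensor([0.5, -1.0, 2.0])  # one coefficient per feature

# coeffs has shape (3,); broadcasting reuses it for every row of the (2, 3) matrix
elementwise = (X * coeffs).sum(dim=1)

# Matrix-vector product computes the same row-wise weighted sums
matmul = X @ coeffs

print(elementwise)  # tensor([4.5000, 9.0000])
print(matmul)       # tensor([4.5000, 9.0000])
```

Row one is 1 × 0.5 + 2 × (-1) + 3 × 2 = 4.5 and row two is 4 × 0.5 + 5 × (-1) + 6 × 2 = 9.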

Gradient Descent — How the Model Learns

The intuition — rolling downhill

Think of loss as a hill. The gradient tells us which direction is uphill. We take a step in the opposite direction.

The four-step training loop

1. Enable gradients: tell PyTorch to track operations on the coefficients
2. Forward pass: compute predictions and calculate the loss (MAE)
3. Backward pass: PyTorch computes the gradients automatically
4. Update: subtract the gradient, scaled by the learning rate, from the coefficients

# The update equation
coeffs -= lr * coeffs.grad

# Why subtract?
# Gradient = direction of increasing loss
# We want LESS loss → go opposite way
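
The four steps can be tried on a problem with a known answer. This sketch minimises the toy loss (w - 3)², which is an assumption made for illustration and not the Titanic loss, so the minimum at w = 3 is easy to verify.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)  # step 1: track gradients for w
lr = 0.1

for step in range(50):
    loss = (w - 3.0) ** 2       # step 2: forward pass and loss
    loss.backward()             # step 3: backward pass fills w.grad
    with torch.no_grad():
        w -= lr * w.grad        # step 4: move against the gradient
        w.grad.zero_()          # reset for the next iteration

print(w.item())  # close to 3.0
```

Each update shrinks the distance to 3 by a constant factor, which is the "rolling downhill" behaviour described above.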

The Complete Training Loop

def calc_preds(coeffs, X):
    return (X * coeffs).sum(dim=1)

def calc_loss(coeffs, X, y):
    preds = calc_preds(coeffs, X)
    return torch.abs(preds - y).mean()

def train(epochs=30, lr=0.1):
    coeffs = torch.rand(n_coef) - 0.5
    coeffs.requires_grad_()          # ① enable

    for epoch in range(epochs):
        loss = calc_loss(coeffs, X_tr, y_tr)  # ② forward
        loss.backward()                         # ③ gradients

        with torch.no_grad():
            coeffs -= lr * coeffs.grad             # ④ update
            coeffs.grad.zero_()                 # clear grads

    return coeffs

Why zero the gradients? PyTorch accumulates gradients by default. If you don't zero them, each step adds to the previous one — giving wrong updates.
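
The accumulation behaviour is easy to observe directly. A minimal sketch with a single parameter and the toy function 2w, whose gradient is always 2:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()          # 2.0

(2 * w).backward()             # no zeroing in between
accumulated = w.grad.item()    # 4.0: the new gradient was added to the old one

w.grad.zero_()
(2 * w).backward()
after_zero = w.grad.item()     # 2.0 again

print(first, accumulated, after_zero)
```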

Loss over training:

Result: Loss drops from 0.53 → ~0.30. Accuracy ~79%.

The Sigmoid Function — Essential for Binary Classification

The problem with raw predictions

Without sigmoid, our model outputs values like -0.3 or 1.7. But survival probability must be between 0 and 1.

The sigmoid formula

σ(x) = 1 / (1 + e^(-x))

Sigmoid squashes every input into the open interval (0, 1)

# Apply sigmoid to predictions
def calc_preds(coeffs, X):
    return torch.sigmoid(X @ coeffs)

# Result: accuracy 0.79 → 0.823
# Learning rate could be raised: 0.1 → 2.0 (training becomes easier)

Rule: For binary classification (0 or 1 target), always end your model with sigmoid. Missing it is the #1 beginner mistake.
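
To check the formula against the built-in function, here is a short sketch on a few raw outputs, including the -0.3 and 1.7 mentioned above (the other values are illustrative).

```python
import torch

raw = torch.tensor([-5.0, -0.3, 0.0, 1.7, 5.0])

manual = 1 / (1 + torch.exp(-raw))   # the formula written out
builtin = torch.sigmoid(raw)         # PyTorch's implementation

print(builtin)
```

An input of 0 maps to exactly 0.5, large negative inputs approach 0, and large positive inputs approach 1.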

Interpreting Coefficients — Do They Make Sense?

After training, inspect the learned coefficients. They should align with historical knowledge. If they don't — something is wrong.

# View coefficients with feature names
dict(zip(feat_cols, coeffs.detach().tolist()))

Feature	Coefficient	Interpretation
Age	Negative	Older → less likely to survive
Sex_male	Negative	Males had lower survival
Pclass_1	Positive	1st class → more likely to survive
Fare (log)	Positive	Higher fare → better odds

These results match history: "Women and children first" was the famous Titanic evacuation order. The negative coefficients for Age and Sex_male are consistent with it, which suggests the model has learned something real.

Accuracy metric vs. loss:

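We train using MAE loss because it changes smoothly as the coefficients change, so it has useful gradients. Accuracy only changes when a prediction crosses the 0.5 threshold, so its gradient is zero almost everywhere and it cannot guide training. We report accuracy because that is what Kaggle scores. Both are needed, for different purposes.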

# Calculate accuracy
preds_binary = (preds > 0.5).float()
acc = (preds_binary == y).float().mean()
# → 82.3% on validation set

Knowledge Checkpoint Quiz

Check Your Understanding — Linear Models

In gradient descent, after computing loss.backward(), the update step is: coeffs -= lr * coeffs.grad. Why do we subtract rather than add?

Because subtraction is computationally faster than addition in PyTorch

The gradient points toward increasing loss; we subtract to move in the direction of decreasing loss

It's a convention — addition would work equally well

Because negative coefficients represent survival and positive ones represent death

You build a binary classifier but forget to add sigmoid at the final layer. What symptom are you most likely to observe?

The model trains perfectly but is very slow

An error is thrown because PyTorch requires sigmoid

Predictions go outside [0,1] and the model trains poorly or fails to converge

Accuracy immediately reaches 100% because the model is unconstrained

Part 3 of 4

Building a Neural Network

The linear model draws a straight decision boundary. Real survival depended on complex interactions — young women in 3rd class had different odds than old men in 1st class. We need curves, not lines.

Part 3 · Neural Network

Why Go Beyond a Linear Model?

Linear model limitation

Neural network: curved boundaries

How do we get curves? By stacking linear layers with a non-linear activation function (ReLU) in between. ReLU is what gives neural networks their expressive power.

19 / 30

Part 3 · Neural Network

Neural Network Architecture

Our architecture

Input layer: 12 features from Titanic data
Hidden layer: 20 activations with ReLU
Output layer: 1 survival prediction with Sigmoid

Weight matrices:
W1: (12 × 20) = 240 params
W2: (20 × 1) + bias = 21 params
Total: 261 parameters
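
The parameter count follows from the layer sizes. A short sketch of the arithmetic, assuming (as in this lesson's design) that only layer 2 has a constant term:

```python
n_in, n_hidden, n_out = 12, 20, 1

w1_params = n_in * n_hidden            # 12 x 20 weights, no bias on layer 1
w2_params = n_hidden * n_out + n_out   # 20 weights plus 1 bias
total = w1_params + w2_params

print(w1_params, w2_params, total)  # 240 21 261
```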

ReLU — The Non-linearity That Makes It Work

What is ReLU?

Rectified Linear Unit — the simplest useful activation function:

ReLU(x) = max(0, x)

Negative → 0 | Positive → unchanged

The full forward pass

def calc_preds(coeffs, X):
    l1, l2, const = coeffs

    # Layer 1: input → hidden
    h1 = (X @ l1).relu()

    # Layer 2: hidden → output
    out = (h1 @ l2) + const

    # Final activation (binary); squeeze (n, 1) → (n,) so it lines up with y
    return out.sigmoid().squeeze(1)

Activation function rules:
— Hidden layers → ReLU
— Final layer for binary classification → Sigmoid
— Final layer for multi-class → Softmax
— Final layer for regression → None
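
Why the hidden layer needs ReLU can be checked numerically: without it, two stacked linear layers are equivalent to a single linear layer. The sketch below uses random placeholder weights with the same shapes as this lesson's network.

```python
import torch

torch.manual_seed(0)
X = torch.randn(5, 12)     # 5 placeholder rows, 12 features
W1 = torch.randn(12, 20)   # input → hidden
W2 = torch.randn(20, 1)    # hidden → output

no_relu = (X @ W1) @ W2            # two linear layers, no activation
single = X @ (W1 @ W2)             # one linear layer with merged weights
with_relu = (X @ W1).relu() @ W2   # ReLU in between

print(torch.allclose(no_relu, single, atol=1e-3))    # the two agree
print(torch.allclose(with_relu, single, atol=1e-3))  # ReLU breaks the equivalence
```

Matrix multiplication is associative, so the first two results match up to rounding. Only the version with ReLU computes something a single linear layer cannot.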

Initialising Neural Network Weights

Unlike the linear model, neural networks need careful weight initialisation to avoid gradients that are too large or too small.

# Layer 1: input (n_coef) → hidden (20)
# torch.rand is uniform on [0, 1), so subtracting 0.5 centres the weights on zero
n_hidden = 20
l1 = (torch.rand(n_coef, n_hidden) - 0.5) / n_hidden

# Layer 2: hidden (20) → output (1)
l2 = torch.rand(n_hidden, 1) - 0.3

# Constant (bias) term for layer 2
const = torch.rand(1) - 0.3

# Enable gradients for all parameters
for p in [l1, l2, const]:
    p.requires_grad_()

Why divide by n_hidden?

Layer 2 sums the 20 hidden activations, so its output can be roughly 20 times larger than a single activation. Dividing the layer 1 weights by n_hidden keeps the output on a similar scale to the linear model, so the gradients don't explode on the first step.

Why a constant term for layer 2?

Layer 1 doesn't need a constant because one-hot encoding already covers all levels (the dummy variables sum to 1). Layer 2 does need one — without it, the output is forced through the origin.

Knowledge Checkpoint Quiz

Check Your Understanding — Neural Networks

In our 2-layer neural network (Input → Hidden → Output), what activation function should be used after the hidden layer, and what should be used at the output layer for binary classification?

Hidden: Sigmoid | Output: ReLU

Hidden: ReLU | Output: Sigmoid

Hidden: ReLU | Output: Softmax

Hidden: None | Output: ReLU

If you add a second hidden layer with 10 units after the first hidden layer of 20 units, what should be the shape of the new weight matrix connecting them?

(12 × 10) — same as the input layer shape

(10 × 20) — reverse of the first layer

(20 × 10) — takes 20 inputs from hidden layer 1, outputs 10 activations

(20 × 1) — always ends with a single output

Part 4 of 4

FastAI — The Same Model in Far Less Code

Now that you understand every step, let FastAI automate the boilerplate. The concepts are identical — the framework just handles preprocessing, initialisation, and training loops for you.

Part 4 · FastAI

Manual Implementation vs. FastAI — Side by Side

Manual PyTorch

# Missing values (fillna returns a new frame, so assign it back)
df = df.fillna(df.mode().iloc[0])

# Dummy variables
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

# Normalise
X = X / X.max(dim=0).values

# Init weights manually
l1 = (torch.rand(n_coef, 20) - 0.5) / 20
l2 = torch.rand(20, 1) - 0.3

# Training loop: ~20 lines

FastAI equivalent

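from fastai.tabular.all import *

# All preprocessing in one call:
dls = TabularDataLoaders.from_df(
    df,
    procs=[Categorify, FillMissing, Normalize],  # ← the preprocessing steps to run
    cat_names=['Pclass', 'Sex'],    # ← categorical columns
    cont_names=['Age', 'LogFare'],  # ← normalised
    y_names='Survived',
    y_block=CategoryBlock(),        # treat the 0/1 target as a class label
    valid_idx=valid_idx
)

# Model + training:
learn = tabular_learner(dls,
          layers=[10, 10],
          metrics=accuracy)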

FastAI handles automatically: categorical encoding (as embeddings rather than dummy columns), missing values, normalisation, weight initialisation, gradient zeroing, learning rate scheduling, and the training loop.

Finding the Right Learning Rate

Choosing a learning rate manually is guesswork. FastAI includes a learning rate finder that tests many values systematically.

# Run the learning rate finder
learn.lr_find()

How it works

1. Starts with a very small learning rate (1e-7)
2. Gradually increases it across mini-batches
3. Tracks the loss at each step
4. Plots loss vs. learning rate, and you pick from the curve

Reading the lr_find plot

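Rule of thumb: Pick a learning rate from the region where the loss is falling most steeply, somewhat before the minimum of the curve. At the minimum itself the loss is about to start climbing, so that value is usually too high. The right choice is typically on the downward side of the "valley".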
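
The idea behind the finder can be sketched in plain PyTorch. This is a simplified illustration and not FastAI's implementation: it uses synthetic data, a three-coefficient sigmoid model, and the full dataset at every step instead of mini-batches. All names and values here are assumptions.

```python
import torch

torch.manual_seed(0)
X = torch.rand(200, 3)                            # synthetic features
true_w = torch.tensor([1.5, -2.0, 0.5])
y = (torch.sigmoid(X @ true_w) > 0.5).float()     # synthetic 0/1 target

w = torch.zeros(3, requires_grad=True)
lrs = torch.logspace(-7, 1, steps=50)             # 1e-7 up to 10, evenly spaced in log scale
losses = []

for lr in lrs:
    loss = torch.abs(torch.sigmoid(X @ w) - y).mean()  # MAE, as in the manual model
    loss.backward()
    losses.append(loss.item())                         # record loss at this learning rate
    with torch.no_grad():
        w -= lr * w.grad                               # one update with the current rate
        w.grad.zero_()

# Plotting losses against lrs (log x-axis) gives the curve you read the rate from
```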

Ensemble Methods: Easy Accuracy Boost

The idea

Train multiple identical models independently. Each starts with different random weights → learns slightly differently → makes different errors. Combining them cancels individual errors.

def train_one():
    learn = tabular_learner(dls, layers=[10,10])
    learn.fit(10, 0.03)
    # get_preds returns (predictions, targets); keep only the predictions
    return learn.get_preds(dl=test_dl)[0]

# Train 5 independent models
all_preds = [train_one() for _ in range(5)]

# Average their probability predictions
final = torch.stack(all_preds).mean(0)

Results on Kaggle

Approach	Kaggle Rank
Single linear model	~50th percentile
Single neural network	~50th percentile
Ensemble of 5 NNs	~25th percentile

Key insight: Small code changes can yield substantial performance improvements. Ensembling is one of the highest return-on-effort techniques in practice.

How to aggregate predictions?

Average the probabilities (not the binary 0/1 values), then apply a threshold of 0.5. This retains more information than voting.
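
A small example of how averaging and voting can disagree. The probabilities below are made up: five hypothetical models (rows) scoring three passengers (columns).

```python
import torch

probs = torch.tensor([[0.90, 0.45, 0.20],
                      [0.80, 0.45, 0.30],
                      [0.70, 0.45, 0.10],
                      [0.60, 0.95, 0.40],
                      [0.55, 0.95, 0.25]])

# Average the probabilities, then threshold
avg_pred = (probs.mean(dim=0) > 0.5).int()

# Threshold each model first, then take the majority vote
votes = (probs > 0.5).int()
vote_pred = (votes.sum(dim=0) > 2).int()

print(avg_pred.tolist())   # [1, 1, 0]
print(vote_pred.tolist())  # [1, 0, 0]
```

For the second passenger, three models are barely below 0.5 and two are very confident. Voting discards that confidence and predicts 0, while the averaged probability of 0.65 predicts 1.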

Knowledge Checkpoint Quiz

Check Your Understanding — FastAI & Ensembles

In the FastAI TabularDataLoaders.from_df() call, what is the difference between cat_names and cont_names?

cat_names are columns you care about; cont_names are continuous columns you'll drop

cat_names are categorical columns (FastAI applies embeddings/dummy vars); cont_names are numeric columns (FastAI normalises these)

Both do the same thing — they're just different parameter names for historical reasons

cat_names is for training data; cont_names is for test data

When ensembling 5 models for binary classification, which aggregation strategy is generally best?

Convert each model's output to 0/1 first, then take the mode (majority vote)

Average the raw probabilities from all 5 models, then apply a 0.5 threshold

Always use the prediction from whichever single model had the best validation loss

Take the highest probability among the 5 models and predict 1 if it exceeds 0.5

Summary

Key Takeaways

Concepts to remember

Mode imputation — simplest way to handle missing values; works for all column types
Log transform — compresses long-tailed distributions like prices or populations
One-hot encoding — converts text categories to 0/1 numbers a model can use
Normalisation — puts all features on a comparable scale (0 to 1)
Sigmoid: always use for binary classification output; compresses to (0, 1)
ReLU — non-linearity in hidden layers; enables curved decision boundaries
Ensembles — average multiple models to cancel individual errors

The progression we followed

① Raw data → Preprocessed (clean, numeric, scaled)

② Coefficients → Loss → Gradients → Update

③ Linear model + Sigmoid → 82.3% accuracy

④ Add hidden layer + ReLU → Neural Network

⑤ FastAI automates all of the above

⑥ Ensemble × 5 → Top 25% on Kaggle

Looking Ahead

Next Steps

This week's workshop

Implement the full preprocessing pipeline from scratch
Build and train the linear model with sigmoid
Add a hidden layer and verify accuracy improves
Reproduce using FastAI's tabular_learner
Submit to Kaggle and record your leaderboard position

Before the workshop: Make sure you have a Kaggle account and have accepted the Titanic competition rules — you'll need this to submit.

Coming in Week 7

Convolutional Neural Networks (CNNs) for image data
Transfer learning — using pre-trained models
How FastAI manages fine-tuning automatically
Data augmentation to improve generalisation

Self-study: Read Chapter 9 of the FastAI book (Tabular Modelling). Pay attention to entity embeddings — a more powerful alternative to one-hot encoding for high-cardinality categories.

Exam relevance: Gradient descent, sigmoid, loss vs. accuracy distinction, and activation functions are all mid-term and final exam topics (SLO1, SLO2).
