CP3501 · Deep Learning · Week 6

From Data to Neural Networks

Building and understanding tabular deep learning models — from scratch to framework

Data Preprocessing · Linear Model · Gradient Descent · Neural Network · FastAI Framework

Subject: CP3501 / CP5701
Dataset: Titanic (Kaggle)
Tools: PyTorch · FastAI · Pandas
Outcome: SLO1 · SLO2 · SLO3

Lesson Overview

What We Are Building Today

By the end of this lesson, you will have implemented a complete binary classifier — twice: once manually, once with FastAI.

1. Preprocess Titanic Data
2. Linear Model in PyTorch
3. Gradient Descent Loop
4. Add Neural Network Layers
5. Redo in FastAI

What you'll build

A model that predicts whether a Titanic passenger survived — achieving 82%+ accuracy

Key concept

Build from scratch first so you understand what frameworks do automatically

Why Titanic?

Real data, mixed types, missing values, categorical features — everything you'll face in practice

Part 1 of 4

Data Preprocessing

Before any model can learn, data must be clean, numerical, and scaled. This is where most real-world time is spent.

Part 1 · Preprocessing

The Titanic Dataset

Each row is one passenger. The goal is to predict the Survived column (0 = died, 1 = survived).

Column     Type                    Example
Survived   Binary target           0 or 1
Pclass     Categorical (numeric)   1, 2, 3
Sex        Categorical (text)      male / female
Age        Continuous              24.0
Fare       Continuous, skewed      7.25 to 512
Embarked   Categorical (text)      C, Q, S
Cabin      Mostly missing          B96 B98

Four preprocessing challenges:

① Missing values in Age, Cabin, Embarked
② Skewed distribution in Fare
③ Text categories can't multiply
④ Columns at different scales

Load the data:

import pandas as pd
df = pd.read_csv('titanic.csv')
df.head()

Step 1 — Handle Missing Values

Finding missing values

# Boolean mask: True where value is missing
df.isna().sum()

# Result:
# Age      177  ← some missing
# Cabin    687  ← mostly empty
# Embarked   2  ← nearly complete

Key insight: In pandas/numpy, True = 1 and False = 0, so .sum() counts the missing values per column.

Mode imputation — the simple fix

# Replace missing with most common value
mode_vals = df.mode().iloc[0]
df.fillna(mode_vals, inplace=True)

# What gets filled in:
# Age      → 24
# Cabin    → B96 B98
# Embarked → S

Philosophy: Start with the simplest method that always works. Mode imputation works for both numeric and categorical columns. Build complexity later.
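To see both steps on something small enough to check by hand, here is a minimal sketch on a toy DataFrame. The data is made up for illustration (it is not the Titanic file), and the names toy, missing_before and filled are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic file
toy = pd.DataFrame({
    "Age": [22.0, None, 24.0, 24.0],
    "Embarked": ["S", "C", None, "S"],
})

# isna() gives a boolean mask; summing it counts the True values per column
missing_before = toy.isna().sum()

# mode() returns the most frequent value(s) per column; row 0 is the first mode
mode_vals = toy.mode().iloc[0]
filled = toy.fillna(mode_vals)

print(missing_before.to_dict())
print(filled)
```

Note that fillna returns a new DataFrame here instead of modifying toy in place, which makes it easy to compare before and after.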


Step 2 — Fix Skewed Distributions

The Fare problem

Most passengers paid <£50, but a few paid up to £512. This long tail causes problems in linear models — the model spends too much effort fitting a handful of extreme values.

[Figure: raw Fare distribution. Most passengers paid £5–50, some £50–150, few £150–300, and £300+ was rare.]

Log transformation — the fix

import numpy as np

# +1 to handle fares of £0
df['LogFare'] = np.log(df['Fare'] + 1)

# Equivalent shorthand:
df['LogFare'] = np.log1p(df['Fare'])

[Figure: Fare distribution after the log transform, more balanced.]

Rule of thumb: Money, population, and any data that spans several orders of magnitude usually needs a log transform.
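A quick way to see the compression is to compare ratios before and after the transform. The sketch below uses only the standard library and a few illustrative fare values (assumptions, not rows from the dataset).

```python
import math

fares = [0.0, 7.25, 50.0, 512.0]            # illustrative values only
log_fares = [math.log1p(f) for f in fares]  # log1p(x) = log(x + 1)

# How much larger is the top fare than a typical cheap fare?
raw_ratio = fares[3] / fares[1]
log_ratio = log_fares[3] / log_fares[1]

for f, lf in zip(fares, log_fares):
    print(f"{f:7.2f} -> {lf:.2f}")
print(f"raw ratio: {raw_ratio:.1f}, log ratio: {log_ratio:.1f}")
```

On the raw scale the largest fare is about 70 times the cheap one; after the transform the gap shrinks to roughly 3 times, while the ordering of fares is preserved.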


Step 3 — Convert Categories to Numbers

A neural network multiplies values by coefficients. You cannot multiply "male" by a number — so we convert categories to binary (0/1) columns.

The problem

Sex      Embarked
male     S
female   C
male     Q

These cannot be used in a linear equation as-is.

One-hot encoding (dummy variables)

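Sex_male  Sex_female  Emb_S  Emb_C  Emb_Q
1         0           1      0      0
0         1           0      1      0
1         0           0      0      1

# pandas does all of this in one line.
# Pclass is stored as a number, so name the columns explicitly to encode it too:
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

# Creates: Sex_female, Sex_male, Pclass_1, Pclass_2, Pclass_3, Embarked_C, Embarked_Q, Embarked_S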

Strategy used here: Keep all levels (n encoding, no dropped column). Each row has exactly one True per category group. No constant term needed.
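The "exactly one per group" property can be checked directly. This is a small sketch on made-up rows; the toy frame and the dtype=float argument (used so the dummies come out as 0.0/1.0 instead of booleans) are choices made for illustration.

```python
import pandas as pd

# Hypothetical three-passenger frame for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})

# Encode only the named columns; Age passes through untouched
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=float)

# Every row should have exactly one 1 within each category group
sex_cols = [c for c in encoded.columns if c.startswith("Sex_")]
emb_cols = [c for c in encoded.columns if c.startswith("Embarked_")]

print(list(encoded.columns))
print(encoded[sex_cols].sum(axis=1).tolist())
```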

Knowledge Checkpoint Quiz

Check Your Understanding — Preprocessing

After one-hot encoding the Sex column (values: male, female), how many new columns are created using full n-encoding?

1 column — Sex_male only (drop one level to avoid multicollinearity)
2 columns — Sex_male and Sex_female (keep all levels)
3 columns — one for each category plus an "other" column
The Sex column is dropped entirely — it causes bias

You have a Fare column with values ranging from £0 to £512. Why should you apply a log transformation before training?

Log makes all values exactly equal, so training is fair
It converts Fare from continuous to categorical
It compresses the long tail so extreme values don't dominate the model
Log transformation removes missing values automatically
Part 2 of 4

Building a Linear Model from Scratch

We'll implement every step manually in PyTorch — tensors, normalization, predictions, loss, and gradient descent. No shortcuts yet.

Part 2 · Linear Model

Tensors — The Foundation

Why PyTorch instead of NumPy?

PyTorch tensors do everything NumPy arrays do, plus:

  • Automatic differentiation — calculates gradients for you
  • GPU acceleration — same code runs on GPU
  • One library instead of two

Creating tensors from data

import torch

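# Target variable (what we're predicting)
y = torch.tensor(df['Survived'].values)

# Input features. feat_cols is the list of the 12 numeric and dummy column names.
# astype(float) guards against boolean dummy columns in recent pandas versions.
X = torch.tensor(df[feat_cols].values.astype(float),
                  dtype=torch.float)
# X.shape → (891, 12)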

Tensor ranks explained

Rank 0 (scalar): 42, shape ()
Rank 1 (vector): [0.3, 0.7, 0.1], shape (3,)
Rank 2 (matrix): shape (2, 3); our X is 891 rows × 12 columns
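The rank of a tensor is simply the number of axes, which PyTorch exposes as ndim. A minimal sketch with hand-made tensors (the variable names are only for illustration):

```python
import torch

scalar = torch.tensor(42.0)               # rank 0: a single number
vector = torch.tensor([0.3, 0.7, 0.1])    # rank 1: one axis
matrix = torch.zeros(2, 3)                # rank 2: rows and columns

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, "rank:", t.ndim, "shape:", tuple(t.shape))
```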

Step 4 — Normalise Features

Raw features live on very different scales. If Age ranges 0–80 and Sex_male is 0 or 1, the coefficients are not comparable.

The scaling problem

[Figure: feature ranges before normalisation. Age: 0–80, Sex: 0–1, Fare: 0–512.]

Max normalisation — divides by max value

# Max value per column (across rows)
max_vals = X.max(dim=0).values

# Divide each column by its max
X_norm = X / max_vals
# Now all features range from 0.0 to 1.0

Other options: Z-score (X - mean) / std or Min-Max scaling. For most tabular problems, the choice matters less than ensuring you do normalise.
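For comparison, the three scaling options can be written side by side. This is a sketch on a tiny hand-made matrix (three rows standing in for Age, Sex_male and Fare), not the lesson's actual X.

```python
import torch

# Illustrative values only: columns stand in for Age, Sex_male, Fare
X = torch.tensor([[22.0, 1.0,   7.25],
                  [38.0, 0.0,  71.28],
                  [80.0, 1.0, 512.00]])

# Max normalisation: divide each column by its maximum
max_scaled = X / X.max(dim=0).values

# Min-Max scaling: shift each column to start at 0, then divide by its range
mins = X.min(dim=0).values
minmax = (X - mins) / (X.max(dim=0).values - mins)

# Z-score: subtract the column mean, divide by the column standard deviation
zscore = (X - X.mean(dim=0)) / X.std(dim=0)

print(max_scaled)
print(minmax)
print(zscore)
```

Max and Min-Max scaling give the same result whenever a column's minimum is 0; Z-score instead centres each column on 0 and allows negative values.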


Making Predictions — Matrix Multiplication

The linear model equation

# Random initial coefficients, one per feature column
torch.manual_seed(42)
n_coef = X_norm.shape[1]
coeffs = torch.rand(n_coef) - 0.5

# Make predictions
preds = (X_norm * coeffs).sum(dim=1)

# Equivalent — matrix multiply:
preds = X_norm @ coeffs

* is element-wise multiply
@ is matrix multiply
Both give the same result here — but @ is cleaner and scales to neural networks.
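The equivalence is easy to confirm numerically. A small sketch with random stand-in data (4 rows, 3 features; the sizes are arbitrary):

```python
import torch

torch.manual_seed(0)
X = torch.rand(4, 3)            # 4 rows, 3 features (stand-in data)
coeffs = torch.rand(3) - 0.5    # one coefficient per feature

# Element-wise: coeffs is broadcast across every row, then each row is summed
elementwise = (X * coeffs).sum(dim=1)

# Matrix-vector product: the same multiply-and-sum in one operation
matmul = X @ coeffs

print(elementwise.shape, matmul.shape)
print(torch.allclose(elementwise, matmul))
```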

Broadcasting visualised

[Figure: broadcasting. X (3 rows × 3 columns) is multiplied by coeffs (c₁, c₂, c₃); the same coefficients are applied to every row, giving 3 predictions, one per row (e.g. 0.62, 0.21, 0.88).]

Gradient Descent — How the Model Learns

The intuition — rolling downhill

[Figure: loss (how wrong we are) plotted against a coefficient value (what we adjust); gradient steps move from the start point down to the minimum.]

Think of loss as a hill. The gradient tells us which direction is uphill. We take a step in the opposite direction.

The four-step training loop

1. Enable gradients: tell PyTorch to track operations on coefficients
2. Forward pass: compute predictions and calculate loss (MAE)
3. Backward pass: PyTorch computes gradients automatically
4. Update: take a small step against the gradient (subtract lr × gradient)

# The update equation
coeffs -= lr * coeffs.grad

# Why subtract?
# Gradient = direction of increasing loss
# We want LESS loss → go opposite way
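The update rule can be tried without PyTorch on a one-variable problem. The sketch below assumes a made-up loss, (w - 3)², whose gradient is known by hand; it is an illustration of the rule, not the Titanic model.

```python
def loss(w):
    return (w - 3.0) ** 2      # minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)     # derivative of the loss with respect to w

lr = 0.1
w = 0.0
history = []
for step in range(50):
    history.append(loss(w))
    w -= lr * grad(w)          # step against the gradient

# For contrast: one step in the gradient direction makes things worse
w_wrong = 0.0 + lr * grad(0.0)

print(round(w, 4), history[0], history[-1])
print(loss(w_wrong) > loss(0.0))
```

Subtracting moves w steadily towards 3; adding moves it away, which is why the sign in the update matters.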

The Complete Training Loop

def calc_preds(coeffs, X):
    return (X * coeffs).sum(dim=1)

def calc_loss(coeffs, X, y):
    preds = calc_preds(coeffs, X)
    return torch.abs(preds - y).mean()

def train(epochs=30, lr=0.1):
    coeffs = torch.rand(n_coef) - 0.5
    coeffs.requires_grad_()          # ① enable

    for epoch in range(epochs):
        loss = calc_loss(coeffs, X_tr, y_tr)  # ② forward
        loss.backward()                         # ③ gradients

        with torch.no_grad():
            coeffs -= lr * coeffs.grad             # ④ update
            coeffs.grad.zero_()                 # clear grads

    return coeffs

Why zero the gradients? PyTorch accumulates gradients by default. If you don't zero them, each step adds to the previous one — giving wrong updates.
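The accumulation behaviour can be observed directly. In this minimal sketch the loss is a made-up function, the sum of w², chosen because its gradient (2w) is easy to verify by hand.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_fn(w):
    return (w ** 2).sum()          # gradient is 2 * w

loss_fn(w).backward()
first = w.grad.clone()             # gradient after one backward pass

loss_fn(w).backward()              # no zeroing: the new gradient is added on
accumulated = w.grad.clone()

w.grad.zero_()                     # clear, as the training loop does
loss_fn(w).backward()
after_zero = w.grad.clone()

print(first, accumulated, after_zero)
```

The second backward pass doubles the stored gradient even though w has not changed; only after zero_() does the gradient match the true value again.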

Loss over training:

[Figure: loss vs. epochs, falling from 0.53 to about 0.30.]

Result: Loss drops from 0.53 → ~0.30. Accuracy ~79%.


The Sigmoid Function — Essential for Binary Classification

The problem with raw predictions

Without sigmoid, our model outputs values like -0.3 or 1.7. But survival probability must be between 0 and 1.

[Figure: raw predictions such as -0.3 and 1.7 fall outside the 0 to 1 range, which is impossible for a probability.]

The sigmoid formula

σ(x) = 1 / (1 + e^{-x})

Sigmoid squishes everything to [0, 1]

[Figure: sigmoid curve passing through 0.5 at x = 0. Very negative x → output ≈ 0; very positive x → output ≈ 1.]
# Apply sigmoid to predictions
def calc_preds(coeffs, X):
    return torch.sigmoid(X @ coeffs)

# Result: accuracy 0.79 → 0.823
# Learning rate: 0.1 → 2.0 (easier!)

Rule: For binary classification (0 or 1 target), always end your model with sigmoid. Missing it is the #1 beginner mistake.
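To make the formula concrete, here is a standard-library sketch that applies it to the two problematic raw outputs mentioned earlier (-0.3 and 1.7), plus 0 as a reference point.

```python
import math

def sigmoid(x):
    # 1 / (1 + e^(-x)): maps any real number into the range between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

raw = [-0.3, 0.0, 1.7]
probs = [sigmoid(v) for v in raw]

for v, p in zip(raw, probs):
    print(f"{v:5.1f} -> {p:.3f}")
```

A raw output of 0 maps to exactly 0.5, so thresholding the probability at 0.5 is the same as thresholding the raw output at 0.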


Interpreting Coefficients — Do They Make Sense?

After training, inspect the learned coefficients. They should align with historical knowledge. If they don't — something is wrong.

# View coefficients with feature names
dict(zip(feat_cols, coeffs.detach().tolist()))

Feature      Coefficient   Interpretation
Age          Negative      Older → less likely to survive
Sex_male     Negative      Males had lower survival
Pclass_1     Positive      1st class → more likely to survive
Fare (log)   Positive      Higher fare → better odds

These results match history: "Women and children first" — the famous Titanic evacuation order. Negative coefficients for Age and Sex_male confirm our model has learned something real.

Accuracy metric vs. loss:

We train using MAE loss (differentiable, has gradients). We report accuracy (what Kaggle cares about). Both are needed — for different purposes.

# Calculate accuracy
preds_binary = (preds > 0.5).float()
acc = (preds_binary == y).float().mean()
# → 82.3% on validation set
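The reason for training on loss and reporting accuracy shows up in a tiny example. The predictions below are made up: the second set is clearly more confident, yet accuracy cannot tell the two apart.

```python
y = [1, 0, 1, 0]

def mae(preds, targets):
    # Mean absolute error: changes smoothly as predictions move
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def accuracy(preds, targets):
    # Fraction of rows on the correct side of the 0.5 threshold
    return sum((p > 0.5) == (t == 1) for p, t in zip(preds, targets)) / len(targets)

hesitant = [0.6, 0.4, 0.6, 0.4]     # barely on the right side
confident = [0.9, 0.1, 0.9, 0.1]    # clearly on the right side

print(accuracy(hesitant, y), accuracy(confident, y))
print(mae(hesitant, y), mae(confident, y))
```

Accuracy only changes when a prediction crosses the threshold, so it gives gradient descent nothing to follow; the loss improves with every small move in the right direction.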
Knowledge Checkpoint Quiz

Check Your Understanding — Linear Models

In gradient descent, after computing loss.backward(), the update step is: coeffs -= lr * coeffs.grad. Why do we subtract rather than add?

Because subtraction is computationally faster than addition in PyTorch
The gradient points toward increasing loss; we subtract to move in the direction of decreasing loss
It's a convention — addition would work equally well
Because negative coefficients represent survival and positive ones represent death

You build a binary classifier but forget to add sigmoid at the final layer. What symptom are you most likely to observe?

The model trains perfectly but is very slow
An error is thrown because PyTorch requires sigmoid
Predictions go outside [0,1] and the model trains poorly or fails to converge
Accuracy immediately reaches 100% because the model is unconstrained
Part 3 of 4

Building a Neural Network

The linear model draws a straight decision boundary. Real survival depended on complex interactions — young women in 3rd class had different odds than old men in 1st class. We need curves, not lines.

Part 3 · Neural Network

Why Go Beyond a Linear Model?

Linear model limitation

[Figure: a linear model draws one straight boundary, leaving some points misclassified.]

Neural network: curved boundaries

[Figure: a neural net draws a flexible boundary.]

How do we get curves? By stacking linear layers with a non-linear activation function (ReLU) in between. ReLU is what gives neural networks their expressive power.


Neural Network Architecture

Our architecture

  • Input layer: 12 features from Titanic data
  • Hidden layer: 20 activations with ReLU
  • Output layer: 1 survival prediction with Sigmoid

Weight matrices:
W1: (12 × 20) = 240 params
W2: (20 × 1) + bias = 21 params
Total: 261 parameters
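The parameter count is plain arithmetic on the layer sizes, which the following sketch reproduces (variable names are illustrative):

```python
n_in, n_hidden, n_out = 12, 20, 1

w1_params = n_in * n_hidden        # W1: one weight per (input, hidden) pair
w2_params = n_hidden * n_out       # W2: one weight per (hidden, output) pair
bias_params = n_out                # single constant term on the output layer

total = w1_params + w2_params + bias_params
print(w1_params, w2_params + bias_params, total)   # 240 21 261
```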

[Figure: Input (12 features) → W1 (12×20) → Hidden (20 activations, ReLU) → W2 (20×1) → Output (1, Sigmoid) → 0.82, "likely survived".]

ReLU — The Non-linearity That Makes It Work

What is ReLU?

Rectified Linear Unit — the simplest useful activation function:

ReLU(x) = max(0, x)

Negative → 0   |   Positive → unchanged

[Figure: ReLU plot. Negative x → 0; positive x → x.]

The full forward pass

def calc_preds(coeffs, X):
    l1, l2, const = coeffs

    # Layer 1: input → hidden
    h1 = (X @ l1).relu()

    # Layer 2: hidden → output
    out = (h1 @ l2) + const

    # Final activation (binary); squeeze (N, 1) to (N,) so it matches y
    return out.sigmoid().squeeze(1)

Activation function rules:
— Hidden layers → ReLU
— Final layer for binary classification → Sigmoid
— Final layer for multi-class → Softmax
— Final layer for regression → None
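It helps to trace the shapes through the forward pass once. The sketch below uses random stand-in inputs and small random weights (assumptions for illustration, not trained values) and also shows a shape pitfall: an (N, 1) output subtracted from an (N,) target silently broadcasts to (N, N).

```python
import torch

torch.manual_seed(0)
n_rows, n_coef, n_hidden = 5, 12, 20

X = torch.rand(n_rows, n_coef)                        # stand-in inputs
l1 = (torch.rand(n_coef, n_hidden) - 0.5) / n_hidden  # illustrative weights
l2 = torch.rand(n_hidden, 1) - 0.3
const = torch.rand(1)

h1 = (X @ l1).relu()          # (5, 12) @ (12, 20) -> (5, 20), negatives set to 0
out = h1 @ l2 + const         # (5, 20) @ (20, 1)  -> (5, 1)
probs = out.sigmoid()         # still (5, 1), every value between 0 and 1

y = torch.zeros(n_rows)       # target with shape (5,)
bad_diff = probs - y                  # broadcasts to (5, 5): not what we want
good_diff = probs.squeeze(1) - y      # (5,): one difference per row

print(h1.shape, probs.shape, bad_diff.shape, good_diff.shape)
```

Either squeezing the prediction or turning the target into a column vector with y[:, None] keeps the loss a per-row comparison.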


Initialising Neural Network Weights

Unlike the linear model, neural networks need careful weight initialisation to avoid gradients that are too large or too small.

# Layer 1: input (n_coef) → hidden (20)
n_hidden = 20
l1 = (torch.randn(n_coef, n_hidden) - 0.5) / n_hidden

# Layer 2: hidden (20) → output (1)
l2 = torch.randn(n_hidden, 1) - 0.3

# Constant (bias) term for layer 2
const = torch.randn(1) - 0.3

# Enable gradients for all parameters
for p in [l1, l2, const]:
    p.requires_grad_()

Why divide by n_hidden?

Layer 2 adds up 20 hidden activations, so its output can be roughly 20 times larger than any single activation. Dividing the layer 1 weights by n_hidden keeps the output on a scale similar to the linear model's, so gradients don't explode on the first step.

Why a constant term for layer 2?

Layer 1 doesn't need a constant because one-hot encoding already covers all levels (the dummy variables sum to 1). Layer 2 does need one — without it, the output is forced through the origin.

Knowledge Checkpoint Quiz

Check Your Understanding — Neural Networks

In our 2-layer neural network (Input → Hidden → Output), what activation function should be used after the hidden layer, and what should be used at the output layer for binary classification?

Hidden: Sigmoid   |   Output: ReLU
Hidden: ReLU   |   Output: Sigmoid
Hidden: ReLU   |   Output: Softmax
Hidden: None   |   Output: ReLU

If you add a second hidden layer with 10 units after the first hidden layer of 20 units, what should be the shape of the new weight matrix connecting them?

(12 × 10) — same as the input layer shape
(10 × 20) — reverse of the first layer
(20 × 10) — takes 20 inputs from hidden layer 1, outputs 10 activations
(20 × 1) — always ends with a single output
Part 4 of 4

FastAI — The Same Model in Far Less Code

Now that you understand every step, let FastAI automate the boilerplate. The concepts are identical — the framework just handles preprocessing, initialisation, and training loops for you.

Part 4 · FastAI

Manual Implementation vs. FastAI — Side by Side

Manual PyTorch

# Missing values (fillna returns a copy, so assign the result)
df = df.fillna(df.mode().iloc[0])

# Dummy variables
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

# Normalise
X = X / X.max(dim=0).values

# Init weights manually
l1 = torch.randn(n_coef, 20) / 20
l2 = torch.randn(20, 1)

# Training loop — ~20 lines

FastAI equivalent

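from fastai.tabular.all import *

# All preprocessing in one call:
dls = TabularDataLoaders.from_df(
    df,
    procs=[Categorify, FillMissing, Normalize],  # preprocessing steps to apply
    cat_names=['Pclass', 'Sex'],   # ← treated as categories
    cont_names=['Age', 'LogFare'], # ← normalised
    y_names='Survived',
    y_block=CategoryBlock(),       # treat the 0/1 target as a class label
    valid_idx=valid_idx
)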

# Model + training:
learn = tabular_learner(dls,
          layers=[10, 10],
          metrics=accuracy)

FastAI handles automatically: categorical encoding (Categorify), missing values (FillMissing), normalisation (Normalize), weight initialisation, gradient zeroing, learning rate scheduling, and the training loop.


Finding the Right Learning Rate

Choosing a learning rate manually is guesswork. FastAI includes a learning rate finder that tests many values systematically.

# Run the learning rate finder
learn.lr_find()

How it works

1. Starts with a very small learning rate (1e-7)
2. Gradually increases it across mini-batches
3. Tracks the loss at each step
4. Plots loss vs. learning rate; you pick from the curve

Reading the lr_find plot

[Figure: loss vs. learning rate. Too small on the left, too large on the right, sweet spot in between; pick here (0.03).]

Rule of thumb: Pick a learning rate where the loss is still falling steeply, a little before the lowest point of the curve. Past that point the loss climbs again and training becomes unstable.


Ensemble Methods — Easy Accuracy Boost

The idea

Train several models with the same architecture independently. Each starts with different random weights → learns slightly differently → makes different errors. Combining them cancels individual errors.

def train_one():
    learn = tabular_learner(dls, layers=[10,10])
    learn.fit(10, 0.03)
    # get_preds returns (predictions, targets); keep only the predictions
    return learn.get_preds(dl=test_dl)[0]

# Train 5 independent models
all_preds = [train_one() for _ in range(5)]

# Average their probability predictions
final = torch.stack(all_preds).mean(0)

Results on Kaggle

Approach                Kaggle Rank
Single linear model     ~50th percentile
Single neural network   ~50th percentile
Ensemble of 5 NNs       ~25th percentile

Key insight: Small code changes can yield substantial performance improvements. Ensembling is one of the highest return-on-effort techniques in practice.

How to aggregate predictions?

Average the probabilities (not the binary 0/1 values), then apply a threshold of 0.5. This retains more information than voting.
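A small made-up case shows how the two aggregation rules can disagree. Here three hypothetical models score one passenger; two are mildly negative and one is strongly positive.

```python
# Hypothetical survival probabilities from three models for one passenger
probs = [0.45, 0.45, 0.90]

# Option A: average the probabilities, then threshold
mean_prob = sum(probs) / len(probs)
avg_pred = int(mean_prob > 0.5)

# Option B: threshold each model first, then take the majority vote
votes = [int(p > 0.5) for p in probs]
vote_pred = int(sum(votes) > len(votes) / 2)

print(mean_prob, avg_pred, votes, vote_pred)
```

Voting throws away how confident each model was, so the strong 0.90 counts the same as a 0.51; averaging lets it outweigh two near-coin-flip predictions.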

Knowledge Checkpoint Quiz

Check Your Understanding — FastAI & Ensembles

In the FastAI TabularDataLoaders.from_df() call, what is the difference between cat_names and cont_names?

cat_names are columns you care about; cont_names are continuous columns you'll drop
cat_names are categorical columns (FastAI applies embeddings/dummy vars); cont_names are numeric columns (FastAI normalises these)
Both do the same thing — they're just different parameter names for historical reasons
cat_names is for training data; cont_names is for test data

When ensembling 5 models for binary classification, which aggregation strategy is generally best?

Convert each model's output to 0/1 first, then take the mode (majority vote)
Average the raw probabilities from all 5 models, then apply a 0.5 threshold
Always use the prediction from whichever single model had the best validation loss
Sum the probabilities and predict 1 if the sum exceeds 2.5 (half of 5)
Summary

Key Takeaways

Concepts to remember

  • Mode imputation — simplest way to handle missing values; works for all column types
  • Log transform — compresses long-tailed distributions like prices or populations
  • One-hot encoding — converts text categories to 0/1 numbers a model can use
  • Normalisation — puts all features on a comparable scale (0 to 1)
  • Sigmoid — always use for binary classification output; compresses to [0, 1]
  • ReLU — non-linearity in hidden layers; enables curved decision boundaries
  • Ensembles — average multiple models to cancel individual errors

The progression we followed

① Raw data → Preprocessed (clean, numeric, scaled)
② Coefficients → Loss → Gradients → Update
③ Linear model + Sigmoid → 82.3% accuracy
④ Add hidden layer + ReLU → Neural Network
⑤ FastAI automates all of the above
⑥ Ensemble × 5 → Top 25% on Kaggle
Looking Ahead

Next Steps

This week's workshop

  • Implement the full preprocessing pipeline from scratch
  • Build and train the linear model with sigmoid
  • Add a hidden layer and verify accuracy improves
  • Reproduce using FastAI's tabular_learner
  • Submit to Kaggle and record your leaderboard position

Before the workshop: Make sure you have a Kaggle account and have accepted the Titanic competition rules — you'll need this to submit.

Coming in Week 7

  • Convolutional Neural Networks (CNNs) for image data
  • Transfer learning — using pre-trained models
  • How FastAI manages fine-tuning automatically
  • Data augmentation to improve generalisation

Self-study: Read Chapter 9 of the FastAI book (Tabular Modelling). Pay attention to entity embeddings — a more powerful alternative to one-hot encoding for high-cardinality categories.

Exam relevance: Gradient descent, sigmoid, loss vs. accuracy distinction, and activation functions are all mid-term and final exam topics (SLO1, SLO2).
