Building and understanding tabular deep learning models — from scratch to framework
By the end of this lesson, you will have implemented a complete binary classifier — twice: once manually, once with FastAI.
A model that predicts whether a Titanic passenger survived — achieving 82%+ accuracy
Build from scratch first so you understand what frameworks do automatically
Real data, mixed types, missing values, categorical features — everything you'll face in practice
Before any model can learn, data must be clean, numerical, and scaled. This is where most real-world time is spent.
Each row is one passenger. The goal is to predict the Survived column (0 = died, 1 = survived).
| Column | Type | Example |
|---|---|---|
| Survived | Binary target | 0 or 1 |
| Pclass | Categorical (numeric) | 1, 2, 3 |
| Sex | Categorical (text) | male / female |
| Age | Continuous | 24.0 |
| Fare | Continuous, skewed | 7.25 to 512 |
| Embarked | Categorical (text) | C, Q, S |
| Cabin | Mostly missing | B96 B98 |
Four preprocessing challenges:
① Missing values in Age, Cabin, Embarked
② Skewed distribution in Fare
③ Text categories can't be multiplied by coefficients
④ Columns at different scales
Load the data:
    import pandas as pd

    df = pd.read_csv('titanic.csv')
    df.head()
    # Boolean mask: True where a value is missing
    df.isna().sum()

    # Result:
    # Age       177   ← some missing
    # Cabin     687   ← mostly empty
    # Embarked    2   ← nearly complete
Key insight: In pandas/numpy, True = 1 and False = 0, so .sum() counts the missing values per column.
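The counting trick can be checked on a tiny frame. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical, not the real Titanic data); it also shows that the mean of the same mask gives the fraction missing.

```python
import pandas as pd

# Made-up three-row frame, not the real Titanic data
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Embarked": ["S", "C", None],
    "Fare": [7.25, 71.28, 8.05],
})

mask = toy.isna()            # DataFrame of True/False, same shape as toy
missing_counts = mask.sum()  # True counts as 1, so this counts missing cells per column
missing_share = mask.mean()  # the mean of a boolean column is the fraction missing

print(missing_counts)
print(missing_share)
```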
    # Replace missing values with the most common value in each column
    mode_vals = df.mode().iloc[0]
    df.fillna(mode_vals, inplace=True)

    # What gets filled in:
    # Age      → 24
    # Cabin    → B96 B98
    # Embarked → S
Philosophy: Start with the simplest method that always works. Mode imputation works for both numeric and categorical columns. Build complexity later.
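To see why the same call works for numbers and text, here is a small sketch on a hypothetical four-row frame. The `.iloc[0]` is there because `mode()` returns a DataFrame: a column can have several tied modes, and this takes the first one.

```python
import pandas as pd

# Made-up rows: one numeric column and one text column, each with a gap
toy = pd.DataFrame({
    "Age": [24.0, 24.0, None, 40.0],
    "Embarked": ["S", None, "S", "C"],
})

# mode() ignores missing values and returns one row per tied mode;
# iloc[0] keeps the first mode of every column as a Series
mode_vals = toy.mode().iloc[0]

# fillna matches the Series index against the column names
filled = toy.fillna(mode_vals)

print(mode_vals)
print(filled)
```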
Most passengers paid <£50, but a few paid up to £512. This long tail causes problems in linear models — the model spends too much effort fitting a handful of extreme values.
    import numpy as np

    # +1 to handle fares of £0, since log(0) is undefined
    df['LogFare'] = np.log(df['Fare'] + 1)

    # Equivalent shorthand:
    df['LogFare'] = np.log1p(df['Fare'])
Rule of thumb: Money, population, and any data that spans several orders of magnitude usually needs a log transform.
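A quick way to see what the transform buys you is to compare how far the largest value sits from a typical one before and after. The four fares below are illustrative values chosen to span the range quoted above, not rows from the dataset.

```python
import numpy as np

fares = np.array([0.0, 7.25, 30.0, 512.0])  # illustrative fares in pounds

log_fares = np.log1p(fares)  # log(1 + fare); a fare of 0 maps to exactly 0

# How many times larger is the top fare than a mid-range fare?
raw_ratio = fares.max() / fares[2]
log_ratio = log_fares.max() / log_fares[2]

print(raw_ratio, log_ratio)
```

On the raw scale the top fare is about 17 times the mid-range one; after the transform it is less than twice as large, so it no longer dominates the fit.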
A neural network multiplies values by coefficients. You cannot multiply "male" by a number — so we convert categories to binary (0/1) columns.
| Sex | Embarked |
|---|---|
| male | S |
| female | C |
| male | Q |
These cannot be used in a linear equation as-is.
| Sex_male | Sex_female | Emb_S | Emb_C | Emb_Q |
|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 |
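    # pandas does all of this in one call.
    # Name the columns explicitly: by default get_dummies only encodes text
    # columns, so the numeric Pclass would otherwise be left unchanged.
    df_encoded = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

    # Creates: Sex_female, Sex_male, Pclass_1, Pclass_2, Pclass_3,
    #          Embarked_C, Embarked_Q, Embarked_S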
Strategy used here: Keep all levels (n encoding, no dropped column). Each row has exactly one True per category group. No constant term needed.
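The "exactly one True per category group" property is easy to verify. The sketch below rebuilds the three example rows from the tables above (the frame name `toy` is just for illustration) and sums each group of dummy columns row by row.

```python
import pandas as pd

# The three example rows from the tables above
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())

# Within each group, exactly one column is set in every row
sex_total = encoded[["Sex_female", "Sex_male"]].sum(axis=1)
emb_total = encoded[["Embarked_C", "Embarked_Q", "Embarked_S"]].sum(axis=1)
print(sex_total.tolist(), emb_total.tolist())
```

Because each group always sums to 1, the group as a whole already acts like a constant column, which is why no separate constant term is added.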
After one-hot encoding the Sex column (values: male, female), how many new columns are created using full n-encoding?
You have a Fare column with values ranging from £0 to £512. Why should you apply a log transformation before training?
We'll implement every step manually in PyTorch — tensors, normalization, predictions, loss, and gradient descent. No shortcuts yet.
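PyTorch tensors work much like NumPy arrays, with two additions: they can run on a GPU, and they can track operations so that gradients are computed automatically.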
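    import torch

    # Target variable (what we're predicting)
    y = torch.tensor(df['Survived'].values, dtype=torch.float)

    # Input features (feat_cols is the list of 12 feature column names).
    # astype(float) turns True/False dummy columns into 1.0/0.0 first.
    X = torch.tensor(df[feat_cols].values.astype(float), dtype=torch.float)

    # X.shape → (891, 12)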
Raw features live on very different scales. If Age ranges 0–80 and Sex_male is 0 or 1, the coefficients are not comparable.
    # Max value per column (taken across rows)
    max_vals = X.max(dim=0).values

    # Divide each column by its max
    X_norm = X / max_vals

    # Every feature now lies between 0.0 and 1.0
    # (this relies on all features being non-negative)
Other options: Z-score (X - mean) / std or Min-Max scaling. For most tabular problems, the choice matters less than ensuring you do normalise.
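For comparison, the two alternatives can be sketched in a few lines. The feature matrix here is a made-up 3×2 example (an age column and a 0/1 dummy), not the lesson's `X`.

```python
import torch

# Toy feature matrix: column 0 is an age, column 1 is a 0/1 dummy
X = torch.tensor([[20.0, 1.0],
                  [40.0, 0.0],
                  [80.0, 1.0]])

# Z-score: each column ends up with mean 0 and standard deviation 1
X_z = (X - X.mean(dim=0)) / X.std(dim=0)

# Min-max: each column ends up spanning exactly 0 to 1
col_min = X.min(dim=0).values
col_max = X.max(dim=0).values
X_mm = (X - col_min) / (col_max - col_min)

print(X_z)
print(X_mm)
```

Both versions divide by a per-column spread, so a column that holds a single constant value would cause a division by zero and needs to be dropped or handled separately.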
    # Random initial coefficients in [-0.5, 0.5), one per feature
    torch.manual_seed(42)
    n_coef = X_norm.shape[1]
    coeffs = torch.rand(n_coef) - 0.5

    # Make predictions
    preds = (X_norm * coeffs).sum(dim=1)

    # Equivalent, written as a matrix multiply:
    preds = X_norm @ coeffs
* is element-wise multiply
@ is matrix multiply
Both give the same result here — but @ is cleaner and scales to neural networks.
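The equivalence can be confirmed by hand on a small example. The numbers below are arbitrary and chosen so the arithmetic is easy to follow.

```python
import torch

X = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])      # 2 rows, 3 features
coeffs = torch.tensor([0.5, -1.0, 2.0])  # one coefficient per feature

# Broadcasting multiplies every row by coeffs, then we sum across columns
preds_elementwise = (X * coeffs).sum(dim=1)

# The matrix-vector product does the multiply and the sum in one step
preds_matmul = X @ coeffs

print(preds_elementwise, preds_matmul)
```

Row one works out to 0.5 − 2 + 6 = 4.5 and row two to 2 − 5 + 12 = 9 in both versions.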
Think of loss as a hill. The gradient tells us which direction is uphill. We take a step in the opposite direction.
Enable gradients: tell PyTorch to track operations on coefficients
Forward pass: compute predictions and calculate loss (MAE)
Backward pass: PyTorch computes gradients automatically
Update: subtract a small step in the gradient direction
    # The update equation
    coeffs -= lr * coeffs.grad

    # Why subtract?
    # Gradient = direction of increasing loss
    # We want LESS loss → go the opposite way
    def calc_preds(coeffs, X):
        return (X * coeffs).sum(dim=1)

    def calc_loss(coeffs, X, y):
        preds = calc_preds(coeffs, X)
        return torch.abs(preds - y).mean()

    def train(epochs=30, lr=0.1):
        coeffs = torch.rand(n_coef) - 0.5
        coeffs.requires_grad_()                   # ① enable
        for epoch in range(epochs):
            loss = calc_loss(coeffs, X_tr, y_tr)  # ② forward
            loss.backward()                       # ③ gradients
            with torch.no_grad():
                coeffs -= lr * coeffs.grad        # ④ update
                coeffs.grad.zero_()               # clear grads
        return coeffs
Why zero the gradients? PyTorch accumulates gradients by default. If you don't zero them, each step adds to the previous one — giving wrong updates.
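The accumulation is easy to observe directly. This sketch uses a stand-in loss, the sum of squares (so the true gradient is simply `2 * w`), rather than the lesson's MAE, and calls `backward()` twice without zeroing in between.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_fn(w):
    return (w ** 2).sum()     # stand-in loss; its gradient is 2 * w

loss_fn(w).backward()
first = w.grad.clone()        # the true gradient, 2 * w

loss_fn(w).backward()         # no zeroing in between
accumulated = w.grad.clone()  # the second gradient was added to the first

w.grad.zero_()
loss_fn(w).backward()
after_zero = w.grad.clone()   # zeroing first restores the true gradient

print(first, accumulated, after_zero)
```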
Loss over training:
Result: Loss drops from 0.53 → ~0.30. Accuracy ~79%.
Without sigmoid, our model outputs values like -0.3 or 1.7. But survival probability must be between 0 and 1.
$\sigma(x) = \frac{1}{1 + e^{-x}}$
    # Apply sigmoid to predictions
    def calc_preds(coeffs, X):
        return torch.sigmoid(X @ coeffs)

    # Result: accuracy 0.79 → 0.823
    # Learning rate: 0.1 → 2.0 (easier!)
Rule: For binary classification (0 or 1 target), end your model with a sigmoid so the output can be read as a probability. Leaving it out is a common beginner mistake.
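To make the squashing concrete, the sketch below passes a few unbounded values (including the -0.3 and 1.7 mentioned above) through `torch.sigmoid` and compares the result with the formula written out by hand.

```python
import torch

raw = torch.tensor([-3.0, -0.3, 0.0, 1.7, 5.0])  # unbounded linear outputs

probs = torch.sigmoid(raw)              # built-in version
by_hand = 1 / (1 + torch.exp(-raw))     # the formula written out

print(probs)
```

Every output lies strictly between 0 and 1, an input of 0 maps to exactly 0.5, and large positive or negative inputs flatten out towards 1 or 0.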
After training, inspect the learned coefficients. They should align with historical knowledge. If they don't — something is wrong.
    # View coefficients with feature names
    dict(zip(feat_cols, coeffs.detach().tolist()))
| Feature | Coefficient | Interpretation |
|---|---|---|
| Age | Negative | Older → less likely to survive |
| Sex_male | Negative | Males had lower survival |
| Pclass_1 | Positive | 1st class → more likely to survive |
| Fare (log) | Positive | Higher fare → better odds |
These results match history: "Women and children first" — the famous Titanic evacuation order. Negative coefficients for Age and Sex_male confirm our model has learned something real.
Accuracy metric vs. loss:
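We train using MAE loss because it changes smoothly as the coefficients change, so it has useful gradients. We report accuracy because that is what Kaggle scores. Accuracy cannot be used for training: a small change in a coefficient rarely flips a prediction, so its gradient is zero almost everywhere. Both are needed, for different purposes.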
    # Calculate accuracy
    preds_binary = (preds > 0.5).float()
    acc = (preds_binary == y).float().mean()

    # → 82.3% on validation set
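The difference between the two measures shows up when a coefficient is nudged slightly. This is a toy sketch with a single hypothetical weight `w` and four made-up data points: MAE registers the improvement, while accuracy stays flat and so gives gradient descent nothing to follow.

```python
import torch

# Four made-up points; the label is 1 when x is positive
x = torch.tensor([-1.5, -0.5, 0.5, 1.5])
y = torch.tensor([0.0, 0.0, 1.0, 1.0])

def predict(w):
    return torch.sigmoid(w * x)

def mae(w):
    return torch.abs(predict(w) - y).mean().item()

def accuracy(w):
    return ((predict(w) > 0.5).float() == y).float().mean().item()

# Nudge the weight from 1.0 to 1.1
print("MAE:     ", mae(1.0), "->", mae(1.1))
print("Accuracy:", accuracy(1.0), "->", accuracy(1.1))
```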
In gradient descent, after computing loss.backward(), the update step is: coeffs -= lr * coeffs.grad. Why do we subtract rather than add?
You build a binary classifier but forget to add sigmoid at the final layer. What symptom are you most likely to observe?
The linear model draws a straight decision boundary. Real survival depended on complex interactions — young women in 3rd class had different odds than old men in 1st class. We need curves, not lines.
How do we get curves? By stacking linear layers with a non-linear activation function (ReLU) in between. ReLU is what gives neural networks their expressive power.
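The reason the activation matters can be shown numerically: two linear layers with nothing between them collapse into a single linear layer, so stacking alone adds no expressive power. The matrices below are arbitrary small examples, not the lesson's weights.

```python
import torch

X = torch.tensor([[1.0, 2.0],
                  [3.0, -1.0]])
W1 = torch.tensor([[1.0, -1.0],
                   [0.5, 2.0]])
W2 = torch.tensor([[1.0],
                   [-1.0]])

stacked = (X @ W1) @ W2           # two linear layers, nothing in between
collapsed = X @ (W1 @ W2)         # one linear layer with a combined weight matrix
with_relu = (X @ W1).relu() @ W2  # ReLU in between breaks the equivalence

print(stacked.flatten(), collapsed.flatten(), with_relu.flatten())
```

The first two results are identical. The third differs in the second row, where ReLU zeroed a negative hidden value before the second layer could use it.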
Weight matrices:
W1: (12 × 20) = 240 params
W2: (20 × 1) + bias = 21 params
Total: 261 parameters
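The same arithmetic can be written as a small helper, which is handy when you change the layer sizes. `count_params` is a hypothetical name used only for this sketch; it assumes, as above, that only the second layer has a constant term.

```python
def count_params(n_in, n_hidden, n_out=1):
    """Parameter counts for input -> hidden -> output, with a constant on layer 2 only."""
    w1 = n_in * n_hidden           # layer 1 weights, no constant
    w2 = n_hidden * n_out + n_out  # layer 2 weights plus one constant per output
    return w1, w2, w1 + w2

print(count_params(12, 20))  # (240, 21, 261)
```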
Rectified Linear Unit — the simplest useful activation function:
ReLU(x) = max(0, x)
Negative → 0 | Positive → unchanged
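In PyTorch this is a one-liner. The sketch below applies it to a few sample values and checks it against `torch.clamp`, which is the same max(0, x) operation under another name.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_method = x.relu()              # tensor method, as used in calc_preds
relu_clamp = torch.clamp(x, min=0)  # the same thing written as max(0, x)

print(relu_method)
```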
    def calc_preds(coeffs, X):
        l1, l2, const = coeffs

        # Layer 1: input → hidden
        h1 = (X @ l1).relu()

        # Layer 2: hidden → output
        out = (h1 @ l2) + const

        # Final activation (binary)
        return out.sigmoid()
Activation function rules:
— Hidden layers → ReLU
— Final layer for binary classification → Sigmoid
— Final layer for multi-class → Softmax
— Final layer for regression → None
Unlike the linear model, neural networks need careful weight initialisation to avoid gradients that are too large or too small.
    # Layer 1: input (n_coef) → hidden (20)
    # torch.rand draws from [0, 1), so subtracting 0.5 centres the weights on zero
    # (subtracting from torch.randn, which is already centred, would skew them negative)
    n_hidden = 20
    l1 = (torch.rand(n_coef, n_hidden) - 0.5) / n_hidden

    # Layer 2: hidden (20) → output (1)
    l2 = torch.rand(n_hidden, 1) - 0.3

    # Constant (bias) term for layer 2
    const = torch.rand(1) - 0.3

    # Enable gradients for all parameters
    for p in [l1, l2, const]:
        p.requires_grad_()
The second matrix multiply sums 20 hidden values, so its result can grow by a factor of roughly 20. Dividing the layer 1 weights by n_hidden keeps the output on a scale similar to the linear model's, so gradients don't explode on the first step.
Layer 1 doesn't need a constant because one-hot encoding already covers all levels (the dummy variables sum to 1). Layer 2 does need one — without it, the output is forced through the origin.
In our 2-layer neural network (Input → Hidden → Output), what activation function should be used after the hidden layer, and what should be used at the output layer for binary classification?
If you add a second hidden layer with 10 units after the first hidden layer of 20 units, what should be the shape of the new weight matrix connecting them?
Now that you understand every step, let FastAI automate the boilerplate. The concepts are identical — the framework just handles preprocessing, initialisation, and training loops for you.
Manual PyTorch
    # Missing values (assign the result; fillna does not modify df by default)
    df = df.fillna(df.mode().iloc[0])

    # Dummy variables
    df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

    # Normalise
    X = X / X.max(dim=0).values

    # Init weights manually
    l1 = (torch.rand(n_coef, 20) - 0.5) / 20
    l2 = torch.rand(20, 1) - 0.3

    # Training loop: ~20 lines
FastAI equivalent
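    from fastai.tabular.all import *

    # All preprocessing in one call:
    dls = TabularDataLoaders.from_df(
        df,
        procs=[Categorify, FillMissing, Normalize],  # preprocessing steps to apply
        cat_names=['Pclass', 'Sex'],    # ← encoded as categories
        cont_names=['Age', 'LogFare'],  # ← normalised
        y_names='Survived',
        y_block=CategoryBlock(),        # treat the 0/1 target as a class, not a number
        valid_idx=valid_idx
    )

    # Model + training:
    learn = tabular_learner(dls, layers=[10, 10], metrics=accuracy)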
FastAI handles automatically: dummy variables, missing values, normalisation, weight initialisation, gradient zeroing, learning rate scheduling, and the training loop.
Choosing a learning rate manually is guesswork. FastAI includes a learning rate finder that tests many values systematically.
    # Run the learning rate finder
    learn.lr_find()
Starts with a very small learning rate (1e-7)
Gradually increases it across mini-batches
Tracks the loss at each step
Plots loss vs. learning rate — you pick from the curve
Rule of thumb: Pick a learning rate where the loss is still falling steeply, somewhat before the bottom of the "valley" and well before the loss starts climbing again.
Train several models with the same architecture independently. Each starts with different random weights → learns slightly differently → makes different errors. Averaging them cancels part of the individual errors.
    def train_one():
        learn = tabular_learner(dls, layers=[10, 10])
        learn.fit(10, 0.03)
        # get_preds returns (predictions, targets); keep only the predictions
        return learn.get_preds(dl=test_dl)[0]

    # Train 5 independent models
    all_preds = [train_one() for _ in range(5)]

    # Average their probability predictions
    final = torch.stack(all_preds).mean(0)
| Approach | Kaggle Rank |
|---|---|
| Single linear model | ~50th percentile |
| Single neural network | ~50th percentile |
| Ensemble of 5 NNs | ~25th percentile |
Key insight: Small code changes can yield substantial performance improvements. Ensembling is one of the highest return-on-effort techniques in practice.
Average the probabilities (not the binary 0/1 values), then apply a threshold of 0.5. This retains more information than voting.
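A small made-up example shows how the two strategies can disagree. The probabilities below are hypothetical outputs from three models for two passengers: in each case one model is very confident and the other two are only mildly against.

```python
import numpy as np

# Hypothetical survival probabilities: rows are models, columns are passengers
probs = np.array([
    [0.95, 0.45],   # model 1
    [0.45, 0.40],   # model 2
    [0.45, 0.90],   # model 3
])

# Averaging keeps each model's confidence, then thresholds once
avg_pred = (probs.mean(axis=0) > 0.5).astype(int)

# Majority voting thresholds first, discarding the confidence
votes = (probs > 0.5).astype(int)
vote_pred = (votes.sum(axis=0) > probs.shape[0] / 2).astype(int)

print("average:", avg_pred, "vote:", vote_pred)
```

Averaging predicts survival for both passengers, while voting predicts the opposite. Neither answer is guaranteed to be right; the point is only that voting ignores how confident each model was.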
In the FastAI TabularDataLoaders.from_df() call, what is the difference between cat_names and cont_names?
cat_names are columns you care about; cont_names are continuous columns you'll drop
cat_names are categorical columns (FastAI applies embeddings/dummy vars); cont_names are numeric columns (FastAI normalises these)
cat_names is for training data; cont_names is for test data
When ensembling 5 models for binary classification, which aggregation strategy is generally best?
Before the workshop: Make sure you have a Kaggle account and have accepted the Titanic competition rules; you'll need this to submit.
Self-study: Read Chapter 9 of the FastAI book (Tabular Modelling). Pay attention to entity embeddings — a more powerful alternative to one-hot encoding for high-cardinality categories.
Exam relevance: Gradient descent, sigmoid, loss vs. accuracy distinction, and activation functions are all mid-term and final exam topics (SLO1, SLO2).