CP3501 Deep Learning

Week 4: Understanding How Models Actually Learn

Today's Journey:

  • Build a neural network in Excel
  • Understand gradient descent from first principles
  • Translate Excel to PyTorch code
  • Apply modern architectures to your projects

By End of Today's Class

You Will Understand:

  • What "training" actually means mathematically
  • How gradient descent finds optimal weights
  • Why deep learning works (universal approximation)
  • How to pick better architectures for your projects

Note: First 20 minutes = Quick ML foundations refresher
(We need these concepts today!)

Knowledge Check: What Do You Remember?

Turn to a neighbor and discuss (2 minutes):
  1. What's the difference between training and testing data?
  2. What does "overfitting" mean?
  3. What's a loss function?

Don't worry if you're fuzzy on these - we're about to refresh them!

The Machine Learning Recipe (Refresher)

Every ML Model Has 4 Core Components:

1. DATA

  • Training set: Model learns from this
  • Validation set: Check progress
  • Test set: Final evaluation

2. MODEL

  • Mathematical function: y = f(x, weights)
  • Weights = the "knobs" we tune

3. LOSS FUNCTION

  • Measures "how wrong" predictions are
  • Example: MSE = average of (pred - actual)²

4. OPTIMIZER

  • Adjusts weights to minimize loss
  • Today's focus: Gradient Descent
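
As a minimal sketch of component 3, the MSE formula above fits in a few lines of plain Python. The function name `mse` and the sample numbers are illustrative assumptions, not values from the course workbook.

```python
def mse(predictions, actuals):
    """Mean squared error: average of (pred - actual)^2."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predictions vs. true labels (1 = survived, 0 = did not)
preds = [0.9, 0.2, 0.6]
actual = [1, 0, 1]
loss = mse(preds, actual)
print(f"MSE = {loss:.3f}")
```

A perfect model would score 0; the further predictions drift from the labels, the larger the number the optimizer has to push down.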

How Models Learn: The Training Loop

1. Make predictions on training data
2. Calculate loss (how wrong we are)
3. Adjust weights to reduce loss ← The Magic!
4. Repeat until loss stops improving

Today we'll understand step 3:
HOW do we adjust weights?

What You Already Know (Weeks 1-3)

You've Already Done This:

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)  # ← What actually happens here?

Week 1-2: Trained image classifiers

Week 3: Improved with data augmentation

But we never opened the black box...

Answer: Gradient descent on millions of weights!

Today's Learning Strategy

1. Build in EXCEL

See every calculation
Use Excel Solver as "optimizer"

2. Translate to PyTorch

Connect Excel formulas to Python code

3. Apply to Real Projects

Use better architectures (timm models)

Ready? Let's build! 🛠️

Part 1: Neural Networks in Excel

Our First Neural Network
In a Spreadsheet!

The Titanic Dataset (Simplified)

Task: Predict survival from Age, Sex, and Class

Age IsMale Class Survived?
22 1 3 0
38 0 1 1
26 0 3 1

The Neural Network Structure

Simple Network (1 Hidden Layer)

INPUT
3 neurons
(Age, IsMale, Class)
HIDDEN
8 neurons
(calculated values)
OUTPUT
1 neuron
(survival 0-1)

Weights: Numbers We Need to Learn

  • 3 × 8 = 24 weights (input → hidden)
  • 8 × 1 = 8 weights (hidden → output)
  • Total: 32 parameters to optimize!

Matrix Multiplication: The Core Operation

KEY CONCEPT: Neural networks are matrix multiplications!

Excel Formula:

=SUMPRODUCT(inputs, weights)

Example Calculation:

Inputs: [22, 1, 3]

Weights: [0.1, -0.5, 0.2]

Output = 22×0.1 + 1×(-0.5) + 3×0.2
       = 2.2 - 0.5 + 0.6
       = 2.3

In Excel Cell:

=SUMPRODUCT( A2:C2, $W$1:$W$3 )

This single formula is the heart of deep learning!
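
The same calculation can be sketched in plain Python. The helper name `sumproduct` is an assumption chosen to mirror the Excel function; the inputs and weights are the ones from the example above.

```python
def sumproduct(inputs, weights):
    """Python equivalent of Excel's SUMPRODUCT for two equal-length ranges."""
    return sum(x * w for x, w in zip(inputs, weights))


inputs = [22, 1, 3]          # Age, IsMale, Class
weights = [0.1, -0.5, 0.2]   # example weights from the slide
output = sumproduct(inputs, weights)
print(round(output, 2))  # 2.3
```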

DEMO TIME: Open the Excel File

Download: titanic_neural_net.xlsx

(Available on LearnJCU)

Follow Along as We Examine:

  1. The input data (Age, IsMale, Class, Survived)
  2. Weight cells (currently random numbers)
  3. Prediction formulas (SUMPRODUCT + MAX)
  4. Loss calculation (MSE formula)

⏱️ 5-minute guided tour

We'll walk through each section together

The Hidden Layer Formula (Excel)

Building the Hidden Layer (for each of 8 neurons):

Step 1: Calculate Weighted Sum

=SUMPRODUCT($B2:$D2, E$1:G$1)

Step 2: Apply ReLU Activation

=MAX(0, weighted_sum)

ReLU = "Rectified Linear Unit"

Keep positive numbers, zero out negatives

Input: -2.5 → ReLU → 0

Input:  3.7 → ReLU → 3.7
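
A rough Python sketch of the two hidden-layer steps, assuming 8 neurons with 3 weights each as in the network structure above. The random starting weights and the seed are arbitrary assumptions, standing in for the random numbers in the workbook's weight cells.

```python
import random


def relu(x):
    """Keep positive values, zero out negatives (Excel: =MAX(0, x))."""
    return max(0.0, x)


def hidden_layer(inputs, weight_columns):
    """One weighted sum followed by ReLU for each hidden neuron."""
    return [relu(sum(x * w for x, w in zip(inputs, column)))
            for column in weight_columns]


random.seed(0)    # arbitrary seed so the sketch is repeatable
row = [22, 1, 3]  # Age, IsMale, Class (first row of the Titanic table)
# 8 hidden neurons, each with 3 weights (random starting values, an assumption)
weight_columns = [[random.uniform(-0.5, 0.5) for _ in row] for _ in range(8)]

activations = hidden_layer(row, weight_columns)
print(relu(-2.5), relu(3.7))
print(len(activations), "hidden activations, all >= 0:",
      all(a >= 0 for a in activations))
```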

Why ReLU? The Nonlinearity Magic

Why not just: Output = Inputs × Weights?

Problem: Pure linear transformations can't model curves!

Linear: X → X×W → Y    (just a straight line)

With ReLU: X → X×W → ReLU(X×W) → Y    (can bend!)

ReLU creates "kinks"

Multiple ReLUs stacked → Can approximate ANY continuous function!
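
To see why the activation matters, here is a small sketch with one input and one neuron per layer. The weights `w1` and `w2` are hypothetical. Without ReLU, two layers collapse into a single multiplication; with ReLU in between, the output bends at zero.

```python
def relu(x):
    return max(0.0, x)


w1, w2 = 2.0, 3.0  # hypothetical weights, one per layer


def two_linear_layers(x):
    return (x * w1) * w2      # no activation between the layers


def single_linear_layer(x):
    return x * (w1 * w2)      # one layer with the combined weight


def with_relu(x):
    return relu(x * w1) * w2  # ReLU between the two layers


for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, two_linear_layers(x), single_linear_layer(x), with_relu(x))
```

The first two columns of output always match, so stacking purely linear layers adds no expressive power. The third column is flat for negative inputs and rising for positive ones: that is the "kink".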

LIVE DEMO: Use Excel Solver

Now for the MAGIC:
Let Excel Find Optimal Weights!

Your Turn (5 minutes):

  1. Go to Data → Solver
  2. Set Objective: Loss cell (minimize)
  3. By Changing: All weight cells
  4. Click "Solve"

Watch the loss decrease! 📉

What Just Happened?

Excel Solver Results:

         Loss   Accuracy
Before:  12.4   52%
After:    3.1   78%

How did Solver know which way to adjust each weight?

Answer Preview:

It calculated "gradients" (slopes):

  • If increasing weight → loss goes up: decrease it
  • If increasing weight → loss goes down: increase it

This is GRADIENT DESCENT!
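
One way to picture a "slope" is to nudge a weight and watch the loss. The sketch below does this for a made-up one-weight model; the input, target, and function names are assumptions, and Solver's internal method may differ in detail.

```python
def loss_fn(w):
    """Squared error of a one-weight model on a single made-up example."""
    x, y = 2.0, 10.0  # hypothetical input and target
    return (w * x - y) ** 2


def slope(f, w, eps=1e-4):
    """Estimate d(loss)/dw by nudging w slightly in both directions."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


for w in [3.0, 7.0]:
    g = slope(loss_fn, w)
    action = "increase w" if g < 0 else "decrease w"
    print(f"w = {w}: slope = {g:+.2f} -> {action}")
```

The best weight here is 5. Starting below it gives a negative slope (increase the weight); starting above it gives a positive slope (decrease the weight).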

Key Parallels: Excel vs Deep Learning

What We Just Did = Training a Neural Network!

Excel               Deep Learning Term
Input columns       Input features
Weight cells        Model parameters
SUMPRODUCT          Matrix multiplication
MAX(0, x)           ReLU activation
Loss cell (MSE)     Loss function
Solver "Solve"      Optimizer.step()
Solver iterations   Training epochs

Excel Solver ≈ PyTorch Optimizer!

Knowledge Check #1

Turn to a neighbor and explain:
  1. What are "weights" in a neural network?
  2. What does the loss function measure?
  3. How does Solver know how to improve weights?

⏱️ 2-minute pair discussion

Part 2: From Excel to PyTorch

Same Network, Different Tool

Excel:
SUMPRODUCT + Solver
PyTorch:
Matrix multiplication
+ Gradient descent

Side-by-side comparison coming...

Let's Start Even Simpler

Task: Fit a Curve to Noisy Data

True function: y = 3x² + 2x + 1
Our job: Find coefficients from noisy samples

Excel Version:

  • Cells A, B, C hold coefficients
  • Solver adjusts them to minimize MSE

PyTorch Version:

params = tensor([a, b, c], requires_grad=True)  # Use gradient descent

[Figure: noisy data points]

PyTorch Tensors: Excel Cells with Superpowers

params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
                                       ↑ "Track changes to this"

Excel Equivalent:

  • Cell A1 = 1.5 (coefficient a)
  • Solver tracks "if I change A1, how does loss change?"

PyTorch does this automatically!

requires_grad = True
means "I want to optimize this parameter"

The Loss Function (Same as Excel)

PyTorch:

def mse(predictions, actual):
    return ((predictions - actual) ** 2).mean()

# Calculate predictions
preds = (params[0]*x**2 + params[1]*x + params[2])

# Calculate loss
loss = mse(preds, y_actual)

Excel Parallel:

=AVERAGE( (Predictions - Actual)^2 )

Same formula,
different syntax!

The MAGIC Command: loss.backward()

loss.backward() # ← Excel Solver's "Calculate Gradient" step!

What It Does:

Before:  params.grad = None

After:   params.grad = tensor([-257., -30., -5.])

↑     ↑    ↑

"Slopes" for each parameter

Excel Equivalent:

Solver calculates: "If I increase A1 by 0.01, loss changes by X"

PyTorch's autograd engine calculates this automatically for ALL weights!
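
As a sanity check, the gradient from `loss.backward()` can be compared with one worked out by hand using the chain rule on MSE. This is a sketch under assumptions: the data (`x`, `y_actual`), the noise level, and the seed are made up here, so the printed numbers will not match the values shown on the slide.

```python
import torch

torch.manual_seed(0)  # arbitrary seed for repeatability

# Hypothetical noisy samples of y = 3x^2 + 2x + 1
x = torch.linspace(-2, 2, 20)
y_actual = 3 * x**2 + 2 * x + 1 + 0.1 * torch.randn(20)

params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
preds = params[0] * x**2 + params[1] * x + params[2]
loss = ((preds - y_actual) ** 2).mean()
loss.backward()  # fills params.grad

# Hand-derived gradient of MSE: mean(2 * error * d(pred)/d(param))
with torch.no_grad():
    error = preds - y_actual
    manual_grad = torch.stack([
        (2 * error * x**2).mean(),  # d(pred)/da = x^2
        (2 * error * x).mean(),     # d(pred)/db = x
        (2 * error).mean(),         # d(pred)/dc = 1
    ])

print("autograd:", params.grad)
print("by hand: ", manual_grad)
```

The two results should agree up to floating-point rounding, which is the whole point: autograd does the calculus so we don't have to.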

Reading the Gradient

params.grad = tensor([-257., -30., -5.])

Translation:

  • Param 0 gradient = -257
    → If we increase param[0], loss decreases rapidly
    → We SHOULD increase it!
  • Param 1 gradient = -30
    → Smaller effect, but still should increase
  • Param 2 gradient = -5
    → Tiny effect

Negative gradient → move UP
Positive gradient → move DOWN

The Gradient Descent Update

lr = 0.01  # learning rate = "step size"

with torch.no_grad():  # Don't track these changes
    params -= lr * params.grad
    params.grad.zero_()  # Clear gradients for next round

Excel Equivalent:

New A1 = Old A1 - (step_size × slope)
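
Worked through with the numbers from the earlier slides (starting values of 1.5 and gradients of -257, -30, -5), one update step looks like this in plain Python:

```python
lr = 0.01
params = [1.5, 1.5, 1.5]        # starting coefficients a, b, c
grads = [-257.0, -30.0, -5.0]   # slopes shown on the earlier slide

# New value = old value - (step size x slope), applied to each parameter
new_params = [p - lr * g for p, g in zip(params, grads)]
print([round(p, 2) for p in new_params])  # [4.07, 1.8, 1.55]
```

All three gradients are negative, so all three parameters move up, and the one with the steepest slope moves the most.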

Why Zero the Gradients?

.backward() ADDS to existing gradients
Without zeroing: gradient = grad_step1 + grad_step2 + ... ❌
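
A tiny sketch of the accumulation behaviour, using a single made-up weight `w` and the loss w²:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()   # d(w^2)/dw = 2w = 4
first = w.grad.item()

(w ** 2).backward()   # same slope again, but it is ADDED to w.grad
accumulated = w.grad.item()

w.grad.zero_()        # reset before the next step
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 4.0 8.0 4.0
```

The true slope never changed, yet the second call reports 8 because it was stacked on the first. Zeroing restores the correct value.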

The Complete Training Loop

# Setup
params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
lr = 0.01

# Training loop (Excel: run Solver for 10 iterations)
for epoch in range(10):
    # 1. Forward pass (Excel: calculate predictions)
    preds = params[0]*x**2 + params[1]*x + params[2]

    # 2. Calculate loss (Excel: MSE formula)
    loss = mse(preds, y_actual)

    # 3. Backward pass (Excel: Solver calculates slopes)
    loss.backward()

    # 4. Update weights (Excel: Solver adjusts cells)
    with torch.no_grad():
        params -= lr * params.grad
        params.grad.zero_()

    print(f"Epoch {epoch}: Loss = {loss:.2f}")

This is the entire training algorithm!
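
For comparison, here is a sketch that runs the manual update side by side with PyTorch's built-in `torch.optim.SGD`, to show that a plain SGD optimizer performs the same step. The data, noise level, seed, and the helper name `predict` are assumptions made so the example runs on its own.

```python
import torch

torch.manual_seed(42)  # arbitrary seed (assumption) for repeatable noise

# Hypothetical data: 20 noisy samples of y = 3x^2 + 2x + 1
x = torch.linspace(-2, 2, 20)
y_actual = 3 * x**2 + 2 * x + 1 + 0.5 * torch.randn(20)


def mse(predictions, actual):
    return ((predictions - actual) ** 2).mean()


def predict(p):
    return p[0] * x**2 + p[1] * x + p[2]


lr = 0.01
manual = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
auto = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
optimizer = torch.optim.SGD([auto], lr=lr)

losses = []
for epoch in range(10):
    # Manual update, as in the loop above
    loss = mse(predict(manual), y_actual)
    loss.backward()
    with torch.no_grad():
        manual -= lr * manual.grad
        manual.grad.zero_()
    losses.append(loss.item())

    # The same step via a built-in optimizer
    optimizer.zero_grad()
    mse(predict(auto), y_actual).backward()
    optimizer.step()

print("manual:", manual.detach())
print("SGD:   ", auto.detach())
```

Both parameter sets should end up equal (up to rounding), and the recorded loss should fall from the first epoch to the last.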

Watch It Work!

Training Output (selected epochs from a longer run than the 10-epoch loop above):

Epoch 0: Loss = 11.50
Epoch 5: Loss = 8.32
Epoch 10: Loss = 5.71
Epoch 20: Loss = 3.42
Epoch 50: Loss = 2.07

After 50 Iterations:

          Param 0 (a)   Param 1 (b)   Param 2 (c)
Learned:  3.01          1.98          1.02
True:     3.00          2.00          1.00

We recovered the function! 🎯

Learning Rate: The Most Important Hyperparameter

lr = "step size" in gradient descent

Learning Rate   Behavior                  Loss Progress
Too Small       Takes forever ⏰           10.0 → 9.8 → 9.6 → 9.4 → ... (100 epochs)
Just Right      Efficient convergence ✅   10.0 → 7.2 → 4.1 → 2.8 → 2.1 (10 epochs)
Too Big         Diverges! 💥               10.0 → 15.2 → 23.8 → 41.3 → ...

We use lr * grad so step size is proportional to slope AND tunable
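
The three behaviours can be reproduced on a toy problem. This sketch minimizes loss = w² (slope 2w) from a made-up starting point; the specific learning rates are assumptions chosen for this toy loss, and the thresholds for "too small" or "too big" differ from problem to problem.

```python
def run(lr, steps=5, w=3.0):
    """Gradient descent on loss = w^2, whose slope is 2w."""
    history = [w ** 2]
    for _ in range(steps):
        grad = 2 * w
        w -= lr * grad
        history.append(w ** 2)
    return history


for label, lr in [("too small", 0.001), ("just right", 0.3), ("too big", 1.1)]:
    print(f"{label:>10} (lr={lr}):", [round(l, 3) for l in run(lr)])
```

With the largest rate, each step overshoots the minimum and lands further away on the other side, so the loss grows instead of shrinking.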

From Simple Functions to Neural Networks

Same Algorithm, More Weights!

Quadratic Function

3 parameters (a, b, c)

Same gradient descent

Neural Net (Titanic)

32 parameters

Same gradient descent

ResNet-34

21 MILLION parameters

The algorithm doesn't change!
Only the model complexity scales.

Universal Approximation Theorem

Why Neural Networks Work

Stack enough ReLUs → Can approximate ANY continuous function!

Intuition by Example:

  • One ReLU = one "kinked line"
  • Two ReLUs = two bends
  • 100 ReLUs = 100 bends
  • ∞ ReLUs = smooth curve

[Figure: curve fits using 1 ReLU, 5 ReLUs, and many ReLUs]

This is why deep learning is so powerful! 🧠
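
The "more ReLUs, more bends" idea can be checked numerically. The sketch below builds a piecewise-linear fit to sin(x) on [0, π] out of ReLU terms and measures the worst-case error. The weights are set by hand from the target function rather than learned, and the helper names are assumptions, so this only illustrates what a ReLU network can represent, not how training finds it.

```python
import math


def relu(x):
    return max(0.0, x)


def build_relu_fit(f, lo, hi, n_relus):
    """Piecewise-linear fit of f on [lo, hi] built from n_relus ReLU terms."""
    knots = [lo + (hi - lo) * i / n_relus for i in range(n_relus + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(n_relus)]
    # Each ReLU switches on at its knot and changes the slope by this amount
    bends = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_relus)]

    def fit(x):
        return f(lo) + sum(b * relu(x - k) for b, k in zip(bends, knots))

    return fit


def max_error(f, fit, lo, hi, samples=1000):
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - fit(x)) for x in xs)


for n in [1, 5, 100]:
    fit = build_relu_fit(math.sin, 0.0, math.pi, n)
    err = max_error(math.sin, fit, 0.0, math.pi)
    print(n, "ReLUs -> max error", round(err, 4))
```

The error shrinks as the number of ReLUs grows, which is the practical meaning of "can approximate any continuous function".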

Knowledge Check #2

Test Your Understanding:
  1. What does params.grad contain after calling loss.backward()?
  2. Why do we multiply the gradient by the learning rate?
  3. Why must we call params.grad.zero_() after each update?

Hint: Think back to the Excel Solver analogy!

Recap: The Complete Picture

Deep Learning = 3 Simple Ideas

1. MODEL: Stack ReLUs

(Universal approximation)

2. LOSS: Measure Prediction Error

(MSE, cross-entropy, etc.)

3. OPTIMIZER: Gradient Descent

params -= lr * params.grad

That's it! Everything else is engineering.

Part 3: Practical Application

You Now Understand HOW Models Learn

Let's apply this to your image classifiers

Week 1-2 Code:

learn = vision_learner(dls, resnet34)
learn.fine_tune(3)

Translation to What We Just Learned:

  • resnet34 = architecture (specific ReLU stacking pattern)
  • fine_tune(3) = run gradient descent: 1 epoch on the new head only, then 3 epochs on the whole model
  • FastAI's optimizer = fancy version of our params -= lr * grad

Can We Do Better Than ResNet?

ResNet-34 (2015):

  • 21M parameters
  • 7% error on Pets dataset
  • Trains in 20 seconds/epoch

Are there better architectures in 2024?

Answer: YES!

Enter the timm library...

The timm Library: 500+ Modern Architectures

from timm import list_models

len(list_models())  # → 500+ architectures!

list_models('convnext*')
# ['convnext_tiny', 'convnext_small', 'convnext_base', ...]

list_models('efficientnet*')
# ['efficientnetv2_rw_s', 'efficientnet_b0', ...]

How to Choose?

Next slides: The benchmark notebook...

Architecture Benchmark Results

Testing on Pets Dataset (same data as Week 1):

Model             Params   Time/Epoch   Error (%)
ResNet-34         21M      20s          7.2
EfficientNet-B0   5M       18s          5.2
ConvNeXt-tiny     28M      27s          4.1

ConvNeXt wins!
Error drops from 7.2% to 4.1% (about 43% relative reduction) for +7 seconds per epoch 🏆

Trying a New Architecture (1 Line of Code!)

Week 1 Code:

learn = vision_learner(dls, resnet34, metrics=error_rate)

Week 4 Improved Code:

learn = vision_learner(dls, 'convnext_tiny_in22k', metrics=error_rate)
                            ↑ Just change the model name!
learn.fine_tune(3)  # Same gradient descent, better architecture

The gradient descent algorithm doesn't change!
We just give it a better function to optimize.

What's Inside These Models?

All Modern CNNs Have the Same Structure:

1. Feature Extractor (pre-trained)

← Frozen initially

  • Convolutional layers
  • ReLU activations
  • Pooling layers

2. Classification Head (new)

← We train this

  • Fully connected layers
  • Final softmax

fine_tune(3):
Frozen epoch (1 by default): Only train head (Excel: adjust last 8 weights)
Then 3 epochs: Train whole model (adjust ALL weights)
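
The freeze-then-unfreeze idea can be sketched in plain PyTorch using the Titanic-sized network as a stand-in. This is only an illustration of the concept: `body` and `head` are hypothetical names, biases are switched off to match the 32-weight spreadsheet, and fastai's own implementation handles freezing through its own machinery.

```python
import torch
from torch import nn

# Stand-in model: "body" plays the pre-trained feature extractor,
# "head" plays the new classification layer (sizes match the Excel network).
body = nn.Sequential(nn.Linear(3, 8, bias=False), nn.ReLU())
head = nn.Linear(8, 1, bias=False)
model = nn.Sequential(body, head)


def count_trainable(m):
    """Number of weights that gradient descent is allowed to change."""
    return sum(p.numel() for p in m.parameters() if p.requires_grad)


# Phase 1: freeze the body so gradient descent only touches the head
for p in body.parameters():
    p.requires_grad = False
frozen_count = count_trainable(model)
print("frozen body:", frozen_count, "trainable weights")

# Phase 2: unfreeze so every weight is updated
for p in body.parameters():
    p.requires_grad = True
unfrozen_count = count_trainable(model)
print("unfrozen:   ", unfrozen_count, "trainable weights")
```

With the body frozen only the last 8 weights can move; after unfreezing all 32 are back in play.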

Transfer Learning Revisited

Why Pre-training Matters:

From-Scratch Training:

  • Random weights
  • Learn edges → textures → objects
  • Needs millions of images
  • Takes weeks on GPUs

Transfer Learning:

  • Start with ImageNet weights
  • Already knows edges/textures
  • Needs hundreds of images
  • Takes minutes!

Gradient descent in both cases

Just different starting points!

Practical Experiment for Your Project

Workshop Task (Remaining Time):

  1. Open your Week 2 pet classifier notebook
  2. Try 3 different architectures:
    • resnet34 (baseline)
    • efficientnet_b0
    • convnext_tiny_in22k
  3. Record results:
     Model                 Error Rate   Training Time
     resnet34
     efficientnet_b0
     convnext_tiny_in22k

Which is best for YOUR dataset?

Training Knobs You Can Tune

Now That You Understand Gradient Descent

Knobs You Can Adjust:

1. Architecture (resnet vs convnext)

← Today's focus

2. Learning Rate (step size)

3. Number of Epochs (training iterations)

4. Data Augmentation (create more training examples)

Next week: We'll dive deeper into #2 and #4!

What You Learned Today

Fundamentals:

  • ✓ Built neural network in Excel (32 weights)
  • ✓ Used Solver to optimize weights
  • ✓ Translated Excel → PyTorch

Practical Skills:

  • ✓ Understand .fine_tune()
  • ✓ Swap in better architectures
  • ✓ Benchmark models

Conceptual Breakthroughs:

  • ✓ All DL = ReLUs + Gradient Descent
  • ✓ Same algorithm: 3 params → 21M params
  • ✓ Universal approximation explains why NNs work

The ONLY Equation That Matters

params -= lr × params.grad

Everything Else Is:

  • Different ways to calculate grad (backpropagation)
  • Different ways to use grad (optimizers: Adam, SGD, etc.)
  • Different model structures (architectures: ResNet, ConvNeXt, etc.)

But this core update rule NEVER changes! 🎯

Homework & Next Week

Before Next Class:

Required:

  1. Complete the architecture comparison experiment
  2. Review the benchmark notebook (link on LearnJCU)

Optional Deep Dive:

  1. Read: fastbook Chapter 4 (MNIST basics)
  2. Watch: 3Blue1Brown "Gradient Descent" video

Next Week: Data Augmentation Deep Dive

  • Create training data from thin air
  • When to use which transforms
  • Workshop: Build a custom augmentation pipeline