CP3501 Deep Learning

Week 4: Understanding How Models Actually Learn

Today's Journey:

Build a neural network in Excel
Understand gradient descent from first principles
Translate Excel to PyTorch code
Apply modern architectures to your projects

By End of Today's Class

You Will Understand:

What "training" actually means mathematically
How gradient descent finds optimal weights
Why deep learning works (universal approximation)
How to pick better architectures for your projects

Note: First 20 minutes = Quick ML foundations refresher
(We need these concepts today!)

Knowledge Check: What Do You Remember?

Turn to a neighbor and discuss (2 minutes):

What's the difference between training and testing data?
What does "overfitting" mean?
What's a loss function?

Don't worry if you're fuzzy on these - we're about to refresh them!

The Machine Learning Recipe (Refresher)

Every ML Model Has 4 Core Components:

1. DATA

Training set: Model learns from this
Validation set: Check progress
Test set: Final evaluation

2. MODEL

Mathematical function: y = f(x, weights)
Weights = the "knobs" we tune

3. LOSS FUNCTION

Measures "how wrong" predictions are
Example: MSE = average of (pred - actual)²

4. OPTIMIZER

Adjusts weights to minimize loss
Today's focus: Gradient Descent
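
As a quick refresher on component 3, here is a minimal sketch of MSE in plain Python. The prediction and label values are made up for illustration and are not from the course data.

```python
def mse(preds, actual):
    """Mean squared error: the average of (pred - actual)^2."""
    return sum((p - a) ** 2 for p, a in zip(preds, actual)) / len(preds)

# Hypothetical predictions and true labels (illustration only)
preds = [0.9, 0.2, 0.6]
actual = [1, 0, 1]

loss = mse(preds, actual)
print(f"MSE = {loss:.3f}")
```

Predictions closer to the labels give a smaller number, which is exactly what the optimizer tries to achieve.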

How Models Learn: The Training Loop

1. Make predictions on training data

↓

2. Calculate loss (how wrong we are)

↓

3. Adjust weights to reduce loss ← The Magic!

↓

4. Repeat until loss stops improving

Today we'll understand step 3:
HOW do we adjust weights?

What You Already Know (Weeks 1-3)

You've Already Done This:

learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(3)  # ← What actually happens here?
            

Week 1-2: Trained image classifiers

Week 3: Improved with data augmentation

But we never opened the black box...

Answer: Gradient descent on millions of weights!

Today's Learning Strategy

1. Build in EXCEL

See every calculation
Use Excel Solver as "optimizer"

→

2. Translate to PyTorch

Connect Excel formulas to Python code

→

3. Apply to Real Projects

Use better architectures (timm models)

Ready? Let's build! 🛠️

Part 1: Neural Networks in Excel

Our First Neural Network
In a Spreadsheet!

The Titanic Dataset (Simplified)

Task: Predict survival from Age, Sex, and Class

Age	IsMale	Class	Survived?
22	1	3	0
38	0	1	1
26	0	3	1

The Neural Network Structure

Simple Network (1 Hidden Layer)

INPUT
3 neurons
(Age, IsMale, Class)

→

HIDDEN
8 neurons
(calculated values)

→

OUTPUT
1 neuron
(survival 0-1)

Weights: Numbers We Need to Learn

3 × 8 = 24 weights (input → hidden)
8 × 1 = 8 weights (hidden → output)
Total: 32 parameters to optimize!
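
The count above can be reproduced in a few lines of Python. This sketch counts weights only, as the slide does. The bias count at the end is an assumption added for comparison, since many frameworks also give each neuron a bias term.

```python
layer_sizes = [3, 8, 1]  # input, hidden, output neurons

# One weight for every (input neuron, output neuron) pair between two layers
weights_per_layer = [
    n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_weights = sum(weights_per_layer)

# Assumption (not on the slide): one extra bias per hidden and output neuron
total_with_biases = total_weights + sum(layer_sizes[1:])

print(weights_per_layer, total_weights, total_with_biases)
```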

Matrix Multiplication: The Core Operation

KEY CONCEPT: Neural networks are built from matrix multiplications!

Excel Formula:

=SUMPRODUCT(inputs, weights)

Example Calculation:

Inputs: [22, 1, 3]

Weights: [0.1, -0.5, 0.2]

Output = 22×0.1 + 1×(-0.5) + 3×0.2
= 2.2 - 0.5 + 0.6
= 2.3

In Excel Cell:

=SUMPRODUCT(
  A2:C2,
  $W$1:$Y$1
)

(SUMPRODUCT needs both ranges to have the same shape: here, two 1×3 rows)
                

This single formula is the heart of deep learning!
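
For comparison, the same weighted sum can be sketched in plain Python. The helper name `sumproduct` is chosen here to mirror the Excel function and is not a built-in.

```python
def sumproduct(inputs, weights):
    """Multiply matching elements and add them up, like Excel's SUMPRODUCT."""
    return sum(i * w for i, w in zip(inputs, weights))

inputs = [22, 1, 3]          # Age, IsMale, Class
weights = [0.1, -0.5, 0.2]   # example weights from the slide

output = sumproduct(inputs, weights)
print(round(output, 2))  # 2.3
```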

DEMO TIME: Open the Excel File

Download: titanic_neural_net.xlsx

(Available on LearnJCU)

Follow Along as We Examine:

The input data (Age, IsMale, Class, Survived)
Weight cells (currently random numbers)
Prediction formulas (SUMPRODUCT + MAX)
Loss calculation (MSE formula)

⏱️ 5-minute guided tour

We'll walk through each section together

The Hidden Layer Formula (Excel)

Building the Hidden Layer (for each of 8 neurons):

Step 1: Calculate Weighted Sum

=SUMPRODUCT($B2:$D2, E$1:G$1)

Step 2: Apply ReLU Activation

=MAX(0, weighted_sum)

ReLU = "Rectified Linear Unit"

Keep positive numbers, zero out negatives

Input: -2.5 → ReLU → 0
Input:  3.7 → ReLU → 3.7
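
The two steps can be sketched together in plain Python. The weights below are random placeholders, not the values in the spreadsheet, and the function names are chosen for this example.

```python
import random

def relu(z):
    """Keep positive numbers, zero out negatives (Excel: =MAX(0, z))."""
    return max(0.0, z)

def hidden_layer(inputs, weight_columns):
    """One weighted sum + ReLU per hidden neuron."""
    return [
        relu(sum(i * w for i, w in zip(inputs, column)))
        for column in weight_columns
    ]

random.seed(0)
n_inputs, n_hidden = 3, 8
# Hypothetical starting weights: one column of 3 weights per hidden neuron
weight_columns = [
    [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(n_hidden)
]

passenger = [22, 1, 3]  # Age, IsMale, Class
activations = hidden_layer(passenger, weight_columns)

print(relu(-2.5), relu(3.7))  # 0.0 3.7
print(len(activations))       # 8
```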

Why ReLU? The Nonlinearity Magic

Why not just: Output = Inputs × Weights?

Problem: Pure linear transformations can't model curves!

Linear: X → X×W → Y (just a straight line)

With ReLU: X → X×W → ReLU(X×W) → Y (can bend!)

Multiple ReLUs stacked → Can approximate ANY continuous function!
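
A small sketch of why the nonlinearity matters, using made-up one-dimensional weights: two linear layers in a row collapse into a single linear layer, while two ReLU units can already produce a V shape that no straight line matches.

```python
def relu(z):
    return max(0.0, z)

w1, w2 = 2.0, -1.5  # hypothetical weights for two stacked layers

def two_linear_layers(x):
    return w2 * (w1 * x)

def one_linear_layer(x):
    # Same result with a single combined weight: still a straight line
    return (w1 * w2) * x

def two_relu_units(x):
    # relu(x) + relu(-x) equals |x|: a line with a bend at 0
    return relu(x) + relu(-x)

for x in [-3.0, -1.0, 0.0, 2.0]:
    print(x, two_linear_layers(x), one_linear_layer(x), two_relu_units(x))
```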

LIVE DEMO: Use Excel Solver

Now for the MAGIC:
Let Excel Find Optimal Weights!

Your Turn (5 minutes):

Go to Data → Solver
Set Objective: Loss cell (minimize)
By Changing: All weight cells
Click "Solve"

Watch the loss decrease! 📉

What Just Happened?

Excel Solver Results:

	Loss	Accuracy
Before:	12.4	52%
After:	3.1	78%

How did Solver know which way to adjust each weight?

Answer Preview:

It calculated "gradients" (slopes):

If increasing weight → loss goes up: decrease it
If increasing weight → loss goes down: increase it

This is GRADIENT DESCENT!
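
The "nudge a weight and watch the loss" idea can be sketched in plain Python. The loss function here is a toy example with its minimum at w = 3, chosen only for illustration. It is not the Titanic loss.

```python
def loss(w):
    # Toy loss (assumption for this sketch): smallest when w = 3
    return (w - 3.0) ** 2

def slope(f, w, nudge=0.01):
    """Estimate the gradient: how much does the loss change per unit of w?"""
    return (f(w + nudge) - f(w)) / nudge

w = 1.0
g = slope(loss, w)
direction = "decrease" if g > 0 else "increase"
print(f"slope at w={w}: {g:.2f} -> {direction} the weight")
```

At w = 1 the slope is negative, so increasing the weight lowers the loss. Starting from w = 5 the slope would be positive and the rule would say to decrease it.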

Key Parallels: Excel vs Deep Learning

What We Just Did = Training a Neural Network!

Excel	Deep Learning Term
Input columns	Input features
Weight cells	Model parameters
SUMPRODUCT	Matrix multiplication
MAX(0, x)	ReLU activation
Loss cell (MSE)	Loss function
Solver "Solve"	Optimizer.step()
Solver iterations	Training epochs

Excel Solver ≈ PyTorch Optimizer!

Knowledge Check #1

Turn to a neighbor and explain:

What are "weights" in a neural network?
What does the loss function measure?
How does Solver know how to improve weights?

⏱️ 2-minute pair discussion

Part 2: From Excel to PyTorch

Same Network, Different Tool

Excel:
SUMPRODUCT + Solver

→

PyTorch:
Matrix multiplication
+ Gradient descent

Side-by-side comparison coming...

Let's Start Even Simpler

Task: Fit a Curve to Noisy Data

True function: y = 3x² + 2x + 1
Our job: Find coefficients from noisy samples

Excel Version:

Cells A, B, C hold coefficients
Solver adjusts them to minimize MSE

PyTorch Version:

import torch

# a, b, c = starting guesses for the coefficients (floats)
params = torch.tensor([a, b, c],
    requires_grad=True)
# Use gradient descent
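
A minimal sketch of how the noisy samples for this task could be generated. The number of points, the x range, the noise level and the seed are all assumptions. The slides only specify the true function.

```python
import torch

torch.manual_seed(0)  # make the noise reproducible

# 20 sample points between -2 and 2 (assumed range)
x = torch.linspace(-2, 2, steps=20)

# True function from the slide, plus Gaussian noise (assumed std of 1)
noise = torch.randn(20)
y_actual = 3 * x**2 + 2 * x + 1 + noise

# Starting guesses for a, b, c that gradient descent will improve
params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)

print(x.shape, y_actual.shape, params.requires_grad)
```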
                

PyTorch Tensors: Excel Cells with Superpowers

params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
                                        ↑
                                "Track changes to this"
        

Excel Equivalent:

Cell A1 = 1.5 (coefficient a)
Solver tracks "if I change A1, how does loss change?"

PyTorch does this automatically!

requires_grad = True
means "I want to optimize this parameter"

The Loss Function (Same as Excel)

PyTorch:

def mse(predictions, actual):
    return ((predictions - actual)
           ** 2).mean()

# Calculate predictions
preds = (params[0]*x**2 + 
         params[1]*x + 
         params[2])

# Calculate loss
loss = mse(preds, y_actual)
                

Excel Parallel:

=AVERAGE(
  (Predictions - Actual)^2
)
                

Same formula,
different syntax!

The MAGIC Command: loss.backward()

loss.backward() # ← Excel Solver's "Calculate Gradient" step!

What It Does:

Before: params.grad = None

After: params.grad = tensor([-257., -30., -5.])

↑ ↑ ↑

"Slopes" for each parameter

Excel Equivalent:

Solver calculates: "If I increase A1 by 0.01, loss changes by X"

PyTorch calculus engine calculates this automatically for ALL weights!
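
To see that `loss.backward()` really computes slopes, this sketch compares its result with the derivative of MSE worked out by hand. The data is a noise-free version of the quadratic, so the gradient values will differ from the ones on the slide.

```python
import torch

x = torch.linspace(-2, 2, steps=20)
y_actual = 3 * x**2 + 2 * x + 1  # noise-free data (assumption for this check)

params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
preds = params[0] * x**2 + params[1] * x + params[2]
loss = ((preds - y_actual) ** 2).mean()

print(params.grad)  # None: nothing computed yet
loss.backward()
print(params.grad)  # one slope per parameter

# By hand: d(loss)/d(param) = mean(2 * error * d(pred)/d(param))
error = (preds - y_actual).detach()
by_hand = torch.stack([
    (2 * error * x**2).mean(),  # d(pred)/da = x^2
    (2 * error * x).mean(),     # d(pred)/db = x
    (2 * error).mean(),         # d(pred)/dc = 1
])
print(by_hand)
```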

Reading the Gradient

params.grad = tensor([-257., -30., -5.])

Translation:

Param 0 gradient = -257
→ If we increase param[0], loss decreases rapidly
→ We SHOULD increase it!
Param 1 gradient = -30
→ Smaller effect, but still should increase
Param 2 gradient = -5
→ Tiny effect

Negative gradient → move UP
Positive gradient → move DOWN

The Gradient Descent Update

lr = 0.01  # learning rate = "step size"

with torch.no_grad():  # Don't track these changes
    params -= lr * params.grad
    params.grad.zero_()  # Clear gradients for next round
        

Excel Equivalent:

New A1 = Old A1 - (step_size × slope)
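
Plugging the slide's numbers into the update rule gives a worked example in plain Python:

```python
lr = 0.01
params = [1.5, 1.5, 1.5]
grads = [-257.0, -30.0, -5.0]  # gradient values from the earlier slide

# new = old - (step_size × slope), applied to each parameter
new_params = [p - lr * g for p, g in zip(params, grads)]

print([round(p, 2) for p in new_params])  # [4.07, 1.8, 1.55]
```

All three gradients are negative, so all three parameters move up. The one with the steepest slope moves the most.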

Why Zero the Gradients?

.backward() ADDS to existing gradients
Without zeroing: gradient = grad_step1 + grad_step2 + ... ❌
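
A minimal demonstration of the accumulation behaviour, using a single weight and the toy function w² (chosen here for illustration):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.item()        # slope of w^2 at w=2 is 4

(w ** 2).backward()          # second call, no zeroing in between
accumulated = w.grad.item()  # 4 + 4: the old gradient was added to

w.grad.zero_()               # clear, then compute again
(w ** 2).backward()
after_zeroing = w.grad.item()

print(first, accumulated, after_zeroing)  # 4.0 8.0 4.0
```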

The Complete Training Loop

# Setup (x, y_actual and mse() come from the previous slides)
params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
lr = 0.01

# Training loop (Excel: run Solver for 51 iterations)
for epoch in range(51):
    # 1. Forward pass (Excel: calculate predictions)
    preds = params[0]*x**2 + params[1]*x + params[2]

    # 2. Calculate loss (Excel: MSE formula)
    loss = mse(preds, y_actual)

    # 3. Backward pass (Excel: Solver calculates slopes)
    loss.backward()

    # 4. Update weights (Excel: Solver adjusts cells)
    with torch.no_grad():
        params -= lr * params.grad
        params.grad.zero_()

    # Print a few checkpoints rather than every epoch
    if epoch in (0, 5, 10, 20, 50):
        print(f"Epoch {epoch}: Loss = {loss.item():.2f}")
        

This is the entire training algorithm!
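
In practice the manual update is usually handed to a built-in optimizer. The sketch below writes the same loop with `torch.optim.SGD`, which performs the `params -= lr * params.grad` step inside `optimizer.step()`. The data generation (range, noise level, seed) is an assumption made so the example is self-contained.

```python
import torch

torch.manual_seed(0)
x = torch.linspace(-2, 2, steps=20)
y_actual = 3 * x**2 + 2 * x + 1 + 0.5 * torch.randn(20)  # assumed noise level

params = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.01)

losses = []
for epoch in range(10):
    preds = params[0] * x**2 + params[1] * x + params[2]  # forward pass
    loss = ((preds - y_actual) ** 2).mean()               # MSE
    optimizer.zero_grad()  # same job as params.grad.zero_()
    loss.backward()        # compute slopes
    optimizer.step()       # same job as params -= lr * params.grad
    losses.append(loss.item())

print(f"first loss {losses[0]:.2f}, last loss {losses[-1]:.2f}")
```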

Watch It Work!

Training Output:

Epoch 0: Loss = 11.50
Epoch 5: Loss = 8.32
Epoch 10: Loss = 5.71
Epoch 20: Loss = 3.42
Epoch 50: Loss = 2.07
            

After 50 Iterations:

	Param 0 (a)	Param 1 (b)	Param 2 (c)
Learned:	3.01	1.98	1.02
True:	3.00	2.00	1.00

We recovered the function! 🎯

Learning Rate: The Most Important Hyperparameter

lr = "step size" in gradient descent

Learning Rate	Behavior	Loss Progress
Too Small	Takes forever ⏰	10.0 → 9.8 → 9.6 → 9.4 → ... (100 epochs)
Just Right	Efficient convergence ✅	10.0 → 7.2 → 4.1 → 2.8 → 2.1 (10 epochs)
Too Big	Diverges! 💥	10.0 → 15.2 → 23.8 → 41.3 → ...

We use lr * grad so step size is proportional to slope AND tunable
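
The three behaviours in the table can be reproduced on a toy problem. This sketch minimises loss(w) = w², whose gradient is 2w. The function and the three learning rates are chosen for illustration and do not reproduce the numbers in the table.

```python
def run_gradient_descent(lr, steps=10, w=3.0):
    """Minimise loss(w) = w**2 and record the loss before each update."""
    history = []
    for _ in range(steps):
        history.append(w ** 2)
        w -= lr * 2 * w  # gradient of w**2 is 2*w
    return history

results = {}
for label, lr in [("too small", 0.001), ("just right", 0.3), ("too big", 1.1)]:
    results[label] = run_gradient_descent(lr)
    h = results[label]
    print(f"{label:>10} (lr={lr}): {h[0]:.2f} -> {h[-1]:.4f}")
```

With the smallest rate the loss barely moves in 10 steps, with the middle rate it drops close to zero, and with the largest rate every step overshoots the minimum and the loss grows.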

From Simple Functions to Neural Networks

Same Algorithm, More Weights!

Quadratic Function

3 parameters (a, b, c)

→

Same gradient descent

→

Neural Net (Titanic)

32 parameters

→

Same gradient descent

→

ResNet-34

21 MILLION parameters

The algorithm doesn't change!
Only the model complexity scales.

Universal Approximation Theorem

Why Neural Networks Work

Stack enough ReLUs → Can approximate ANY continuous function!

Intuition by Example:

One ReLU = one "kinked line"
Two ReLUs = two bends
100 ReLUs = 100 bends
∞ ReLUs = smooth curve
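
The "more ReLUs = more bends" idea can be checked numerically. This sketch builds a piecewise-linear approximation of x² on [0, 1] as a sum of shifted ReLUs and measures how the worst-case error shrinks as units are added. The target function, interval and construction are assumptions for illustration. Here the ReLU weights are set by hand rather than learned.

```python
def relu(z):
    return max(0.0, z)

def target(x):
    return x ** 2

def build_relu_approximation(n_pieces):
    """Return g(x) = f(0) + sum_k c_k * relu(x - t_k) on [0, 1]."""
    h = 1.0 / n_pieces
    knots = [k * h for k in range(n_pieces)]
    slopes = [(target(t + h) - target(t)) / h for t in knots]
    # Each ReLU contributes the change in slope at its knot (one "bend")
    coeffs = [slopes[0]] + [
        slopes[k] - slopes[k - 1] for k in range(1, n_pieces)
    ]

    def g(x):
        return target(0.0) + sum(c * relu(x - t) for c, t in zip(coeffs, knots))

    return g

def max_error(n_pieces, n_checks=1001):
    g = build_relu_approximation(n_pieces)
    xs = [i / (n_checks - 1) for i in range(n_checks)]
    return max(abs(g(x) - target(x)) for x in xs)

for n in [1, 2, 10, 100]:
    print(f"{n:>3} ReLUs: max error = {max_error(n):.5f}")
```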

This is why deep learning is so powerful! 🧠

Knowledge Check #2

Test Your Understanding:

What does params.grad contain after calling loss.backward()?
Why do we multiply the gradient by the learning rate?
Why must we call params.grad.zero_() after each update?

Hint: Think back to the Excel Solver analogy!

Recap: The Complete Picture

Deep Learning = 3 Simple Ideas

1. MODEL: Stack ReLUs

(Universal approximation)

2. LOSS: Measure Prediction Error

(MSE, cross-entropy, etc.)

3. OPTIMIZER: Gradient Descent

params -= lr * params.grad

That's it! Everything else is engineering.

Part 3: Practical Application

You Now Understand HOW Models Learn

Let's apply this to your image classifiers

Week 1-2 Code:

learn = vision_learner(dls, resnet34)
learn.fine_tune(3)
            

Translation to What We Just Learned:

resnet34 = architecture (specific ReLU stacking pattern)
fine_tune(3) = run gradient descent: 1 epoch on the new head only, then 3 epochs on the whole model (fastai defaults)
FastAI's optimizer = fancy version of our params -= lr * grad

Can We Do Better Than ResNet?

ResNet-34 (2015):

21M parameters
7% error on Pets dataset
Trains in 20 seconds/epoch

Are there better architectures in 2024?

Answer: YES!

Enter the timm library...

The timm Library: 500+ Modern Architectures

from timm import list_models

len(list_models())  # → 500+ architectures!

list_models('convnext*')
# ['convnext_tiny', 'convnext_small', 'convnext_base', ...]

list_models('efficientnet*')
# ['efficientnetv2_rw_s', 'efficientnet_b0', ...]
        

How to Choose?

Next slides: The benchmark notebook...

Architecture Benchmark Results

Testing on Pets Dataset (same data as Week 1):

Model	Params	Time/Epoch	Error (%)
ResNet-34	21M	20s	7.2
EfficientNet-B0	5M	18s	5.2
ConvNeXt-tiny	28M	27s	4.1

ConvNeXt wins!
~43% relative error reduction (7.2% → 4.1%) for +7 seconds per epoch 🏆
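
The size of the improvement can be checked directly from the table values:

```python
resnet_error = 7.2     # ResNet-34 error (%), from the table
convnext_error = 4.1   # ConvNeXt-tiny error (%), from the table

absolute_drop = resnet_error - convnext_error  # percentage points
relative_drop = absolute_drop / resnet_error   # fraction of the baseline error
extra_seconds = 27 - 20                        # time per epoch, from the table

print(f"{absolute_drop:.1f} points, {relative_drop:.0%} relative, "
      f"+{extra_seconds}s/epoch")
```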

Trying a New Architecture (1 Line of Code!)

Week 1 Code:

learn = vision_learner(dls, resnet34, metrics=error_rate)

Week 4 Improved Code:

learn = vision_learner(dls, 'convnext_tiny_in22k', metrics=error_rate)
                            ↑
                    Just change the model name!
            

learn.fine_tune(3) # Same gradient descent, better architecture

The gradient descent algorithm doesn't change!
We just give it a better function to optimize.

What's Inside These Models?

All Modern CNNs Have the Same Structure:

1. Feature Extractor (pre-trained)

← Frozen initially

• Convolutional layers
• Nonlinear activations (ReLU or similar, e.g. GELU in ConvNeXt)
• Pooling layers

2. Classification Head (new)

← We train this

• Fully connected layers
• Final softmax

fine_tune(3):
Frozen phase (1 epoch by default): Only train head (Excel: adjust last 8 weights)
Then 3 epochs: Train whole model (adjust ALL weights)
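
The freeze-then-unfreeze idea can be sketched in plain PyTorch with a tiny stand-in model. This is a simplified illustration, not fastai's actual implementation. The layer sizes, data and learning rate are made up.

```python
import torch
from torch import nn

torch.manual_seed(0)

body = nn.Sequential(nn.Linear(3, 8), nn.ReLU())  # stand-in for the pre-trained part
head = nn.Linear(8, 1)                            # new classification head
model = nn.Sequential(body, head)

# Phase 1: freeze the body so gradient descent only touches the head
for p in body.parameters():
    p.requires_grad_(False)

body_before = body[0].weight.detach().clone()
head_before = head.weight.detach().clone()

x = torch.randn(16, 3)  # hypothetical batch
y = torch.randn(16, 1)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)

loss = ((model(x) - y) ** 2).mean()
loss.backward()
optimizer.step()

body_unchanged = torch.equal(body[0].weight, body_before)
head_changed = not torch.equal(head.weight, head_before)
print(body_unchanged, head_changed)

# Phase 2: unfreeze so later steps adjust ALL weights
for p in body.parameters():
    p.requires_grad_(True)
```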

Transfer Learning Revisited

Why Pre-training Matters:

From-Scratch Training:

Random weights
Learn edges → textures → objects
Needs millions of images
Takes weeks on GPUs

Transfer Learning:

Start with ImageNet weights
Already knows edges/textures
Needs hundreds of images
Takes minutes!

Gradient descent in both cases

Just different starting points!

Practical Experiment for Your Project

Workshop Task (Remaining Time):

Open your Week 2 pet classifier notebook
Try 3 different architectures:
- resnet34 (baseline)
- efficientnet_b0
- convnext_tiny_in22k
Record results:

Model	Error Rate	Training Time
resnet34
efficientnet_b0
convnext_tiny_in22k
