Week 4: Understanding How Models Actually Learn
Today's Journey:
- Build a neural network in Excel
- Understand gradient descent from first principles
- Translate Excel to PyTorch code
- Apply modern architectures to your projects
Note: First 20 minutes = Quick ML foundations refresher
(We need these concepts today!)
Don't worry if you're fuzzy on these - we're about to refresh them!
Today we'll understand step 3:
HOW do we adjust weights?
Week 1-2: Trained image classifiers
Week 3: Improved with data augmentation
But we never opened the black box...
Answer: Gradient descent on millions of weights!
See every calculation
Use Excel Solver as "optimizer"
Connect Excel formulas to Python code
Use better architectures (timm models)
Ready? Let's build! 🛠️
Task: Predict survival from Age, Sex, and Class
| Age | IsMale | Class | Survived? |
|---|---|---|---|
| 22 | 1 | 3 | 0 |
| 38 | 0 | 1 | 1 |
| 26 | 0 | 3 | 1 |
Inputs: [22, 1, 3]
Weights: [0.1, -0.5, 0.2]
Output = 22×0.1 + 1×(-0.5) + 3×0.2
= 2.2 - 0.5 + 0.6
= 2.3
This single formula is the heart of deep learning!
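The same weighted sum can be written in a few lines of plain Python. This is only a sketch of the arithmetic on the slide; the function name `weighted_sum` is made up for illustration.

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its weight and add the results (Excel's SUMPRODUCT)."""
    return sum(x * w for x, w in zip(inputs, weights))

inputs = [22, 1, 3]            # Age, IsMale, Class for the first passenger
weights = [0.1, -0.5, 0.2]     # the example weights from the slide
output = weighted_sum(inputs, weights)
print(round(output, 2))        # 2.3
```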
(Available on LearnJCU)
⏱️ 5-minute guided tour
We'll walk through each section together
Keep positive numbers, zero out negatives
Input: -2.5 → ReLU → 0
Input: 3.7 → ReLU → 3.7
Linear: X → X×W → Y (just a straight line)
With ReLU: X → X×W → ReLU(X×W) → Y (can bend!)
Multiple ReLUs stacked → Can approximate ANY continuous function!
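A small sketch of how ReLU lets the output "bend". The two hidden units and their weights below are hand-picked for illustration and are not taken from the spreadsheet.

```python
def relu(x):
    """Keep positive numbers, zero out negatives."""
    return max(0.0, x)

def tiny_network(x):
    # Two hidden units with hand-picked (hypothetical) weights and offsets.
    h1 = relu(1.0 * x - 1.0)     # switches on when x > 1
    h2 = relu(-1.0 * x - 1.0)    # switches on when x < -1
    return h1 + h2               # flat in the middle, rising on both sides

print(relu(-2.5), relu(3.7))     # 0.0 3.7
print([tiny_network(x) for x in (-3, -1, 0, 1, 3)])
```

A single straight line cannot be flat in the middle and rise on both sides; adding two ReLU units is enough to get that shape.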
Watch the loss decrease! 📉
| | Loss | Accuracy |
|---|---|---|
| Before: | 12.4 | 52% |
| After: | 3.1 | 78% |
It calculated "gradients" (slopes):
This is GRADIENT DESCENT!
| Excel | Deep Learning Term |
|---|---|
| Input columns | Input features |
| Weight cells | Model parameters |
| SUMPRODUCT | Matrix multiplication |
| MAX(0, x) | ReLU activation |
| Loss cell (MSE) | Loss function |
| Solver "Solve" | Optimizer.step() |
| Solver iterations | Training epochs |
Excel Solver ≈ PyTorch Optimizer!
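The mapping in the table can be sketched in PyTorch roughly as follows. The 3-8-1 layout without bias terms, the learning rate, and the use of plain SGD are assumptions made for illustration; the spreadsheet itself may be laid out differently.

```python
import torch
from torch import nn

torch.manual_seed(0)

# The three Titanic rows from the slides: Age, IsMale, Class -> Survived.
X = torch.tensor([[22.0, 1.0, 3.0], [38.0, 0.0, 1.0], [26.0, 0.0, 3.0]])
y = torch.tensor([[0.0], [1.0], [1.0]])

model = nn.Sequential(
    nn.Linear(3, 8, bias=False),   # weight cells + SUMPRODUCT
    nn.ReLU(),                     # MAX(0, x)
    nn.Linear(8, 1, bias=False),   # second set of weight cells
)
loss_fn = nn.MSELoss()             # loss cell (MSE)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # stands in for Solver

n_params = sum(p.numel() for p in model.parameters())

pred = model(X)                    # forward pass: every formula cell recalculates
loss = loss_fn(pred, y)
loss.backward()                    # compute the slope for every weight
optimizer.step()                   # one Solver-style adjustment of all weights
optimizer.zero_grad()              # clear the slopes before the next step
print(n_params, loss.item())
```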
⏱️ 2-minute pair discussion
Side-by-side comparison coming...
True function: y = 3x² + 2x + 1
Our job: Find coefficients from noisy samples
PyTorch does this automatically!
requires_grad = True
means "I want to optimize this parameter"
Same formula,
different syntax!
Before: params.grad = None
After: params.grad = tensor([-257., -30., -5.])
↑ ↑ ↑
"Slopes" for each parameter
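A minimal sketch of the before/after behaviour of `params.grad`. The three data points and the starting guess are made up for illustration, so the gradient values differ from the ones on the slide.

```python
import torch

# Illustrative data from the true function y = 3x^2 + 2x + 1 (no noise here).
x = torch.tensor([0.0, 1.0, 2.0])
y = 3 * x**2 + 2 * x + 1                      # tensor([1., 6., 17.])

# Starting guess for (a, b, c); requires_grad=True asks autograd to track it.
params = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)
grad_before = params.grad                     # None: nothing computed yet

pred = params[0] * x**2 + params[1] * x + params[2]
loss = ((pred - y) ** 2).mean()               # mean squared error
loss.backward()                               # fills in params.grad

print("before:", grad_before)
print("after: ", params.grad)
```

With this data all three slopes come out negative, which says each coefficient should be increased to reduce the loss.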
Solver calculates: "If I increase A1 by 0.01, loss changes by X"
PyTorch's autograd engine computes these slopes automatically for ALL weights!
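The "nudge a weight and watch the loss" idea can be sketched in plain Python and compared with the exact slope from calculus. The one-weight loss below is a made-up example, and this is not a description of how Solver works internally.

```python
def loss(w):
    """Squared error of a one-weight model on one made-up example (x=3, target=6)."""
    return (3.0 * w - 6.0) ** 2

w = 1.0
nudge = 0.01

# Nudge-and-measure estimate: change the weight slightly, see how the loss moves.
slope_estimate = (loss(w + nudge) - loss(w)) / nudge

# Exact slope from calculus: d/dw (3w - 6)^2 = 2 * (3w - 6) * 3.
slope_exact = 2 * (3.0 * w - 6.0) * 3.0

print(slope_estimate, slope_exact)
```

The estimate (about -17.91) is close to the exact value (-18.0). Automatic differentiation gives the exact value for every weight in one backward pass, without nudging each weight separately.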
Negative gradient → increase the weight (move UP)
Positive gradient → decrease the weight (move DOWN)
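Subtracting `lr * grad` produces both directions automatically. A small sketch on a toy loss, with an illustrative learning rate:

```python
def grad(w):
    """Slope of the toy loss (w - 2)^2, whose minimum sits at w = 2."""
    return 2.0 * (w - 2.0)

lr = 0.1  # learning rate (illustrative value)

w_low = 0.0                                # left of the minimum: slope is -4
w_low_next = w_low - lr * grad(w_low)      # 0.0 - 0.1 * (-4) = 0.4, moved up

w_high = 5.0                               # right of the minimum: slope is +6
w_high_next = w_high - lr * grad(w_high)   # 5.0 - 0.1 * 6 = 4.4, moved down

print(w_low_next, w_high_next)
```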
.backward() ADDS to existing gradients
Without zeroing: gradient = grad_step1 + grad_step2 + ... ❌
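The accumulation is easy to see on a single weight. This sketch calls `.backward()` twice without zeroing and then once after zeroing; the loss function is a made-up example.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

def loss_fn(w):
    return (3.0 * w - 6.0) ** 2    # slope at w = 1 is -18

loss_fn(w).backward()
first = w.grad.item()              # -18.0

loss_fn(w).backward()              # no zeroing in between
accumulated = w.grad.item()        # -36.0: the two gradients were added

w.grad.zero_()                     # reset before the next backward pass
loss_fn(w).backward()
fresh = w.grad.item()              # -18.0 again

print(first, accumulated, fresh)
```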
This is the entire training algorithm!
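Put together, the loop might look like the sketch below, which fits the coefficients of y = 3x² + 2x + 1 from noisy samples. The sample count, noise level, learning rate, and number of steps are assumptions chosen for illustration, not values from the lecture notebook.

```python
import torch

torch.manual_seed(0)

# Noisy samples of the true function y = 3x^2 + 2x + 1.
x = torch.linspace(-2, 2, 100)
y = 3 * x**2 + 2 * x + 1 + 0.1 * torch.randn(100)

params = torch.zeros(3, requires_grad=True)   # (a, b, c), all starting at 0
lr = 0.05                                     # learning rate (assumed value)

for step in range(1000):
    pred = params[0] * x**2 + params[1] * x + params[2]   # 1. predict
    loss = ((pred - y) ** 2).mean()                       # 2. measure the error
    loss.backward()                                       # 3. compute gradients
    with torch.no_grad():
        params -= lr * params.grad                        # 4. step downhill
    params.grad.zero_()                                   # 5. reset gradients

learned = params.detach().tolist()
print([round(v, 2) for v in learned])
```

The printed values should land close to 3, 2 and 1; the exact digits depend on the random noise.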
| | Param 0 (a) | Param 1 (b) | Param 2 (c) |
|---|---|---|---|
| Learned: | 3.01 | 1.98 | 1.02 |
| True: | 3.00 | 2.00 | 1.00 |
We recovered the function! 🎯
| Learning Rate | Behavior | Loss Progress |
|---|---|---|
| Too Small | Takes forever ⏰ | 10.0 → 9.8 → 9.6 → 9.4 → ... (100 epochs) |
| Just Right | Efficient convergence ✅ | 10.0 → 7.2 → 4.1 → 2.8 → 2.1 (10 epochs) |
| Too Big | Diverges! 💥 | 10.0 → 15.2 → 23.8 → 41.3 → ... |
We use lr * grad so step size is proportional to slope AND tunable
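The three behaviours can be reproduced on a toy loss. This sketch uses loss = w² and made-up learning rates, so the numbers differ from the table above.

```python
def final_loss(lr, steps=10, w=3.0):
    """Run plain gradient descent on loss = w^2 and return the final loss."""
    for _ in range(steps):
        grad = 2.0 * w          # slope of w^2 at the current w
        w = w - lr * grad       # the update rule
    return w ** 2

start_loss = 3.0 ** 2
for name, lr in [("too small", 0.001), ("reasonable", 0.1), ("too big", 1.1)]:
    print(f"{name:>10} lr={lr}: {start_loss:.1f} -> {final_loss(lr):.3f}")
```

With the smallest rate the loss barely moves in 10 steps, with the middle rate it drops well below 1, and with the largest rate every step overshoots the minimum and the loss grows.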
3 parameters (a, b, c)
32 parameters
21 MILLION parameters
The algorithm doesn't change!
Only the model complexity scales.
Stack enough ReLUs → Can approximate ANY continuous function!
This is why deep learning is so powerful! 🧠
Hint: Think back to the Excel Solver analogy!
(Universal approximation)
(MSE, cross-entropy, etc.)
That's it! Everything else is engineering.
Let's apply this to your image classifiers
Enter the timm library...
Next slides: The benchmark notebook...
| Model | Params | Time/Epoch | Error (%) |
|---|---|---|---|
| ResNet-34 | 21M | 20s | 7.2 |
| EfficientNet-B0 | 5M | 18s | 5.2 |
| ConvNeXt-tiny | 28M | 27s | 4.1 |
ConvNeXt wins!
43% lower error than ResNet-34 (7.2% → 4.1%) for +7 seconds per epoch 🏆
The gradient descent algorithm doesn't change!
We just give it a better function to optimize.
← Frozen initially
← We train this
fine_tune():
Epoch 1: Only train head (Excel: adjust last 8 weights)
Epochs 2-3: Train whole model (adjust ALL weights)
Gradient descent in both cases
Just different starting points!
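The freeze-then-unfreeze idea can be sketched in plain PyTorch. This is not fastai's `fine_tune()` implementation; the tiny `body` and `head` modules are hypothetical stand-ins for a pretrained network and its new final layer.

```python
import torch
from torch import nn

# Hypothetical stand-ins: a small "body" (pretrained layers) and a new "head".
body = nn.Sequential(nn.Linear(4, 16), nn.ReLU())
head = nn.Linear(16, 2)
model = nn.Sequential(body, head)

def count_trainable(module):
    """Number of weights that gradient descent is allowed to change."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Phase 1: freeze the body so gradient descent only moves the head.
for p in body.parameters():
    p.requires_grad_(False)
frozen_count = count_trainable(model)      # head only: 16*2 + 2 = 34

# Phase 2: unfreeze everything and keep training from where phase 1 ended.
for p in body.parameters():
    p.requires_grad_(True)
unfrozen_count = count_trainable(model)    # 4*16 + 16 + 34 = 114

print(frozen_count, unfrozen_count)
```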
| Model | Error Rate | Training Time |
|---|---|---|
| resnet34 | | |
| efficientnet_b0 | | |
| convnext_tiny_in22k | | |
Which is best for YOUR dataset?
← Today's focus
Next week: We'll dive deeper into #2 and #4!
But this core update rule NEVER changes! 🎯