M1 · Lesson 5 — Math Notation Literacy

Calculus Notation
in ML/RS

You don't need to do calculus to read RS papers.
You need to recognise what the notation is asking for.

01
M1 · L5 — Partial Derivatives

The fundamental notation

∂ vs d — what's
the difference?

\[ \frac{d\mathcal{L}}{d\theta} \quad \text{vs} \quad \frac{\partial\mathcal{L}}{\partial\theta} \]

Both ask: "how does the loss change as θ changes?" The difference:

  • d (straight d) — used when the function depends on only one variable
  • ∂ (curly d, "partial") — used when the function depends on multiple variables simultaneously. Hold all others fixed.

In ML/RS you will almost always see ∂. The loss ℒ depends on p_u, q_i, biases, etc. simultaneously, so the derivative is written ∂ℒ/∂p_u.

What the notation asks
∂ℒ/∂p_u
"If p_u changed slightly — by how much would ℒ change?" (holding all other params fixed)
∂ℒ/∂q_i
Same question but for item embedding q_i
∂ŷ/∂p_u
"How does the prediction change if we change p_u?"

Key insight: The partial derivative is a hypothetical rate of change — not a physical action. You're asking "what if p_u were slightly different?" not actually changing it.
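The "what if" reading can be checked numerically. The sketch below is only an illustration: the toy values for p_u, q_i and r_ui, the squared-error loss, and the helper name `partial` are all assumptions, not part of the lesson. It nudges one coordinate of p_u up and down by a tiny amount, keeps everything else fixed, and measures how much ℒ moves.

```python
# Hypothetical toy values: a 2-dimensional user embedding p_u, an item
# embedding q_i, and one observed rating r_ui. None come from the lesson.
p_u = [0.5, -1.0]
q_i = [2.0, 1.0]
r_ui = 3.0


def loss(p, q, r):
    """Squared error between the rating and the dot-product prediction."""
    y_hat = sum(pk * qk for pk, qk in zip(p, q))
    return (r - y_hat) ** 2


def partial(f, params, index, eps=1e-6):
    """Central-difference estimate of df/d(params[index]).

    Only params[index] is nudged; every other entry stays fixed.
    """
    up = list(params)
    down = list(params)
    up[index] += eps
    down[index] -= eps
    return (f(up) - f(down)) / (2 * eps)


# q_i and r_ui are held fixed inside the lambda; only p_u is varied.
dL_dp0 = partial(lambda p: loss(p, q_i, r_ui), p_u, 0)
dL_dp1 = partial(lambda p: loss(p, q_i, r_ui), p_u, 1)

print(dL_dp0, dL_dp1)
print(p_u)  # unchanged: the derivative asked "what if", it did not update p_u
```

With these toy numbers the two estimates come out close to -12 and -6, and p_u is still [0.5, -1.0] afterwards. A negative value means that increasing that coordinate would decrease the loss.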

02
M1 · L5 — The Gradient

Generalising to all parameters at once

The gradient ∇ — all
partial derivatives in one vector

\[ \nabla_\theta \mathcal{L} = \left[ \frac{\partial\mathcal{L}}{\partial\theta_1},\ \frac{\partial\mathcal{L}}{\partial\theta_2},\ \cdots,\ \frac{\partial\mathcal{L}}{\partial\theta_d} \right] \]

The gradient is a vector — same shape as θ — where each element is the partial derivative of ℒ with respect to one parameter.

The gradient points uphill: it is the direction of steepest increase of ℒ. Gradient descent steps in the opposite direction to decrease ℒ.

\[ \theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L} \]
Gradient descent update
θ ←
Replace θ with the updated value
θ
Current parameter values
−
Minus: move OPPOSITE to the gradient (downhill)
α
Learning rate — size of the step
∇_θ ℒ
The gradient — direction of steepest increase
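To see the update rule run, here is a minimal sketch on a made-up two-parameter loss. The bowl-shaped loss, the learning rate α = 0.1, the starting point, and the 100-step budget are all assumed for illustration; real RS losses are far less tidy.

```python
def loss(theta):
    """Hypothetical bowl-shaped loss with its minimum at theta = (3, -1)."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2


def grad(theta):
    """Gradient of the loss above: one partial derivative per parameter."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


alpha = 0.1          # learning rate (assumed value)
theta = [0.0, 0.0]   # starting parameters (assumed value)

for step in range(100):
    g = grad(theta)  # same shape as theta
    # theta <- theta - alpha * gradient: step against the uphill direction
    theta = [t - alpha * gk for t, gk in zip(theta, g)]

print(theta, loss(theta))
```

After the loop, θ sits very close to (3, -1) and the loss is essentially zero. Flipping the minus to a plus would make every step climb instead, and the loss would grow.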
03
M1 · L5 — argmin and argmax

The training objective in one symbol

argmin / argmax —
finding the best parameters

\[ \theta^* = \arg\min_\theta \mathcal{L}(\theta) \]
Decoded
θ*
The optimal parameters (star = optimal)
argmin_θ
"The value of θ that achieves the minimum". Here arg = argument: the input θ, not the loss value it produces.
ℒ(θ)
The loss as a function of parameters θ

min vs argmin: min_θ ℒ = the smallest value ℒ achieves. argmin_θ ℒ = the parameters θ that achieve that minimum. Training finds the argmin.

Notation           Meaning
min_θ ℒ            The minimum VALUE of the loss
argmin_θ ℒ         The PARAMETERS that achieve that minimum
max_θ ℒ            The maximum value (e.g., likelihood maximisation)
argmax_θ ℒ         Parameters that achieve the maximum
argtop-K_i f(i)    The K items i that give the highest f(i) scores

argtop-K is the recommendation version of argmax: "return the K items with the highest predicted score" = argtop-K_{i∈ℐ} ŷ_{ui}. With K = 1 it reduces to argmax.
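The value-versus-argument distinction maps directly onto code. In this sketch the item names, the scores, and the helper `arg_top_k` are hypothetical; it only shows which operation returns a score and which returns the thing that earned it.

```python
# Hypothetical predicted scores for one user over five items.
scores = {
    "item_a": 0.2,
    "item_b": 0.9,
    "item_c": 0.5,
    "item_d": 0.7,
    "item_e": 0.1,
}

best_value = max(scores.values())         # max: the highest score itself
best_item = max(scores, key=scores.get)   # argmax: the item achieving it

worst_value = min(scores.values())        # min: the lowest score itself
worst_item = min(scores, key=scores.get)  # argmin: the item achieving it


def arg_top_k(score_by_item, k):
    """Return the k items with the highest scores, best first."""
    ranked = sorted(score_by_item, key=score_by_item.get, reverse=True)
    return ranked[:k]


top_2 = arg_top_k(scores, 2)
print(best_value, best_item)
print(worst_value, worst_item)
print(top_2)
```

Here max gives 0.9 while argmax gives "item_b", and the top-2 list is ["item_b", "item_d"]. A recommender returns the items, so it wants the arg versions.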

04
M1 · L5 — Chain Rule

The foundation of backpropagation

The chain rule —
what backprop actually is

When a loss depends on θ through intermediate variables:

\[ \frac{\partial\mathcal{L}}{\partial\theta} = \frac{\partial\mathcal{L}}{\partial\hat{y}} \cdot \frac{\partial\hat{y}}{\partial\theta} \]

ŷ is an intermediate variable — a stepping stone. You never change it directly. You change θ → which changes ŷ → which changes ℒ. The chain rule tracks that flow.

Backpropagation = chain rule applied repeatedly through all layers, from ℒ back to the first layer's parameters.

MF example: ℒ = (r_{ui} - p_u^⊤q_i)², ŷ = p_u^⊤q_i
∂ℒ/∂ŷ
= -2(r_{ui} - ŷ) — how loss responds to prediction change
∂ŷ/∂p_u
= q_i — how prediction responds to p_u change
∂ℒ/∂p_u
= -2(r_{ui} - ŷ) · q_i — the full gradient (product of above)
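\[ \mathbf{p}_u \leftarrow \mathbf{p}_u - \alpha\left[-2(r_{ui} - \mathbf{p}_u^\top\mathbf{q}_i)\mathbf{q}_i + 2\lambda\mathbf{p}_u \right] \]
The 2λp_u term is the gradient of an L2 penalty λ‖p_u‖² on the embedding, which the example loss above leaves out; with λ = 0 the bracket is exactly ∂ℒ/∂p_u.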
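The MF chain can be traced with actual numbers. This is a sketch under assumptions: the embeddings, the rating, α = 0.01 and λ = 0.1 are toy values chosen for illustration, and the λ term is treated as an L2 penalty λ‖p_u‖² on the user embedding.

```python
# Hypothetical toy values; alpha and lam are assumed hyperparameters.
p_u = [0.5, -1.0]
q_i = [2.0, 1.0]
r_ui = 3.0
alpha = 0.01
lam = 0.1


def predict(p, q):
    """Prediction y_hat = p^T q."""
    return sum(pk * qk for pk, qk in zip(p, q))


def reg_loss(p, q, r, lam):
    """Squared error plus the assumed L2 penalty lam * ||p||^2."""
    return (r - predict(p, q)) ** 2 + lam * sum(pk * pk for pk in p)


y_hat = predict(p_u, q_i)

# Chain rule, one link at a time.
dL_dy = -2.0 * (r_ui - y_hat)          # how the loss responds to the prediction
dy_dp = q_i                            # how the prediction responds to p_u
dL_dp = [dL_dy * qk for qk in dy_dp]   # product of the two links

# Gradient of the L2 penalty is 2 * lam * p_u; add it to get the full bracket.
full_grad = [g + 2.0 * lam * pk for g, pk in zip(dL_dp, p_u)]

# One gradient descent step on p_u (q_i is held fixed here).
p_u_new = [pk - alpha * g for pk, g in zip(p_u, full_grad)]

print(dL_dp, full_grad, p_u_new)
print(reg_loss(p_u, q_i, r_ui, lam), reg_loss(p_u_new, q_i, r_ui, lam))
```

With these numbers the prediction starts at 0 against a rating of 3, so the error-driven part of the gradient is (-12, -6) and the step moves p_u towards q_i. The regularised loss is lower after the step than before it, which is the whole point of moving against the gradient.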
05
M1 · L5 — Key Takeaways

What to remember

01

∂ not d — because ML is multivariate

∂ℒ/∂θ = how ℒ changes w.r.t. θ, holding all other parameters fixed. In ML/RS it is almost always ∂.

02

∇ = vector of all ∂

The gradient points uphill. Gradient descent subtracts α·∇ to go downhill. The minus sign is essential.

03

argmin = find the best θ

Training IS argmin_θ ℒ(θ). Everything in the training loop is working toward this — finding the parameters that minimise the loss.

Next: M1 · L6 — Probability & Statistics Notation
