M1 · Lesson 5 — Math Notation Literacy

Calculus Notation in ML/RS

You don't need to do calculus to read RS papers.
You need to recognise what the notation is asking for.

01

M1 · L5 — Partial Derivatives

The fundamental notation

∂ vs d: what's the difference?

\[ \frac{d\mathcal{L}}{d\theta} \quad \text{vs} \quad \frac{\partial\mathcal{L}}{\partial\theta} \]

Both ask: "how does the loss change as θ changes?" The difference:

d (straight d) — used when the function depends on only one variable
∂ (curly d, "partial") — used when the function depends on multiple variables simultaneously. Hold all others fixed.

In ML/RS — always ∂. The loss ℒ depends on p_u, q_i, biases, etc. simultaneously. So it's always ∂ℒ/∂p_u.

What the notation asks

∂ℒ/∂p_u

"If p_u changed slightly — by how much would ℒ change?" (holding all other params fixed)

∂ℒ/∂q_i

Same question but for item embedding q_i

∂ŷ/∂p_u

"How does the prediction change if we change p_u?"

Key insight: The partial derivative is a hypothetical rate of change — not a physical action. You're asking "what if p_u were slightly different?" not actually changing it.
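The "what if p_u were slightly different?" question can be checked numerically with a finite difference. The sketch below is an illustration in plain Python rather than part of the lesson material: the helper names (`dot`, `squared_error`, `partial_wrt_p`), the example vectors and the step size `eps` are all assumptions chosen for the demo, and the squared-error loss is borrowed from the MF example later in this lesson.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def squared_error(p_u, q_i, r_ui):
    """Loss for one rating: (r_ui - p_u . q_i)^2."""
    return (r_ui - dot(p_u, q_i)) ** 2


def partial_wrt_p(p_u, q_i, r_ui, k, eps=1e-6):
    """Central-difference estimate of dL/dp_u[k], holding everything else fixed."""
    p_plus = list(p_u)   # work on copies: p_u itself is never modified
    p_minus = list(p_u)
    p_plus[k] += eps
    p_minus[k] -= eps
    return (squared_error(p_plus, q_i, r_ui)
            - squared_error(p_minus, q_i, r_ui)) / (2 * eps)


# Hypothetical values, chosen only for illustration.
p_u = [0.5, -1.0]
q_i = [2.0, 1.0]
r_ui = 3.0

estimate = partial_wrt_p(p_u, q_i, r_ui, k=0)
print(estimate)  # close to -2 * (r_ui - p_u . q_i) * q_i[0] = -12
```

Because the helper nudges copies of `p_u`, the original vector is left untouched, which mirrors the point that a partial derivative is a hypothetical question and not an actual parameter change.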

02

M1 · L5 — The Gradient

Generalising to all parameters at once

The gradient ∇: all partial derivatives in one vector

\[ \nabla_\theta \mathcal{L} = \left[\frac{\partial\mathcal{L}}{\partial\theta_1},\ \frac{\partial\mathcal{L}}{\partial\theta_2},\ \cdots,\ \frac{\partial\mathcal{L}}{\partial\theta_d}\right] \]

The gradient is a vector — same shape as θ — where each element is the partial derivative of ℒ with respect to one parameter.
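As a rough illustration of "one partial derivative per parameter", the sketch below builds a gradient by estimating each partial derivative separately and stacking the results. The toy loss, its `target` values and the function names are assumptions made up for this example; real frameworks compute gradients analytically instead of by finite differences.

```python
def loss(theta):
    """Toy loss with a known minimum at theta = [1, -2, 0.5]."""
    target = [1.0, -2.0, 0.5]
    return sum((t - c) ** 2 for t, c in zip(theta, target))


def numerical_gradient(f, theta, eps=1e-6):
    """Stack one central-difference partial derivative per parameter."""
    grad = []
    for k in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[k] += eps    # nudge only parameter k
        minus[k] -= eps   # all other parameters stay fixed
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


theta = [0.0, 0.0, 0.0]
grad = numerical_gradient(loss, theta)
print(len(grad) == len(theta))  # the gradient has one entry per parameter
print(grad)                     # close to 2 * (theta - target) = [-2, 4, -1]
```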

The gradient points uphill. The direction of steepest increase of ℒ. Gradient descent goes the opposite direction to decrease ℒ.

\[ \theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L} \]

Gradient descent update

θ ←

Replace θ with the updated value

θ

Current parameter values

−

Minus — move OPPOSITE to gradient (downhill)

α

Learning rate — size of the step

∇_θ ℒ

The gradient — direction of steepest increase
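The update rule decoded above can be sketched as a short loop. This is a minimal illustration under assumed values: the toy quadratic loss, the learning rate `alpha=0.1` and the step count are hypothetical choices, not settings taken from the lesson.

```python
def grad_loss(theta):
    """Analytic gradient of the toy loss sum((theta_k - target_k)^2)."""
    target = [1.0, -2.0, 0.5]
    return [2.0 * (t - c) for t, c in zip(theta, target)]


def gradient_descent(theta, alpha=0.1, steps=100):
    """Repeat theta <- theta - alpha * gradient for a fixed number of steps."""
    for _ in range(steps):
        g = grad_loss(theta)
        # Minus sign: step against the gradient, i.e. downhill.
        theta = [t - alpha * g_k for t, g_k in zip(theta, g)]
    return theta


theta_final = gradient_descent([0.0, 0.0, 0.0])
print(theta_final)  # approaches the minimiser [1, -2, 0.5]
```

With this loss each step shrinks the distance to the minimum by a factor of 1 - 2α, so a learning rate that is too large (here, above 1.0) would make the iterates diverge instead.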

03

M1 · L5 — argmin and argmax

The training objective in one symbol

argmin / argmax: finding the best parameters

\[ \theta^* = \arg\min_\theta \mathcal{L}(\theta) \]

Decoded

θ*

The optimal parameters (star = optimal)

argmin_θ

"The value of θ that achieves the minimum" — arg = argument (the input), not the value

ℒ(θ)

The loss as a function of parameters θ

min vs argmin: min_θ ℒ = the smallest value ℒ achieves. argmin_θ ℒ = the parameters θ that achieve that minimum. Training finds the argmin.
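The distinction is easy to see in code. The sketch below searches a small, hypothetical list of candidate values for a one-parameter toy loss; both the loss and the candidates are assumptions for illustration, since real training searches a continuous parameter space.

```python
def loss(theta):
    """Toy one-parameter loss with its minimum at theta = 2."""
    return (theta - 2.0) ** 2 + 1.0


# Hypothetical candidate values for theta.
candidates = [0.0, 1.0, 2.0, 3.0, 4.0]

min_value = min(loss(t) for t in candidates)  # min: the smallest loss VALUE
theta_star = min(candidates, key=loss)        # argmin: the INPUT that gives it

print(min_value)   # 1.0
print(theta_star)  # 2.0
```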

Notation	Meaning
min_θ ℒ	The minimum VALUE of the loss
argmin_θ ℒ	The PARAMETERS that achieve that minimum
max_θ ℒ	The maximum value (used when ℒ is an objective to maximise, e.g. a likelihood)
argmax_θ ℒ	Parameters that achieve the maximum
argtop-K_i f(i)	The K items i that give the highest f(i) scores

argtop-K generalises argmax for recommendation: "return the K items with highest predicted score" = argtop-K_{i∈ℐ} ŷ_{ui}. Plain argmax is the K = 1 case.
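A minimal sketch of top-K retrieval for one user, assuming the predicted scores are already available as a dictionary. The item ids, the scores and the helper name `arg_top_k` are hypothetical.

```python
def arg_top_k(scores, k):
    """Return the k item ids with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical predicted scores y_hat_ui for a single user u.
y_hat_u = {"item_a": 0.2, "item_b": 0.9, "item_c": 0.5, "item_d": 0.7}

top_2 = arg_top_k(y_hat_u, k=2)
best = max(y_hat_u, key=y_hat_u.get)  # argmax: the single best item

print(top_2)  # ['item_b', 'item_d']
print(best)   # item_b
```

Sorting every item is fine for a sketch; for large catalogues a partial selection such as `heapq.nlargest` avoids the full sort.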

04

M1 · L5 — Chain Rule

The foundation of backpropagation

The chain rule: what backprop actually is

When a loss depends on θ through intermediate variables:

\[ \frac{\partial\mathcal{L}}{\partial\theta} = \frac{\partial\mathcal{L}}{\partial\hat{y}} \cdot \frac{\partial\hat{y}}{\partial\theta} \]

ŷ is an intermediate variable — a stepping stone. You never change it directly. You change θ → which changes ŷ → which changes ℒ. The chain rule tracks that flow.

Backpropagation = chain rule applied repeatedly through all layers, from ℒ back to the first layer's parameters.

MF example: ℒ = (r_{ui} - p_u^⊤q_i)², ŷ = p_u^⊤q_i

∂ℒ/∂ŷ

= -2(r_{ui} - ŷ) — how loss responds to prediction change

∂ŷ/∂p_u

= q_i — how prediction responds to p_u change

∂ℒ/∂p_u

= -2(r_{ui} - ŷ) · q_i — the full gradient (product of above)

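\[ \mathbf{p}_u \leftarrow \mathbf{p}_u - \alpha\left[-2(r_{ui} - \mathbf{p}_u^\top\mathbf{q}_i)\mathbf{q}_i + 2\lambda\mathbf{p}_u\right] \]

This update includes a term 2λp_u that the unregularised ℒ above does not produce; it is what an L2 penalty λ‖p_u‖² added to the loss would contribute.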
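The three chain-rule factors and the final update can be traced step by step in code. The sketch below is an illustration in plain Python, assuming the loss is (r_ui - p_u^⊤q_i)² plus an L2 penalty λ‖p_u‖² (inferred from the 2λp_u term in the update, not stated in the lesson). The function name `sgd_step_p` and all numeric values are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def sgd_step_p(p_u, q_i, r_ui, alpha, lam):
    """One update of p_u, assuming loss (r_ui - p_u.q_i)^2 + lam * ||p_u||^2."""
    y_hat = dot(p_u, q_i)
    dL_dyhat = -2.0 * (r_ui - y_hat)  # how the loss responds to the prediction
    dyhat_dp = q_i                    # how the prediction responds to p_u
    # Chain rule (product of the two factors) plus the assumed L2 term.
    grad = [dL_dyhat * q_k + 2.0 * lam * p_k
            for q_k, p_k in zip(dyhat_dp, p_u)]
    return [p_k - alpha * g_k for p_k, g_k in zip(p_u, grad)]


# Hypothetical values, chosen only for illustration.
p_u = [0.5, -1.0]
q_i = [2.0, 1.0]
r_ui = 3.0

p_new = sgd_step_p(p_u, q_i, r_ui, alpha=0.01, lam=0.1)
error_before = (r_ui - dot(p_u, q_i)) ** 2
error_after = (r_ui - dot(p_new, q_i)) ** 2
print(p_new)
print(error_before, error_after)  # the squared error shrinks after the step
```

Only p_u is updated here; a full training step would apply the symmetric update to q_i, with ∂ŷ/∂q_i = p_u.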

05

M1 · L5 — Key Takeaways

What to remember

01

∂ not d — because ML is multivariate

∂ℒ/∂θ = how ℒ changes w.r.t. θ, holding all other parameters fixed. Always ∂ in ML/RS — always.

02

∇ = vector of all ∂

The gradient points uphill. Gradient descent subtracts α·∇ to go downhill. The minus sign is essential.

03

argmin = find the best θ

Training IS argmin_θ ℒ(θ). Everything in the training loop is working toward this — finding the parameters that minimise the loss.

Next: M1 · L6 — Probability & Statistics Notation
