You don't need to do calculus to read RS papers.
You need to recognise what the notation is asking for.
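Both d and ∂ ask: "how does the loss change as θ changes?" The difference: d (ordinary derivative) is for a function of a single variable; ∂ (partial derivative) is for a function of several variables, where one variable is varied and the rest are held fixed.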
In ML/RS — always ∂. The loss ℒ depends on p_u, q_i, biases, etc. simultaneously. So it's always ∂ℒ/∂p_u.
Key insight: The partial derivative is a hypothetical rate of change — not a physical action. You're asking "what if p_u were slightly different?" not actually changing it.
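To make the "what if p_u were slightly different?" reading concrete, here is a minimal sketch. It assumes a single rating, scalar factors, and a squared-error loss; none of these choices come from the lesson, and the names `loss`, `partial_wrt_p_u`, and `eps` are illustrative only. The partial derivative is estimated by evaluating the loss at two hypothetical values of p_u while q_i stays fixed.

```python
def loss(p_u, q_i, r_ui):
    """Assumed squared-error loss for one rating, with scalar factors."""
    y_hat = p_u * q_i                 # predicted rating
    return (r_ui - y_hat) ** 2


def partial_wrt_p_u(p_u, q_i, r_ui, eps=1e-6):
    """Central finite difference: nudge p_u only, hold q_i and r_ui fixed."""
    up = loss(p_u + eps, q_i, r_ui)   # "what if p_u were slightly larger?"
    down = loss(p_u - eps, q_i, r_ui) # "what if p_u were slightly smaller?"
    return (up - down) / (2 * eps)


p_u, q_i, r_ui = 0.5, 2.0, 3.0
numeric = partial_wrt_p_u(p_u, q_i, r_ui)
analytic = -2 * (r_ui - p_u * q_i) * q_i   # the same derivative, done by hand

print(numeric, analytic)   # the two values should agree closely
print(p_u)                 # p_u itself was never modified
```

Note that `p_u` is still 0.5 at the end: the derivative was computed from hypothetical evaluations, not from an update.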
The gradient is a vector — same shape as θ — where each element is the partial derivative of ℒ with respect to one parameter.
The gradient points uphill: it is the direction of steepest increase of ℒ. Gradient descent moves in the opposite direction to decrease ℒ.
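A small sketch of both statements, under assumptions: the bowl-shaped `loss` below is a made-up toy with two parameters, and `alpha` is an assumed step size. It shows that the gradient has one entry per parameter, and that stepping against it lowers the loss while stepping along it raises the loss.

```python
def loss(theta):
    """Toy loss over two parameters: L = (a - 1)^2 + (b + 2)^2."""
    a, b = theta
    return (a - 1.0) ** 2 + (b + 2.0) ** 2


def gradient(theta):
    """One partial derivative per parameter, so the result has theta's shape."""
    a, b = theta
    return [2.0 * (a - 1.0), 2.0 * (b + 2.0)]


theta = [3.0, 0.0]
alpha = 0.1                      # assumed step size
grad = gradient(theta)           # [4.0, 4.0]

# Step against the gradient (descent) and along it (ascent).
downhill = [t - alpha * g for t, g in zip(theta, grad)]
uphill = [t + alpha * g for t, g in zip(theta, grad)]

print(loss(theta), loss(downhill), loss(uphill))
```

The three printed losses come out in increasing order from `downhill` to `theta` to `uphill`, which is the reason the update rule uses a minus sign.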
min vs argmin: min_θ ℒ = the smallest value ℒ achieves. argmin_θ ℒ = the parameters θ that achieve that minimum. Training finds the argmin.
| Notation | Meaning |
|---|---|
| min_θ ℒ | The minimum VALUE of the loss |
| argmin_θ ℒ | The PARAMETERS that achieve that minimum |
| max_θ ℒ | The maximum value (e.g., likelihood maximisation) |
| argmax_θ ℒ | Parameters that achieve the maximum |
| argtop-K_i f(i) | The K items i that give the highest f(i) scores |
For recommendation, argmax generalises to argtop-K: "return the K items with the highest predicted score" = argtop-K_{i∈ℐ} ŷ_{ui}
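The distinction between a value and the argument that produces it can be sketched in a few lines of Python. Everything here is hypothetical: the one-parameter `loss`, the grid of `candidates`, and the item scores are invented for illustration, and real training does not search a grid.

```python
import heapq


def loss(theta):
    """Toy one-parameter loss with its minimum at theta = 2."""
    return (theta - 2.0) ** 2 + 1.0


candidates = [0.0, 1.0, 2.0, 3.0, 4.0]        # hypothetical grid of theta values

min_value = min(loss(t) for t in candidates)  # min: the smallest loss VALUE
argmin_theta = min(candidates, key=loss)      # argmin: the theta that achieves it

# Hypothetical predicted scores for one user over five items.
scores = {"item_a": 0.2, "item_b": 0.9, "item_c": 0.5, "item_d": 0.7, "item_e": 0.1}

# argtop-K with K = 2: return the item ids, not the scores.
top_k = heapq.nlargest(2, scores, key=scores.get)

print(min_value)      # 1.0
print(argmin_theta)   # 2.0
print(top_k)          # ['item_b', 'item_d']
```

`min` returns a loss, `argmin` returns a parameter, and argtop-K returns item ids; the scores themselves are only used for ranking.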
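When a loss depends on θ through intermediate variables, the chain rule multiplies the rates of change along the path. For a single intermediate ŷ: ∂ℒ/∂θ = (∂ℒ/∂ŷ) · (∂ŷ/∂θ)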
ŷ is an intermediate variable — a stepping stone. You never change it directly. You change θ → which changes ŷ → which changes ℒ. The chain rule tracks that flow.
Backpropagation = chain rule applied repeatedly through all layers, from ℒ back to the first layer's parameters.
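The flow θ → ŷ → ℒ can be traced by hand for a tiny model. This is a sketch under assumptions not stated in the lesson: the prediction is a dot product of two-dimensional factors, the loss is squared error on one rating, and the function names `forward` and `backward` are illustrative. Each gradient is the product of "how ℒ changes with ŷ" and "how ŷ changes with the parameter".

```python
def forward(p_u, q_i, r_ui):
    """theta -> y_hat -> loss, keeping the intermediate y_hat."""
    y_hat = sum(p * q for p, q in zip(p_u, q_i))   # intermediate variable
    loss = (r_ui - y_hat) ** 2
    return y_hat, loss


def backward(p_u, q_i, r_ui):
    """Chain rule: dL/dtheta = dL/dy_hat * dy_hat/dtheta."""
    y_hat, _ = forward(p_u, q_i, r_ui)
    dL_dyhat = -2.0 * (r_ui - y_hat)               # how the loss reacts to y_hat
    grad_p = [dL_dyhat * q for q in q_i]           # dy_hat/dp_u[k] = q_i[k]
    grad_q = [dL_dyhat * p for p in p_u]           # dy_hat/dq_i[k] = p_u[k]
    return grad_p, grad_q


p_u, q_i, r_ui = [0.5, 1.0], [2.0, 1.0], 3.0
grad_p, grad_q = backward(p_u, q_i, r_ui)

print(grad_p)   # [-4.0, -2.0]
print(grad_q)   # [-1.0, -2.0]
```

With more layers the same pattern repeats: each layer contributes one more factor to the product, and the factors are computed from the loss backwards.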
∂ℒ/∂θ = how ℒ changes w.r.t. θ, holding all other parameters fixed. Always ∂ in ML/RS — always.
The gradient points uphill. Gradient descent subtracts α·∇ to go downhill. The minus sign is essential.
Training IS argmin_θ ℒ(θ). Everything in the training loop is working toward this — finding the parameters that minimise the loss.
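As a rough illustration of training as a search for the argmin, the loop below fits scalar user and item factors to four made-up ratings with per-sample gradient steps. The data, the initial values, the step size `alpha`, and the epoch count are all assumptions chosen to keep the sketch small; it is not a recipe for a real recommender.

```python
# Hypothetical ratings as (user, item, rating) triples.
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 1.0)]

p = [1.0, 1.0]   # one scalar factor per user (assumed initial values)
q = [1.0, 1.0]   # one scalar factor per item
alpha = 0.02     # assumed step size


def total_loss(p, q):
    """Sum of squared errors over all observed ratings."""
    return sum((r - p[u] * q[i]) ** 2 for u, i, r in ratings)


initial_loss = total_loss(p, q)

for epoch in range(200):
    for u, i, r in ratings:
        err = r - p[u] * q[i]
        grad_p = -2.0 * err * q[i]     # partial derivative w.r.t. p[u]
        grad_q = -2.0 * err * p[u]     # partial derivative w.r.t. q[i]
        p[u] -= alpha * grad_p         # minus sign: move downhill
        q[i] -= alpha * grad_q

final_loss = total_loss(p, q)
print(initial_loss, final_loss)
```

The values left in `p` and `q` after the loop are the approximate argmin; `final_loss` is the corresponding (approximate) min.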