M1 · Lesson 4 — Math Notation Literacy

Norms, Distances & Regularisation

Most RS loss functions end with a term like λ‖θ‖². Learn what norms measure
and why regularisation is not optional.

01
M1 · L4 — What is a Norm

Measuring the size of a vector

A norm answers: "how big is this vector?"

For vector x = [x₁, x₂, ..., x_d], three norms matter in RS:

\[ \|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_d^2} \quad \text{(L2, Euclidean)} \]
\[ \|\mathbf{x}\|_1 = |x_1| + |x_2| + \cdots + |x_d| \quad \text{(L1, Manhattan)} \]
\[ \|\mathbf{M}\|_F = \sqrt{\sum_i\sum_j m_{ij}^2} \quad \text{(Frobenius, for matrices)} \]

Norm | Shape | Used in RS for
‖x‖₂ | Vector | L2 regularisation: λ‖p_u‖₂²
‖x‖₁ | Vector | Sparse regularisation (less common)
‖M‖_F | Matrix | Regularising full embedding matrix P or Q
‖x‖ (no subscript) | Vector | Almost always means L2 in RS context

No subscript = L2. When you see ‖·‖ without a subscript in an RS paper, assume L2 norm unless stated otherwise.
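
The three definitions above can be checked by hand on small inputs. The following is a minimal sketch in plain Python; the function names (`l2_norm`, `l1_norm`, `frobenius_norm`) and the example values are illustrative assumptions, not part of the lesson.

```python
import math

def l2_norm(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(v * v for v in x))

def l1_norm(x):
    """Manhattan length: sum of absolute values of the components."""
    return sum(abs(v) for v in x)

def frobenius_norm(m):
    """Frobenius norm: the L2 norm of all matrix entries taken together."""
    return math.sqrt(sum(v * v for row in m for v in row))

x = [3.0, -4.0]                    # hypothetical vector
M = [[1.0, 2.0], [2.0, 4.0]]       # hypothetical 2x2 matrix

print(l2_norm(x))         # 5.0  (sqrt(9 + 16))
print(l1_norm(x))         # 7.0  (3 + 4)
print(frobenius_norm(M))  # 5.0  (sqrt(1 + 4 + 4 + 16))
```

Note that the Frobenius norm treats the matrix as one long vector, which is why it plays the same role for P and Q that ‖x‖₂ plays for a single embedding.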

02
M1 · L4 — Dot Product and Cosine

How RS models measure similarity

Dot product vs cosine similarity

Dot Product

\[ \mathbf{p}_u \cdot \mathbf{q}_i = \sum_{k=1}^d p_{uk} \cdot q_{ik} = \mathbf{p}_u^\top \mathbf{q}_i \]

Multiply corresponding dimensions, sum up. Higher value = more aligned. Used in MF, LightGCN, BPR.

p_u^⊤ q_i and p_u · q_i are the same thing — transpose notation and dot notation are interchangeable for vectors.

Cosine Similarity

\[ \cos(\mathbf{p}_u, \mathbf{q}_i) = \frac{\mathbf{p}_u^\top \mathbf{q}_i}{\|\mathbf{p}_u\|_2 \cdot \|\mathbf{q}_i\|_2} \]

Dot product divided by both vectors' lengths. Values ∈ [−1, 1]. Measures direction only, not magnitude.

Key difference

Dot product rewards alignment AND magnitude, so embeddings with larger norms score higher. Cosine rewards alignment only. RS typically uses dot product because embedding magnitude can carry useful signal, for example how active a user is.
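
The difference shows up as soon as one vector is rescaled. Below is a small sketch in plain Python; the helper names `dot` and `cosine` and the example embeddings are assumptions made for illustration.

```python
import math

def dot(p, q):
    """Multiply corresponding dimensions and sum."""
    return sum(a * b for a, b in zip(p, q))

def cosine(p, q):
    """Dot product divided by the product of the two L2 norms."""
    return dot(p, q) / (math.sqrt(dot(p, p)) * math.sqrt(dot(q, q)))

p_u = [1.0, 2.0]                      # hypothetical user embedding
q_i = [2.0, 1.0]                      # hypothetical item embedding
q_i_scaled = [3.0 * v for v in q_i]   # same direction, 3x the magnitude

# Dot product grows with the item's magnitude: prints 4.0 12.0
print(dot(p_u, q_i), dot(p_u, q_i_scaled))

# Cosine is unchanged by the rescaling: both values are approximately 0.8
print(cosine(p_u, q_i), cosine(p_u, q_i_scaled))
```

Scaling the item embedding by 3 triples the dot-product score, while the cosine stays the same up to floating-point rounding.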

03
M1 · L4 — Regularisation

Why λ‖θ‖² appears in every RS loss

Regularisation: preventing the model from going wild

Without regularisation, a model can set embedding values arbitrarily large to fit the training data perfectly, then fail on user-item pairs it has not seen. This is overfitting.

\[ \mathcal{L} = \underbrace{\sum_{(u,i)\in\mathcal{O}} (r_{ui} - \hat{r}_{ui})^2}_{\text{fit the data}} + \underbrace{\lambda\|\theta\|^2}_{\text{stay small}} \]

Why squared norm? ‖θ‖² is differentiable everywhere. Plain ‖θ‖ has a kink at zero. The squared version gives cleaner gradients for gradient descent.
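
Writing out both gradients makes the point concrete, taking ‖·‖ to be the L2 norm as elsewhere in this lesson:

\[ \nabla_\theta \|\theta\|_2^2 = 2\theta \qquad \text{(defined for every } \theta \text{, including } \theta = \mathbf{0}\text{)} \]
\[ \nabla_\theta \|\theta\|_2 = \frac{\theta}{\|\theta\|_2} \qquad \text{(undefined at } \theta = \mathbf{0}\text{)} \]

As a consequence, one gradient descent step on the penalty term alone shrinks every parameter by the same factor. Here η is a learning rate, a symbol assumed for this illustration and not defined in the lesson:

\[ \theta \leftarrow \theta - \eta \cdot 2\lambda\theta = (1 - 2\eta\lambda)\,\theta \]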

λ value | Effect
λ = 0 | No regularisation → model can overfit to training data
λ very small | Weak regularisation → can still overfit
λ = 0.001 | A common starting point in RS (tune on a validation set)
λ very large | All embeddings pushed toward zero → model underfits

\[ \lambda(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2) \]

MF version: regularise both the user matrix P and the item matrix Q, here with a shared λ.
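
The matrix form and the per-vector form λ‖p_u‖₂² from the norm table are the same penalty, because the squared Frobenius norm is the sum of the squared L2 norms of the rows. The sketch below checks this on toy data; the matrices, the helper names, and the value `lam = 0.5` (chosen only to keep the arithmetic readable) are illustrative assumptions.

```python
def squared_l2(x):
    """Squared L2 norm of one embedding vector."""
    return sum(v * v for v in x)

def squared_frobenius(m):
    """Squared Frobenius norm: sum of all squared matrix entries."""
    return sum(v * v for row in m for v in row)

P = [[1.0, 2.0], [0.0, 3.0]]               # 2 hypothetical user embeddings
Q = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]   # 3 hypothetical item embeddings
lam = 0.5                                  # illustrative, not a recommended value

# Penalty written with Frobenius norms of the full matrices.
penalty_matrix_form = lam * (squared_frobenius(P) + squared_frobenius(Q))

# Same penalty written as a sum over individual user and item vectors.
penalty_row_form = lam * (sum(squared_l2(p) for p in P)
                          + sum(squared_l2(q) for q in Q))

print(penalty_matrix_form, penalty_row_form)  # 10.5 10.5
```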

04
M1 · L4 — Full MF Loss

Complete example

The full MF loss: every symbol decoded

\[ \mathcal{L} = \sum_{(u,i)\in\mathcal{O}} \left(r_{ui} - \mathbf{p}_u^\top \mathbf{q}_i\right)^2 + \lambda\left(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\right) \]

Every piece
∑_{(u,i)∈𝒪} | Loop over all observed user-item pairs
r_{ui} | True rating, a scalar
p_u^⊤ q_i | Predicted rating: dot product of user and item embeddings
(r − p^⊤q)² | Squared prediction error for this pair
λ(‖P‖²_F + ‖Q‖²_F) | Regularisation: penalise large values in both embedding matrices

Plain English
Learn user and item embeddings so their dot products match observed ratings — while keeping the embeddings from growing too large.
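
The loss can be evaluated directly from its definition. The following is a minimal sketch in plain Python, not a training loop; the function name `mf_loss`, the `(u, i, r_ui)` triple layout for observed ratings, and the toy values are assumptions for illustration.

```python
def dot(p, q):
    """Predicted rating: dot product of a user and an item embedding."""
    return sum(a * b for a, b in zip(p, q))

def mf_loss(observed, P, Q, lam):
    """Squared-error MF loss with Frobenius regularisation.

    observed: list of (u, i, r_ui) triples, one per observed pair in O.
    P, Q: user and item embedding matrices as lists of rows.
    """
    # Fit term: squared prediction error summed over observed pairs only.
    fit = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in observed)
    # Regularisation term: lam * (||P||_F^2 + ||Q||_F^2).
    reg = lam * (sum(v * v for row in P for v in row)
                 + sum(v * v for row in Q for v in row))
    return fit + reg

P = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical user embeddings
Q = [[2.0, 0.0], [0.0, 3.0]]            # hypothetical item embeddings
observed = [(0, 0, 3.0), (1, 1, 3.0)]   # (user index, item index, rating)

print(mf_loss(observed, P, Q, lam=0.0))  # 1.0  (fit term only)
print(mf_loss(observed, P, Q, lam=0.5))  # 8.5  (1.0 fit + 0.5 * 15 penalty)
```

With λ = 0 only the prediction error counts; raising λ adds a cost proportional to the total squared size of both embedding matrices, which is what pulls the learned values toward zero.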
05
M1 · L4 — Key Takeaways

What to remember

01

‖x‖ (no subscript) = L2

Squared, it is the sum of squared components. For matrices, the analogue is the Frobenius norm. Assume L2 in RS unless stated otherwise.

02

Dot product = alignment + magnitude

p_u^⊤ q_i = ∑ p_{uk}·q_{ik}. Cosine removes magnitude effect. RS models typically prefer dot product.

03

λ‖θ‖² = keep embeddings small

λ=0 → overfit. λ too large → underfit. Always tune λ on a validation set. The norm is squared because that makes it differentiable everywhere, including at zero.

Next: M1 · L5 — Calculus Notation in ML/RS
