M1 · Lesson 6 — Math Notation Literacy

Probability &
Statistics Notation

VAE-CF, Bayesian MF, diffusion-based RS — entire model families
require reading P(·), 𝔼[·], and KL divergence fluently.

01
M1 · L6 — Core Notation

The building blocks

Probability notation
you must recognise

Notation | Read As | Meaning | RS Context
P(A) | "probability of A" | Likelihood of event A occurring | P(click) = probability user clicks an item
P(A|B) | "prob of A given B" | Conditional probability: A given that B occurred | P(r_{ui}|u,i,θ) = prob of rating given user, item, params
𝔼[X] | "expectation of X" | Weighted average of X over its distribution | 𝔼_{(u,i)~𝒟}[ℒ] = expected loss over training data
x ~ P | "x distributed as P" | x is sampled from distribution P | z ~ 𝒩(0,I) = z drawn from standard Gaussian
𝒩(μ, σ²) | "Normal distribution" | Gaussian with mean μ and variance σ² | Prior over embeddings in Bayesian MF
KL(Q‖P) | "KL divergence from P to Q" | Gap between distributions Q and P | Regularisation in VAE-CF latent space
02
M1 · L6 — Expectation

𝔼[X] — the weighted average

Expectation — "on average,
what value does X take?"

\[ \mathbb{E}[X] = \sum_x x \cdot P(X = x) \]
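
As a worked instance of the sum above, take a hypothetical rating X (values invented purely for illustration) that equals 1 with probability 0.2, 3 with probability 0.5, and 5 with probability 0.3:

\[ \mathbb{E}[X] = 1 \cdot 0.2 + 3 \cdot 0.5 + 5 \cdot 0.3 = 3.2 \]

Each value counts in proportion to its probability, which is why 𝔼 is a weighted average rather than a plain one.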

The subscript on 𝔼 tells you what's being randomly sampled:

\[ \mathbb{E}_{(u,i)\sim\mathcal{D}}\left[\mathcal{L}(u, i, \theta)\right] \]
𝔼_{(u,i)~𝒟}
Sample (u,i) pairs randomly from training data 𝒟
[ℒ(u,i,θ)]
Compute the loss for that sampled pair
Plain English
The average loss you'd see if you randomly picked a training pair.

In practice: Computing 𝔼 exactly means averaging the loss over all of 𝒟 at every step, which is too expensive, so we approximate it with mini-batches. SGD optimises the expected loss by sampling random batches and using each batch's average loss as an estimate.
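
The mini-batch idea can be sketched in a few lines of plain Python. Everything here is a toy assumption made for illustration: the ratings are random, the "model" is a constant prediction of 3.0, and the loss is squared error. The point is only that a batch average approximates the full-data average.

```python
import random

def mean(values):
    return sum(values) / len(values)

def squared_error(rating, prediction):
    return (rating - prediction) ** 2

rng = random.Random(0)

# Hypothetical toy data: 10,000 ratings on a 1..5 scale (not from any real dataset).
ratings = [rng.choice([1, 2, 3, 4, 5]) for _ in range(10_000)]
prediction = 3.0  # stand-in for a model's output

# "Exact" expectation: average loss over every training example.
full_loss = mean([squared_error(r, prediction) for r in ratings])

# Mini-batch estimate: average loss over 256 randomly sampled examples.
batch = rng.sample(ratings, 256)
batch_loss = mean([squared_error(r, prediction) for r in batch])

# Averaging many independent batch estimates gets close to the full-data value.
batch_estimates = [
    mean([squared_error(r, prediction) for r in rng.sample(ratings, 256)])
    for _ in range(200)
]
avg_of_batches = mean(batch_estimates)

print(full_loss, batch_loss, avg_of_batches)
```

A single batch is a noisy estimate; the noise shrinks as the batch grows, and it averages out across many SGD steps.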

Common usage in RS papers

"We minimise the expected BPR loss 𝔼_{(u,i,j)~𝒟}[ℒ_BPR(u,i,j)]" is standard phrasing. The expectation usually signals that they're training with mini-batch SGD over sampled (u,i,j) triples.

03
M1 · L6 — KL Divergence

The gap between two distributions

KL divergence —
how different are Q and P?

\[ D_{KL}(Q \| P) = \sum_x Q(x) \ln \frac{Q(x)}{P(x)} \]
  • KL = 0 means Q and P are identical
  • KL ≥ 0 always (never negative)
  • Not symmetric: D_KL(Q‖P) ≠ D_KL(P‖Q)
  • Q = what you have. P = what you want.

Plain English: KL = gap between your model's distribution and your target distribution. In VAE-CF: gap between encoder output and the Gaussian prior you want.
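
The properties listed above can be checked numerically with a minimal sketch. The two distributions are made-up examples over three outcomes, and the function assumes P(x) > 0 wherever Q(x) > 0 (otherwise the KL is infinite and this code would fail).

```python
import math

def kl_divergence(q, p):
    """D_KL(Q || P) = sum_x Q(x) * ln(Q(x) / P(x)) for discrete distributions.

    Terms with Q(x) = 0 contribute nothing and are skipped.
    """
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

# Hypothetical distributions over three outcomes.
q = [0.7, 0.2, 0.1]        # "what you have"
p = [1 / 3, 1 / 3, 1 / 3]  # "what you want"

kl_qp = kl_divergence(q, p)  # D_KL(Q || P)
kl_pq = kl_divergence(p, q)  # D_KL(P || Q), a different number
kl_qq = kl_divergence(q, q)  # identical distributions

print(kl_qp, kl_pq, kl_qq)
```

Swapping the arguments changes the result, which is the asymmetry the bullet list warns about.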

VAE-CF context
P = 𝒩(0,I)
The PRIOR — your target distribution (tidy Gaussian you want)
Q = q(z|x)
The approximate POSTERIOR: what your encoder actually produces for user x
D_KL(Q‖P)
How far the encoder output drifts from the target Gaussian

Why Gaussian prior? The KL between two Gaussians has a closed-form solution, making training tractable. It's a convenience choice, not a law — other priors are valid research contributions.
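
To make "closed-form" concrete, assume the encoder outputs a diagonal Gaussian q(z|x) = 𝒩(μ, diag(σ²)). That is a common modelling choice, but it is an assumption here, not something stated on this slide. Under it, the KL to the standard Gaussian prior is a simple sum over the latent dimensions j:

\[ D_{KL}\left(\mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right) = \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - 1 - \ln \sigma_j^2 \right) \]

A small sketch of the same formula, parameterised by the log-variance (the function name is illustrative):

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# Encoder output equal to the prior: zero mean, unit variance (log_var = 0).
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 while keeping unit variance costs 0.5 * 1^2.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0])

print(kl_at_prior, kl_shifted)
```

No sampling or numerical integration is needed, which is what keeps this term cheap during training.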

04
M1 · L6 — VAE-CF ELBO

The most important probabilistic equation in RS

The VAE-CF ELBO —
fully decoded

\[ \mathcal{L}_{ELBO} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\ln p(\mathbf{x}|\mathbf{z})\right] - \beta \cdot D_{KL}\left(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})\right) \]
Every symbol
𝔼_{q(z|x)}[·]
Average over latent vectors z sampled from the encoder
ln p(x|z)
Log-likelihood of reconstructing user interaction history x from latent vector z
D_KL(q‖p)
Penalty for the latent space drifting from the Gaussian prior
β
Hyperparameter controlling how strongly to enforce the prior
Plain English
Learn a compressed representation of each user that reconstructs their interaction history, while keeping the latent space close to a standard Gaussian.
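
The two terms can be put together in a rough single-user sketch. Several pieces are assumptions made only for illustration and are not taken from this lesson: a diagonal-Gaussian encoder output (mu, log_var), a hypothetical linear decoder, a multinomial (softmax) likelihood for ln p(x|z), a single sampled z to estimate the expectation, and all the numbers.

```python
import math
import random

def log_softmax(logits):
    # Numerically stable log-softmax over item logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def elbo_estimate(x, mu, log_var, decoder_weights, beta, rng):
    # Sample z ~ q(z|x) as mu + sigma * eps with eps ~ N(0, I).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    # Hypothetical linear decoder: one logit per item.
    logits = [sum(w * zj for w, zj in zip(row, z)) for row in decoder_weights]
    log_probs = log_softmax(logits)
    # One-sample estimate of E_q[ln p(x|z)], multinomial likelihood up to a constant.
    reconstruction = sum(xi * lp for xi, lp in zip(x, log_probs))
    kl = kl_to_standard_normal(mu, log_var)
    return reconstruction - beta * kl, reconstruction, kl

rng = random.Random(0)
x = [1, 0, 1, 0]                  # toy interaction history over 4 items
mu, log_var = [0.5, -0.3], [-1.0, -0.5]  # toy encoder output, 2 latent dims
decoder_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.2]]

elbo, reconstruction, kl = elbo_estimate(x, mu, log_var, decoder_weights, beta=0.2, rng=rng)
print(elbo, reconstruction, kl)
```

Training would maximise this quantity (or minimise its negative) with respect to encoder and decoder parameters; raising beta pulls the latent space harder towards the prior at the cost of reconstruction.
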
05
M1 · L6 — Key Takeaways

What to remember

01

P(A|B) = conditional

P(r_{ui}|u,i,θ) = probability of this rating, given this user, item, and parameters. The | is "given".

02

𝔼 = weighted average

𝔼_{(u,i)~𝒟}[ℒ] = average loss over randomly sampled training pairs. Approximated by mini-batches in practice.

03

KL = gap between distributions

Q = what you have. P = what you want. KL = 0 means identical. In VAE-CF: keeps latent space organised.

Next: M1 · L7 — Putting It All Together: Full Equation Decoding
