M1 · Lesson 6 — Math Notation Literacy

Probability & Statistics Notation

VAE-CF, Bayesian MF, diffusion-based RS — entire model families
require reading P(·), 𝔼[·], and KL divergence fluently.

M1 · L6 — Core Notation

The building blocks

Probability notation
you must recognise

Notation	Read As	Meaning	RS Context
P(A)	"probability of A"	Likelihood of event A occurring	P(click) = probability user clicks an item
P(A|B)	"prob of A given B"	Conditional probability: A given that B occurred	P(r_{ui}|u,i,θ) = prob of rating given user, item, params
𝔼[X]	"expectation of X"	Weighted average of X over its distribution	𝔼_{(u,i)~𝒟}[ℒ] = expected loss over training data
x ~ P	"x distributed as P"	x is sampled from distribution P	z ~ 𝒩(0,I) = z drawn from standard Gaussian
𝒩(μ, σ²)	"Normal distribution"	Gaussian with mean μ and variance σ²	Prior over embeddings in Bayesian MF
KL(Q‖P)	"KL divergence from P to Q"	Gap between distributions Q and P	Regularisation in VAE-CF latent space
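To make the first two rows concrete, here is a minimal sketch that estimates P(click) and P(click | device = mobile) from counts. The impression log and the `device` field are hypothetical, invented purely for illustration.

```python
# Hypothetical impression log: (device, clicked) pairs. Not real data.
log = [
    ("mobile", True),
    ("mobile", False),
    ("desktop", True),
    ("mobile", True),
    ("desktop", False),
    ("desktop", False),
]

# P(click): fraction of all impressions that were clicked.
p_click = sum(clicked for _, clicked in log) / len(log)

# P(click | mobile): restrict to mobile impressions first, then take the fraction.
mobile_clicks = [clicked for device, clicked in log if device == "mobile"]
p_click_given_mobile = sum(mobile_clicks) / len(mobile_clicks)

print(p_click, p_click_given_mobile)
```

Conditioning on B simply means "look only at the cases where B happened", which is why the two numbers can differ.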

M1 · L6 — Expectation

𝔼[X] — the weighted average

Expectation — "on average,
what value does X take?"

\[ \mathbb{E}[X] = \sum_x x \cdot P(X = x) \]
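A quick worked example, assuming X is the roll of a fair six-sided die (an illustration that is not part of the lesson's RS setting):

\[ \mathbb{E}[X] = \sum_{x=1}^{6} x \cdot \tfrac{1}{6} = \tfrac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5 \]

Note that 3.5 is not a value the die can show. The expectation is an average, not the most likely outcome.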

The subscript on 𝔼 tells you what's being randomly sampled:

\[ \mathbb{E}_{(u,i)\sim\mathcal{D}}\left[\mathcal{L}(u, i, \theta)\right] \]

𝔼_{(u,i)~𝒟}

Sample (u,i) pairs randomly from training data 𝒟

[ℒ(u,i,θ)]

Compute the loss for that sampled pair

Plain English

The average loss you'd see if you randomly picked a training pair.

In practice: Computing 𝔼 exactly over all training data at every update step is too expensive, so we approximate it with mini-batches. SGD optimises the expected loss by sampling random batches and using each batch's average loss as an estimate.
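The mini-batch idea can be sketched numerically. The per-pair losses below are synthetic stand-ins (drawn from an exponential distribution purely for illustration), and the batch size of 256 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-pair losses for 10,000 hypothetical (u, i) training pairs.
losses = rng.exponential(scale=1.0, size=10_000)

# Expectation under uniform sampling from the training set: the plain mean.
full_mean = losses.mean()

# One mini-batch estimate: average the loss over 256 uniformly sampled pairs.
batch_idx = rng.choice(losses.size, size=256, replace=False)
batch_mean = losses[batch_idx].mean()

# Many independent batch estimates scatter around the full mean.
batch_means = np.array([
    losses[rng.choice(losses.size, size=256, replace=False)].mean()
    for _ in range(200)
])

print(f"full mean:            {full_mean:.4f}")
print(f"one batch estimate:   {batch_mean:.4f}")
print(f"avg of 200 estimates: {batch_means.mean():.4f}")
```

A single batch is a noisy estimate, but it is unbiased: averaged over many batches it lands on the full-data mean, which is why SGD can follow the expected loss without ever computing it in full.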

Common usage in RS papers

"We minimise the expected BPR loss 𝔼_{(u,i,j)~𝒟}[ℒ_BPR(u,i,j)]" is standard phrasing. The expectation does not by itself specify an optimiser, but in practice it is almost always estimated with mini-batch SGD over sampled (u,i,j) triples.

M1 · L6 — KL Divergence

The gap between two distributions

KL divergence —
how different are Q and P?

\[ D_{KL}(Q \| P) = \sum_x Q(x) \ln \frac{Q(x)}{P(x)} \]

KL = 0 means Q and P are identical
KL ≥ 0 always (never negative)
Not symmetric: D_KL(Q‖P) ≠ D_KL(P‖Q)
Q = what you have. P = what you want.

Plain English: KL = gap between your model's distribution and your target distribution. In VAE-CF: gap between encoder output and the Gaussian prior you want.
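The properties listed above can be checked on a small discrete example. This is a minimal sketch: the two distributions are made up, and the helper `kl_divergence` is a hypothetical name that assumes P(x) > 0 wherever Q(x) > 0.

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(Q || P) = sum_x Q(x) * ln(Q(x) / P(x)) for discrete distributions.

    Assumes P(x) > 0 wherever Q(x) > 0; terms with Q(x) = 0 contribute 0.
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Made-up distributions over three outcomes.
q = [0.7, 0.2, 0.1]        # "what you have"
p = [1 / 3, 1 / 3, 1 / 3]  # "what you want"

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
kl_qq = kl_divergence(q, q)

print(f"D_KL(Q||P) = {kl_qp:.4f}")
print(f"D_KL(P||Q) = {kl_pq:.4f}")
print(f"D_KL(Q||Q) = {kl_qq:.4f}")
```

Swapping the arguments changes the value, and comparing a distribution with itself gives exactly zero.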

VAE-CF context

P = 𝒩(0,I)

The PRIOR: your target distribution (the tidy Gaussian you want)

Q = q(z|x)

The approximate POSTERIOR: the distribution over z that your encoder actually produces for user x

D_KL(Q‖P)

How far the encoder output drifts from the target Gaussian

Why Gaussian prior? The KL between two Gaussians has a closed-form solution, making training tractable. It's a convenience choice, not a law — other priors are valid research contributions.
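For reference, the closed form can be written out, assuming the encoder outputs a diagonal Gaussian q(z|x) = 𝒩(μ, diag(σ²)) over d latent dimensions (a common choice, though not one this lesson states explicitly):

\[ D_{KL}\left(\mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \ln \sigma_j^2 \right) \]

Each term is zero exactly when μ_j = 0 and σ_j = 1, which is the "KL = 0 means Q and P are identical" case.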

M1 · L6 — VAE-CF ELBO

The most important probabilistic equation in RS

The VAE-CF ELBO —
fully decoded

\[ \mathcal{L}_{ELBO} = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\ln p(\mathbf{x}|\mathbf{z})\right] - \beta \cdot D_{KL}\left(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})\right) \]

Every symbol

𝔼_{q(z|x)}[·]

Average over latent vectors z sampled from the encoder

ln p(x|z)

Log-likelihood of reconstructing user interaction history x from latent vector z

D_KL(q‖p)

Penalty for the latent space drifting from the Gaussian prior

β

Hyperparameter controlling how strongly to enforce the prior

Plain English

Learn a compressed representation of each user that reconstructs their interaction history, while keeping the latent space close to a standard Gaussian.
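The two terms of the ELBO can be sketched in NumPy for a single user. Everything concrete here is an assumption made for illustration: the encoder outputs `mu` and `logvar` are hard-coded stand-ins, the decoder is a single random untrained weight matrix `W`, the likelihood is taken to be multinomial via a softmax over items, and `beta=0.2` is an arbitrary value.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, latent_dim = 6, 2

# Hypothetical interaction history x for one user (1 = interacted).
x = np.array([1, 0, 1, 1, 0, 0], dtype=float)

# Stand-ins for encoder outputs: q(z|x) = N(mu, diag(exp(logvar))).
mu = np.array([0.5, -0.3])
logvar = np.array([-1.0, -0.5])

# Stand-in for decoder weights (random and untrained).
W = rng.normal(scale=0.1, size=(latent_dim, n_items))

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def elbo(x, mu, logvar, W, beta, rng):
    # Sample z ~ q(z|x) via z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps
    # One-sample estimate of E_q[ln p(x|z)], assuming a multinomial
    # likelihood (up to an additive constant that does not depend on z).
    recon = float(np.sum(x * log_softmax(z @ W)))
    # Closed-form D_KL(q(z|x) || N(0, I)) for a diagonal Gaussian.
    kl = float(0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar))
    return recon - beta * kl, recon, kl

value, recon, kl = elbo(x, mu, logvar, W, beta=0.2, rng=rng)
print(f"reconstruction term: {recon:.4f}")
print(f"KL term:             {kl:.4f}")
print(f"ELBO:                {value:.4f}")
```

Training would maximise this quantity (equivalently, minimise its negative) with respect to the encoder and decoder parameters. A larger β pulls `mu` toward 0 and `logvar` toward 0 at the cost of reconstruction quality.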

M1 · L6 — Key Takeaways

What to remember

01

P(A|B) = conditional

P(r_{ui}|u,i,θ) = probability of this rating, given this user, item, and parameters. The | is "given".

02

𝔼 = weighted average

𝔼_{(u,i)~𝒟}[ℒ] = average loss over randomly sampled training pairs. Approximated by mini-batches in practice.

03

KL = gap between distributions

Q = what you have. P = what you want. KL = 0 means identical. In VAE-CF: keeps latent space organised.

Next: M1 · L7 — Putting It All Together: Full Equation Decoding
