VAE-CF, Bayesian MF, diffusion-based RS — entire model families
require reading P(·), 𝔼[·], and KL divergence fluently.
| Notation | Read As | Meaning | RS Context |
|---|---|---|---|
| P(A) | "probability of A" | Likelihood of event A occurring | P(click) = probability user clicks an item |
| P(A\|B) | "prob of A given B" | Conditional probability of A, given that B occurred | P(r_{ui}\|u,i,θ) = prob of rating given user, item, params |
| 𝔼[X] | "expectation of X" | Weighted average of X over its distribution | 𝔼_{(u,i)~𝒟}[ℒ] = expected loss over training data |
| x ~ P | "x distributed as P" | x is sampled from distribution P | z ~ 𝒩(0,I) = z drawn from standard Gaussian |
| 𝒩(μ, σ²) | "Normal distribution" | Gaussian with mean μ and variance σ² | Prior over embeddings in Bayesian MF |
| KL(Q‖P) | "KL divergence from P to Q" | Non-negative, asymmetric measure of how far Q is from P (zero only when they match) | Regularisation in VAE-CF latent space |
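The subscript on 𝔼 tells you what's being randomly sampled: in 𝔼_{(u,i)~𝒟}[ℒ], the user-item pairs (u,i) are drawn from the training data 𝒟, and the loss ℒ is averaged over those draws.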
In practice: Computing 𝔼 over all the data at every training step is too expensive, so we approximate it with mini-batches. SGD optimises the expected loss by sampling random batches and using each batch's average loss as a noisy estimate of the full expectation.
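To make the mini-batch approximation concrete, here is a minimal sketch using only the Python standard library. The per-example losses are made-up numbers standing in for a real training set, and `minibatch_estimates` is a hypothetical helper name, not part of any library.

```python
import random
import statistics

def minibatch_estimates(losses, batch_size, num_batches, seed=0):
    """Return the mean loss of each randomly sampled mini-batch."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(losses, batch_size))
        for _ in range(num_batches)
    ]

# Hypothetical per-example losses standing in for the full training set D.
data_rng = random.Random(42)
all_losses = [data_rng.uniform(0.0, 2.0) for _ in range(10_000)]

# The expectation over the data: the average loss across every example.
full_mean = statistics.fmean(all_losses)

# Each mini-batch mean is a noisy estimate of that expectation.
batch_means = minibatch_estimates(all_losses, batch_size=64, num_batches=500)

# Averaging many batch means smooths out the noise.
avg_of_batches = statistics.fmean(batch_means)

print(f"full-data mean loss : {full_mean:.4f}")
print(f"one mini-batch      : {batch_means[0]:.4f}")
print(f"mean of 500 batches : {avg_of_batches:.4f}")
```

A single batch mean will wobble around the full-data mean, while the average over many batches should land close to it. That is the sense in which each batch serves as an estimate of 𝔼.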
"We minimise the expected BPR loss 𝔼_{(u,i,j)~𝒟}[ℒ_BPR(u,i,j)]" — this is standard phrasing. The expectation signals they're doing mini-batch SGD.
Plain English: KL = gap between your model's distribution and your target distribution. In VAE-CF: gap between encoder output and the Gaussian prior you want.
Why Gaussian prior? The KL between two Gaussians has a closed-form solution, making training tractable. It's a convenience choice, not a law — other priors are valid research contributions.
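The closed form can be sketched for the case typically used in VAE-style models: a diagonal Gaussian Q = 𝒩(μ, σ²) measured against the standard Gaussian prior P = 𝒩(0, I), where KL(Q‖P) = ½ Σ (σ² + μ² − 1 − ln σ²). The function names below are hypothetical, and the log-variance parameterisation is an assumption rather than something the text specifies. A sampling-based estimate is included only as a sanity check on the formula.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def kl_monte_carlo_1d(mu, log_var, num_samples=20_000, seed=0):
    """Estimate the same KL in 1-D as E_{z~Q}[log q(z) - log p(z)] by sampling."""
    rng = random.Random(seed)
    sigma = math.exp(0.5 * log_var)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps  # z ~ N(mu, sigma^2)
        # log q(z) - log p(z); the -0.5*log(2*pi) terms cancel.
        total += -0.5 * log_var - 0.5 * eps * eps + 0.5 * z * z
    return total / num_samples

# Q identical to the prior: the divergence is zero.
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Q shifted to mean 1 with unit variance.
kl_shifted = kl_to_standard_normal([1.0], [0.0])

# Closed form versus sampling for mean 1 and variance 0.5.
kl_exact = kl_to_standard_normal([1.0], [math.log(0.5)])
kl_sampled = kl_monte_carlo_1d(1.0, math.log(0.5))

print(kl_same, kl_shifted)
print(f"closed form: {kl_exact:.4f}  sampled: {kl_sampled:.4f}")
```

The closed form needs no sampling at all, which is the tractability the Gaussian choice buys. With a prior that lacks such a formula, a sampling estimate like the second function would have to stand in for it.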
P(r_{ui}|u,i,θ) = probability of this rating, given this user, item, and parameters. The | is "given".
𝔼_{(u,i)~𝒟}[ℒ] = average loss over randomly sampled training pairs. Approximated by mini-batches in practice.
Q = what you have. P = what you want. KL = 0 means identical. In VAE-CF: keeps latent space organised.