Week 8  ·  Lesson 7

Collaborative Filtering

Teaching machines to predict what you will like next

We move beyond structured tabular data into the world of recommendation systems — one of the most commercially impactful applications of deep learning.

CP3501 – Deep Learning James Cook University Semester 1, 2025

Recap: Where We Have Been

Weeks 1–6

  • Transfer learning with images (ResNet, FastAI)
  • ML fundamentals — loss, metrics, overfitting
  • Gradient descent — the engine of learning
  • NLP with Transformers (Hugging Face)
  • Tabular deep learning (Titanic)
Week 7

Embeddings — turning discrete categories into dense numeric vectors. This is the key idea we build on today.

The common thread

Build It → Understand It → Apply It
  • We always start with working code
  • Then we open the hood
  • Then we reason about design choices
Today

Collaborative Filtering — a domain where embeddings are the entire model, not just a preprocessing step.

What is a Recommender System?

A recommender system predicts a user's preference or rating for items they have not yet encountered.

[Figure: a user's past ratings, clicks and history feed a recommender model that learns from all users. The model predicts a rating for each unseen item (Item A ★4.5, Item B ★2.1, Item C ★?) and recommends the top-ranked items.]

Real-world examples

  • Netflix — "Because you watched..."
  • Spotify — Discover Weekly playlist
  • Amazon — "Customers also bought"
  • YouTube — next video autoplay

Two main families

  • Content-based — use item features (genre, director…)
  • Collaborative Filtering — use patterns across many users
  • No item features needed — just the rating data

The Rating Matrix

Collaborative filtering starts with a user–item rating matrix. Each cell is a known rating; most cells are unknown.

User \ Movie   Inception   Toy Story   The Matrix   Frozen   Interstellar
Alice          5           ?           4            ?        5
Bob            ?           5           ?            4        ?
Carol          3           4           ?            5        2
Dave           ?           ?           5            ?        4
The core problem

Given the known ratings (numbers), predict the missing ones (?). This is a matrix completion problem.
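
As a minimal sketch of how this matrix might be held in code, the table can be stored as a NumPy array with `np.nan` standing in for each "?". The variable names are illustrative only.

```python
import numpy as np

users = ["Alice", "Bob", "Carol", "Dave"]
movies = ["Inception", "Toy Story", "The Matrix", "Frozen", "Interstellar"]

# np.nan marks an unknown rating (the "?" cells in the table)
R = np.array([
    [5,      np.nan, 4,      np.nan, 5],
    [np.nan, 5,      np.nan, 4,      np.nan],
    [3,      4,      np.nan, 5,      2],
    [np.nan, np.nan, 5,      np.nan, 4],
])

known = ~np.isnan(R)              # boolean mask of observed ratings
n_known = int(known.sum())
n_missing = int((~known).sum())
print(f"{n_known} known ratings, {n_missing} to predict")
```

Matrix completion means filling in the cells where the mask is False, using only the cells where it is True.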

Key insight

Alice and Dave both love Sci-Fi: both rate The Matrix and Interstellar highly. So Alice's 5 for Inception is a good signal for Dave's missing Inception rating.

The Sparsity Problem

In reality the matrix is extremely sparse. Netflix has ~200M users and ~15,000 titles. Each user rates a tiny fraction.

Sparsity example
  • 200,000,000 users × 15,000 titles = 3 trillion possible ratings
  • A user who rates 50 movies fills only about 0.33% of their row (50 / 15,000)
  • At that rate, more than 99.6% of the matrix is empty!
Why CF still works

Even sparse data contains strong patterns. Users who agree on rated items tend to agree on unrated ones too.
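
The sparsity arithmetic can be checked in a few lines. This is a back-of-envelope sketch that assumes every user rates exactly 50 titles, which is a simplification rather than a real Netflix statistic.

```python
n_users = 200_000_000
n_titles = 15_000
ratings_per_user = 50   # assumed, as in the sparsity example

total_cells = n_users * n_titles
row_fill = ratings_per_user / n_titles   # fraction of one user's row that is filled
empty_fraction = 1 - row_fill            # same for the whole matrix under this assumption

print(f"possible ratings: {total_cells:.1e}")
print(f"row filled: {row_fill:.2%}")
print(f"matrix empty: {empty_fraction:.2%}")
```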

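[Figure: sparse rating matrix with users on one axis and items on the other; only about 2% of the cells in the illustration hold a known rating.]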

Check Your Understanding

Knowledge Check 1
Collaborative Filtering makes predictions based on:
Knowledge Check 2
In a user–item rating matrix, most cells are empty. What does each empty cell represent?

Matrix Factorisation: The Core Idea

We decompose the rating matrix R into two smaller matrices: one for users, one for items.

R ≈ U × Vᵀ
  • R: users × items (sparse, large), e.g. 200M × 15K
  • U: user embeddings, users × k factors, e.g. 200M × 50
  • Vᵀ: item embeddings, k factors × items, e.g. 50 × 15K
Key insight

k = number of latent factors (e.g. k = 50, much smaller than the number of users or items). Each factor captures some hidden preference pattern.

User embedding row

A vector of k numbers that encodes Alice's taste profile across k hidden dimensions.

Item embedding column

A vector of k numbers encoding how much a movie "contains" each hidden dimension.
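
The shapes involved can be sketched with NumPy. The sizes below are small illustrative values (not the Netflix-scale ones) and the embeddings are random, so the predictions themselves are meaningless; only the shapes and counts matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 1000, 500, 50

U = rng.normal(scale=0.1, size=(n_users, k))   # one row per user
V = rng.normal(scale=0.1, size=(n_items, k))   # one row per item

R_hat = U @ V.T                                # every predicted rating at once
print(R_hat.shape)                             # (1000, 500)

full_cells = n_users * n_items                 # size of the full rating matrix
factor_params = U.size + V.size                # numbers stored in the factorised form
print(full_cells, factor_params)
```

Storing two thin matrices instead of one wide one is what makes the approach feasible when the full matrix has trillions of cells.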

Latent Factors: What Do They Capture?

We never label the latent dimensions — the model discovers them automatically. But after training we can interpret what they seem to represent.

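[Figure: two example latent factors, each drawn as an axis]
  • Factor 1: Sci-Fi / Action ←→ Family / Animation (Inception and The Matrix at one end, Frozen and Toy Story at the other)
  • Factor 2: Dark / Serious ←→ Light-hearted / Fun (Requiem and Inception at one end, Toy Story and Shrek at the other)
  • …and k−2 more factors discovered automatically. The model is never told what these axes mean.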
The magic of latent factors

Each user gets a k-dimensional vector. Each item gets a k-dimensional vector. A user who scores high on "Sci-Fi" will be matched with movies that also score high on "Sci-Fi" — even though neither label was ever given to the model.

Predicting Ratings: The Dot Product

To predict how much user u will like item i, we compute the dot product of their embedding vectors.

$$\hat{r}_{ui} = \mathbf{u}_u \cdot \mathbf{v}_i = \sum_{k} u_{uk}\, v_{ik}$$
Alice (user vector, k = 3 factors): [0.8, −0.3, 0.6]
Inception (item vector, k = 3 factors): [0.7, 0.5, 0.9]
Alice · Inception = (0.8×0.7) + (−0.3×0.5) + (0.6×0.9) = 0.56 − 0.15 + 0.54 = 0.95 → predicted ★4.75 (0.95 of the 5-star maximum)
High dot product

User and item point in similar directions — they share common "tastes". High predicted rating.

Low (or negative) dot product

User and item point in different directions — mismatched preferences. Low predicted rating.
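
The worked example can be reproduced directly. The `mismatch` vector is a made-up item, added only to show a negative score.

```python
import numpy as np

alice = np.array([0.8, -0.3, 0.6])        # user vector from the worked example
inception = np.array([0.7, 0.5, 0.9])     # item vector from the worked example
mismatch = np.array([-0.7, 0.5, -0.9])    # hypothetical item pointing the other way

score = float(alice @ inception)
low_score = float(alice @ mismatch)

print(round(score, 2))       # 0.95
print(round(low_score, 2))   # -1.25
```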

Learning Embeddings via Gradient Descent

The embedding values are just learnable parameters — exactly like weights in a neural network.

Training loop

  • Forward: look up user embedding + item embedding → dot product → predicted rating
  • Loss: compare prediction to actual rating (MSE)
  • Backward: compute gradients w.r.t. embedding values
  • Update: nudge embeddings to reduce loss
$$\text{Loss} = \text{MSE} = \frac{1}{N} \sum_{(u,i)} \left(r_{ui} - \hat{r}_{ui}\right)^2$$
What gets learned?
  • All user embedding vectors (U matrix)
  • All item embedding vectors (V matrix)
  • No other model weights — this is the whole model
Number of parameters

If there are n users, m items and k factors:

n×k + m×k parameters total

e.g. 1000 users, 500 items, k=50 → 75,000 params
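
The four training steps can be sketched in plain NumPy on the known cells of the toy rating matrix. This is a hand-written stochastic gradient descent loop for illustration, not what fastai runs internally; the learning rate, epoch count and k = 3 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# (user index, item index, rating) triples: the known cells of the toy matrix
ratings = [(0, 0, 5), (0, 2, 4), (0, 4, 5),
           (1, 1, 5), (1, 3, 4),
           (2, 0, 3), (2, 1, 4), (2, 3, 5), (2, 4, 2),
           (3, 2, 5), (3, 4, 4)]
n_users, n_items, k = 4, 5, 3

U = rng.normal(scale=0.1, size=(n_users, k))   # user embeddings
V = rng.normal(scale=0.1, size=(n_items, k))   # item embeddings

def mse(U, V):
    """Mean squared error over the known ratings only."""
    return float(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings]))

lr = 0.01
initial_loss = mse(U, V)
for epoch in range(500):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]        # forward: dot product, then error
        grad_u = -2 * err * V[i]     # backward: d(err^2) / dU[u]
        grad_v = -2 * err * U[u]     # backward: d(err^2) / dV[i]
        U[u] -= lr * grad_u          # update: nudge both embeddings
        V[i] -= lr * grad_v
final_loss = mse(U, V)
print(f"MSE {initial_loss:.3f} -> {final_loss:.3f}")
```

The only things that change during training are the rows of U and V, which is the point of the parameter count above.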

Adding Bias Terms

Some users always rate high (generous). Some movies always rate low (niche). We capture this with bias terms.

$$\hat{r}_{ui} = \mathbf{u}_u \cdot \mathbf{v}_i + b_u + b_i$$

User bias $b_u$

  • Alice tends to give 4–5 stars → high $b_u$
  • Bob tends to give 1–2 stars → low $b_u$
  • Captures overall generosity / harshness

Item bias $b_i$

  • The Godfather gets high ratings from everyone → high $b_i$
  • A niche documentary gets lower average ratings → lower $b_i$
Why bias matters

Without bias, a generous user would need abnormally large embedding values just to account for their rating habit — confusing the latent factor signal.
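
A small sketch of the effect: two users with identical taste vectors but different bias values receive different predictions for the same item. The bias numbers are hypothetical, and `predict` is an illustrative helper rather than a fastai function.

```python
import numpy as np

def predict(u_vec, v_vec, b_user, b_item):
    """Dot product of the embeddings plus the two scalar biases."""
    return float(u_vec @ v_vec + b_user + b_item)

u = np.array([0.8, -0.3, 0.6])   # same taste vector for both users
v = np.array([0.7, 0.5, 0.9])    # item vector from the dot-product example

generous = predict(u, v, b_user=0.5, b_item=0.2)    # hypothetical bias values
harsh = predict(u, v, b_user=-0.5, b_item=0.2)
print(round(generous, 2), round(harsh, 2))          # 1.65 0.65
```

The embeddings stay free to describe taste, while the scalar biases absorb rating habits.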

FastAI handles this automatically

collab_learner(dls, n_factors=50, y_range=(0,5.5), use_nn=False) includes bias by default.

FastAI: Loading the MovieLens Dataset

FastAI includes the MovieLens 100K dataset — 100,000 ratings from 943 users on 1,682 movies. Perfect for in-class training.

from fastai.collab import *
from fastai.tabular.all import *

# Download (on first use, then cached) the full MovieLens 100K dataset
path = untar_data(URLs.ML_100k)
# For a quicker run, URLs.ML_SAMPLE is a much smaller subset; it ships
# ratings.csv (userId, movieId, rating, timestamp) instead of u.data

# u.data is tab-separated with no header row: user, movie, rating, timestamp
ratings = pd.read_csv(path/'u.data', delimiter='\t',
                      header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()
What we need
  • A column of user IDs
  • A column of item IDs
  • A column of ratings (our target)
  • That is all CF needs — no item features!
Note on Colab

URLs.ML_SAMPLE is the safest option for in-class use — it downloads quickly and trains in seconds.

FastAI: Building the DataLoader

# CollabDataLoaders handles the embedding index lookup for you
dls = CollabDataLoaders.from_df(
    ratings,
    user_name='user',        # column of user IDs
    item_name='movie',       # column of item IDs
    rating_name='rating',    # column of ratings (target)
    valid_pct=0.2,           # 20% validation split
    seed=42
)

dls.show_batch()   # preview: user, movie, rating rows
What CollabDataLoaders does internally
  • Assigns a contiguous integer index to each unique user ID
  • Assigns a contiguous integer index to each unique movie ID
  • These indices are used to look up the embedding row for each user/item
  • The original IDs (e.g. userId=874) are not used directly — the index is
Embedding lookup: E[index] → returns the k-dimensional vector for that user/item
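
The index mapping and lookup can be imitated with NumPy. This is a sketch of the idea only; fastai's internal mapping differs in detail, and the raw IDs below are made up.

```python
import numpy as np

raw_user_ids = np.array([874, 12, 874, 305, 12])   # hypothetical raw IDs, one per rating row

# np.unique returns the sorted unique IDs and, for each row, its index into them
unique_ids, user_idx = np.unique(raw_user_ids, return_inverse=True)
print(unique_ids)   # [ 12 305 874]
print(user_idx)     # [2 0 2 1 0]

k = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(len(unique_ids), k))   # one embedding row per unique user

batch_vectors = E[user_idx]                 # embedding lookup for the whole batch
print(batch_vectors.shape)                  # (5, 4)
```

Rows 0 and 2 of `batch_vectors` are identical because both belong to raw ID 874.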

FastAI: Training the Collaborative Filter

# Create a matrix factorisation learner (dot-product model)
learn = collab_learner(
    dls,
    n_factors=50,        # k = 50 latent dimensions
    y_range=(0, 5.5),    # squash predictions into the rating range (scaled sigmoid)
    wd=0.1              # weight decay (L2 regularisation)
)

# Find a good learning rate
learn.lr_find()

# Train with 1-cycle policy
learn.fit_one_cycle(5, 5e-3, wd=0.1)
y_range explained

Applies a sigmoid scaled to (0, 5.5), so predictions can never fall below 0 or rise above 5.5. We use 5.5 rather than 5.0 because a sigmoid only approaches its upper limit: with an upper bound of 5.0 the model could never quite predict a 5-star rating.
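
The scaled sigmoid can be written out in plain Python to see the effect of the upper bound. This mirrors the formula (sigmoid, then stretch to the range); it is a re-implementation for illustration, not a call into fastai.

```python
import math

def sigmoid_range(x, low, high):
    """Squash a raw score x into the open interval (low, high)."""
    return low + (high - low) / (1 + math.exp(-x))

# Compare the two upper bounds for a few raw scores
for raw in (0.0, 2.0, 5.0, 10.0):
    print(raw,
          round(sigmoid_range(raw, 0, 5.0), 4),
          round(sigmoid_range(raw, 0, 5.5), 4))

# Raw score needed to predict exactly 5 stars when the upper bound is 5.5
needed = math.log((5.0 - 0) / (5.5 - 5.0))
print(round(needed, 4))
```

With an upper bound of 5.5, a modest raw score is enough to reach 5 stars; with an upper bound of 5.0, no finite raw score reaches it.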

Weight decay

Prevents embedding values from growing too large — a key regularisation technique for CF. Without it, the model can memorise training ratings.
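
Conceptually, L2 regularisation adds a penalty that grows with the size of the embedding values. The sketch below shows that idea as an explicit loss term. It is an illustration only: fastai applies weight decay inside its optimiser rather than through a function like this, and `penalised_loss` is a hypothetical helper.

```python
import numpy as np

def penalised_loss(errors, U, V, wd):
    """MSE plus an L2 penalty on all embedding values."""
    mse = np.mean(np.square(errors))
    penalty = wd * (np.sum(U ** 2) + np.sum(V ** 2))
    return float(mse + penalty)

errors = np.array([0.5, -0.5])     # made-up prediction errors
U_small = np.ones((2, 2))          # modest embedding values
U_large = 10 * np.ones((2, 2))     # same fit assumed, but much larger values
V = np.zeros((2, 2))

small = penalised_loss(errors, U_small, V, wd=0.1)
large = penalised_loss(errors, U_large, V, wd=0.1)
print(small, large)
```

For the same prediction errors, the larger embeddings are penalised far more heavily, which pushes training towards smaller values.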

Check Your Understanding

Knowledge Check 3
What is n_factors in collab_learner?
Knowledge Check 4
Why do we set y_range=(0, 5.5) rather than (0, 5.0) when the highest rating is 5?
Try again

Interpreting Learned Embeddings

After training, we can extract and analyse the embedding vectors to understand what the model has learned.

# Extract item (movie) embeddings as a plain CPU tensor
movie_emb = learn.model.i_weight.weight.detach().cpu()   # shape: (n_movies, n_factors)

# Project onto the first 3 principal components for inspection
movie_pca = movie_emb.pca(3)
fac0, fac1, fac2 = movie_pca.t()

# List the movies with the lowest / highest values on the first component
# (the key 'movie' matches item_name above; entries are movie IDs unless
# titles were merged into the ratings table)
idxs = fac0.argsort()
[learn.dls.classes['movie'][i] for i in idxs[:5]]     # lowest
[learn.dls.classes['movie'][i] for i in idxs[-5:]]    # highest
What you typically find
  • Factor extremes often align with recognisable genres or quality levels
  • Movies close together in embedding space tend to be recommended interchangeably
  • This is latent factor discovery — no labels were ever given
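
One way to explore "closeness" in embedding space is cosine similarity between item vectors. The sketch below uses made-up 2-factor embeddings rather than values from a trained model, and `most_similar` is an illustrative helper.

```python
import numpy as np

titles = ["Inception", "The Matrix", "Toy Story", "Frozen"]
# Hypothetical 2-factor embeddings, invented for illustration
emb = np.array([
    [0.9, 0.6],
    [0.8, 0.5],
    [-0.7, -0.4],
    [-0.8, -0.6],
])

def most_similar(idx, emb):
    """Index of the item whose embedding has the highest cosine similarity to item idx."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf            # exclude the item itself
    return int(np.argmax(sims))

print(titles[most_similar(0, emb)])   # The Matrix
print(titles[most_similar(3, emb)])   # Toy Story
```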

Neural Collaborative Filtering

Instead of a simple dot product, we can concatenate the embeddings and pass them through a neural network.

[Figure: Neural CF architecture. User embedding (k dims) and item embedding (k dims) → concat (2k dims) → Linear + ReLU (128 units) → Linear + ReLU (64 units) → Sigmoid → rating scaled to 0–5.5]
# Switch to neural CF with use_nn=True
learn_nn = collab_learner(dls, use_nn=True,
                          emb_szs={'user': 50, 'movie': 50},  # keys must match the column names
                          layers=[128, 64],  # MLP hidden layer sizes
                          y_range=(0, 5.5))
Dot product vs neural CF

The dot product model is simpler and often competitive. Neural CF adds capacity to learn non-linear interaction patterns, but needs more data and careful tuning.
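
The forward pass of such a network can be sketched in NumPy to make the shapes concrete. The weights are random and untrained, so the output is meaningless as a rating; fastai's actual model contains additional layers, so treat this as a simplified picture of the architecture only.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50

def relu(x):
    return np.maximum(x, 0)

def sigmoid_range(x, low, high):
    return low + (high - low) / (1 + np.exp(-x))

# Randomly initialised (untrained) weights, for shape-checking only
W1, b1 = rng.normal(scale=0.1, size=(2 * k, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.1, size=(128, 64)), np.zeros(64)
W3, b3 = rng.normal(scale=0.1, size=(64, 1)), np.zeros(1)

def forward(user_vec, item_vec):
    x = np.concatenate([user_vec, item_vec])   # 2k dims
    h1 = relu(x @ W1 + b1)                     # 128 units
    h2 = relu(h1 @ W2 + b2)                    # 64 units
    raw = (h2 @ W3 + b3)[0]                    # single raw score
    return float(sigmoid_range(raw, 0, 5.5))   # squashed into (0, 5.5)

rating = forward(rng.normal(size=k), rng.normal(size=k))
print(rating)
```

Unlike the dot product, the hidden layers let the two embeddings interact in ways that are not limited to matching factor by factor.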

Limitations: The Cold Start Problem

New user cold start

A brand-new user has no ratings. Their embedding is random (or zeroed). The model cannot make good predictions yet.

Solutions: ask for initial ratings ("onboarding"), use demographic fallback, or use popular-item heuristics.

New item cold start

A newly added movie has no ratings, so no meaningful embedding can be learned. Until ratings arrive, the model has no basis for recommending it, even if it is excellent.

Solutions: use content features to bootstrap, or hybrid content + CF approaches.
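
A popularity-style fallback can be sketched as a simple lookup chain. All names and numbers below are hypothetical; a production system would more likely use one of the hybrid approaches mentioned above.

```python
import numpy as np

def predict_with_fallback(user, item, user_vecs, item_vecs, item_means, global_mean):
    """Use embeddings when both sides are known; otherwise fall back to averages."""
    if user in user_vecs and item in item_vecs:
        return float(user_vecs[user] @ item_vecs[item])
    if item in item_means:       # new user, known item: use the item's average rating
        return item_means[item]
    return global_mean           # new item: nothing item-specific is known yet

user_vecs = {"alice": np.array([1.0, 2.0])}        # made-up learned embeddings
item_vecs = {"inception": np.array([2.0, 1.0])}
item_means = {"inception": 4.2}                    # made-up average rating
global_mean = 3.5

known = predict_with_fallback("alice", "inception", user_vecs, item_vecs, item_means, global_mean)
new_user = predict_with_fallback("zoe", "inception", user_vecs, item_vecs, item_means, global_mean)
new_item = predict_with_fallback("alice", "new_movie", user_vecs, item_vecs, item_means, global_mean)
print(known, new_user, new_item)   # 4.0 4.2 3.5
```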

[Flowchart: Is this a new user or item? If no (it has ratings), use the learned embedding. If yes, this is the cold start problem: fall back to a hybrid model or a popularity heuristic.]

Ethics in Recommender Systems

Recommender systems operate at massive scale — their design decisions affect millions of people.

Filter bubbles

  • CF amplifies existing preferences — users see more of what they already like
  • Can reinforce narrow worldviews, limit exposure to diverse content
  • Related: news recommendation and political polarisation

Popularity bias

  • Popular items accumulate more ratings → better embeddings → more recommendations
  • Niche content is systematically under-recommended
  • Disadvantages new creators and minority-interest content
Engagement vs wellbeing

Optimising for clicks or watch-time is not the same as optimising for user wellbeing. Models can learn to exploit psychological biases (outrage, fear) because they drive engagement metrics.

Privacy

CF requires storing detailed behavioural data about every user. Even "anonymous" IDs can be re-identified from rating patterns.

SLO4 link

Always document what your recommendation objective optimises, and what harms may result from that choice.

Week 8 Summary

Key concepts today

  • Collaborative Filtering — predict ratings from shared user-item patterns
  • Rating matrix — sparse, most values unknown
  • Matrix factorisation — decompose into user × item embeddings
  • Latent factors — hidden dimensions learned from data
  • Dot product — similarity between user and item embeddings
  • Bias terms — capture global generosity / quality effects
  • Neural CF — replace dot product with an MLP
  • Cold start — the fundamental limitation of CF

FastAI tools used

  • CollabDataLoaders.from_df()
  • collab_learner()
  • learn.model.i_weight.weight
  • use_nn=True for neural CF
Next up — Practical Workshop

You will train a CF model on MovieLens, inspect the learned embeddings, and compare dot-product vs neural CF performance.

Coming up — Week 9

Convolutional Neural Networks — how spatial structure in images is captured by learned filters.