CP3501 Deep Learning · Week 5

Natural Language Processing & Transformers

From text to predictions — using pretrained models to solve real-world NLP tasks

FastAI Lesson 4 · Hugging Face Transformers · JCU Brisbane
1 / 40
Recap

Where We've Been

The Deep Learning Recipe

Every deep learning task we've seen shares the same core loop — only the input data changes.

Raw Data
images, tables, or text
Preprocess
turn data into numbers
Model
learns patterns from numbers
Prediction
a useful output

The key insight: The training loop — forward pass, compute loss, backpropagate, update weights — is identical whether you're classifying cat photos or analysing legal documents.

2 / 40
Recap

Data Types

Three Modalities, Same Principles

Data type | Raw form | Preprocessing | Model type
Vision | Image files | Resize → pixel values | CNN
Tabular | CSV rows | Normalise, one-hot encode | Linear + ReLU layers
Text ← Today | Sentences | Tokenise → integer IDs | Transformer

Same kind of loss function, same gradient descent, same optimiser. Text just needs a different kind of preprocessing before the numbers go in.

3 / 40
Library Switch

Tools

Why We're Switching to Hugging Face

fastai (Lessons 1–3)

  • Very high-level — one function does a lot
  • Great for vision and tabular data
  • Ideal for building intuition quickly

🤗 Hugging Face (Today)

  • Mid-level — you see each step clearly
  • Best-in-class NLP models
  • The industry standard for text AI

Why both matter: Seeing the same ideas expressed in two different libraries deepens understanding. fastai integrates Hugging Face models anyway — skills transfer both ways.

4 / 40
Library Switch

Translation Guide

fastai → Hugging Face: What Maps to What

Everything you learned still applies. The concepts have new names.

fastai concept | Hugging Face equivalent | What it does
DataBlock | Dataset + .map() | Prepares and transforms data
Learner | Trainer | Bundles model, data, loss, metrics
fit_one_cycle() | TrainingArguments (cosine + warmup) | Runs the training loop
learn.predict() | trainer.predict() | Gets predictions from a trained model
fine_tune() | trainer.train() | Adapts pretrained weights to your task
5 / 40
NLP Foundations

What Is NLP?

Teaching Computers to Work with Language

Natural Language Processing (NLP) is the field of making computers understand, generate, and work with human text and speech.

Tasks NLP can solve

  • Is this review positive or negative?
  • Is this email spam?
  • Do these two sentences mean the same thing?
  • Who wrote this document?

Why it's hard

  • The same word means different things in different contexts
  • Sarcasm, idioms, abbreviations
  • Computers need numbers — not words
6 / 40
NLP Foundations

The Pipeline

How Text Becomes a Prediction

Every NLP model follows this sequence. We'll build each step today.

Raw Text | "table leg"
Tokeniser → integer IDs | [1, 54453, 435…]
Embedding Matrix (learned) | [128 × 768]
Transformer Layers | [128 × 768]
Classifier Head | [1 score]

The numbers in brackets show the shape of the data as it flows through. 128 = sequence length. 768 = hidden size of BERT-base.

7 / 40
NLP Foundations

Preprocessing

Why Text Needs Tokenisation

Neural networks only work with numbers. We need a reliable way to convert any string of text into a fixed-length list of integers. This process is called tokenisation.

1. Split the text into pieces called tokens
   A token is roughly a word, part of a word, or a punctuation mark.
2. Look up each token's ID in the vocabulary
   Every model has a vocabulary: a dictionary mapping token → number.
3. Pad or truncate to a fixed length
   All inputs must be the same length so they can be processed in batches.
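The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration: the vocabulary, the `[PAD]` and `[UNK]` IDs, and the whitespace split are assumptions made for the example, not how a real pretrained tokeniser works internally.

```python
# Toy vocabulary (hypothetical IDs, not taken from any real model).
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "table": 2, "leg": 3, "supporting": 4, "member": 5}


def toy_tokenise(text, max_len=6):
    """Split on whitespace, look up IDs, then pad or truncate to max_len."""
    tokens = text.lower().split()                                   # step 1: split
    ids = [TOY_VOCAB.get(t, TOY_VOCAB["[UNK]"]) for t in tokens]    # step 2: look up
    ids = ids[:max_len]                                             # step 3a: truncate
    ids += [TOY_VOCAB["[PAD]"]] * (max_len - len(ids))              # step 3b: pad
    return ids


short_ids = toy_tokenise("table leg")
long_ids = toy_tokenise("supporting member of a table leg and more")
print(short_ids)
print(long_ids)
```

Both results have exactly `max_len` entries, which is what lets inputs of different lengths be stacked into one batch.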
8 / 40
NLP Foundations

Vocabulary Design

The Problem with Using Whole Words

There are over a million English words. If each word gets its own ID, the vocabulary becomes impossibly large.

Character-level

Split into individual letters.
Tiny vocabulary (~100) but loses meaning — 'c','a','t' doesn't tell you it's an animal.

Word-level

Keep whole words. "Unbelievable" gets its own slot. Problem: new or rare words have no ID — out-of-vocabulary.

Sub-word ✓ Best

Split at meaningful boundaries. "unbelievable" → "un" + "believ" + "able". Handles new words, manageable vocabulary (~30k).

Modern NLP uses sub-word tokenisation. BERT uses WordPiece. GPT uses BPE (Byte-Pair Encoding). Both give small vocabularies that handle any text.
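To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match splitting, which is roughly how a WordPiece-style tokeniser applies an already-learned vocabulary. The tiny piece set `TOY_SUBWORDS` is invented for this example, and the sketch does not show how such a vocabulary is learned.

```python
# Hypothetical sub-word vocabulary, chosen only so the examples split nicely.
TOY_SUBWORDS = {"un", "believ", "able", "token", "is", "ation"}


def split_subwords(word, vocab=TOY_SUBWORDS):
    """Greedily take the longest known piece at each position.

    Falls back to a single character when no known piece matches,
    so any word can always be split.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece (or empty).
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:            # nothing matched: emit one character
            pieces.append(word[start])
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces


print(split_subwords("unbelievable"))
print(split_subwords("tokenisation"))
print(split_subwords("cat"))
```

The character fallback is why a sub-word tokeniser never hits a true out-of-vocabulary word: an unfamiliar word just becomes a longer sequence of smaller pieces.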

9 / 40
NLP Foundations

Sub-word Demo

Watching the Tokeniser Work

The underscore ( _ ) marks the start of a new word; a token without one continues the previous word. (SentencePiece-based tokenisers print this marker as ▁, U+2581, which looks like an underscore.)

Input → "Good AI from fast.ai"

_Good  _A  I  _from  _fast  .  ai

Input → "A platypus is an ornithorhynchus"

_A  _platypus  _is  _an  _or  nith  or  hyn  chus

Those tokens become integer IDs

1, 54453, 435, 294, 336, 5753, 346, 2
10 / 40
Pretraining

How BERT Learned Language

The Masked Language Model Trick

BERT was pretrained on roughly 3.3 billion words of text from English Wikipedia and books, with zero human labels. Here's how:

1. Take a normal sentence
   "The patient was admitted to the hospital with chest pain."
2. Randomly hide 15% of the words
   "The patient was admitted to the [MASK] with chest pain."
3. Train the model to predict the hidden words
   To predict "hospital", the model must learn grammar, facts, and context.

Result: After training on billions of words, BERT's weights encode rich knowledge about language, facts, and meaning, ready to be fine-tuned for your specific task.
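The masking step can be sketched as follows. This is a simplified illustration: it masks whole whitespace-separated words and always substitutes `[MASK]`, whereas real BERT pretraining works on sub-word tokens and sometimes keeps or randomly replaces the chosen token. The function name and the fixed seed are assumptions for the example.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the tokens and remember the answers."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}                      # position -> original token to predict
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, targets


sentence = "The patient was admitted to the hospital with chest pain".split()
masked, targets = mask_tokens(sentence)
print(" ".join(masked))
print(targets)
```

The `targets` dictionary plays the role of the labels: they come from the text itself, which is why no human annotation is needed.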

11 / 40
Pretraining

Fine-tuning

Small, Focused Edits — Not Starting from Scratch

Fine-tuning takes a pretrained model and gently adjusts its weights for your specific task. Most weights barely change.

Training from scratch

  • All weights start random
  • Needs millions of labelled examples
  • Takes days or weeks to train
  • Requires huge compute budget

Fine-tuning ✓ What we do

  • Weights start already good from pretraining
  • Works with a few thousand examples
  • Takes minutes on a free GPU
  • State-of-the-art results
12 / 40
Today's Task

The Problem

Kaggle: US Patent Phrase-to-Phrase Matching

A real competition where a company wants to automatically measure how similar two patent phrases are. This is exactly the kind of real-world NLP task you'll encounter in industry.

Anchor phrase:  "table leg"

Target phrase:  "supporting member"

Context (patent section):  A47B (Furniture)

Goal — predict similarity score:  0 = different, 0.5 = related, 1.0 = identical meaning

13 / 40
Today's Task

The Data

What's in the Dataset?

Column | Example | Meaning
anchor | abatement | The first phrase to compare
target | act of abating | The second phrase
context | A47 | Patent classification section
score | 0.5 | Human-labelled similarity (0–1)

36,473 rows — but only 733 unique anchor phrases. Many targets paired with the same anchor.

106 unique contexts — the same phrase might mean different things in different patent sections.

14 / 40
Today's Task

Problem Framing

Turning Similarity into Classification

BERT is excellent at reading a single input and producing a score. We need to turn our three-column problem (anchor, target, context) into one input string.

1. Concatenate all three fields into one string
   Add artificial labels so the model can tell fields apart.
   TEXT1: A47; TEXT2: act of abating; ANC1: abatement
2. The model reads this as a standard text classification problem
   Output: a single number between 0 and 1 representing similarity.

This is a common NLP pattern: Creative problem framing lets you apply a general-purpose model to any specific task.

15 / 40
Today's Task

Real-World Applications

Why Text Classification Powers Modern NLP

The same pipeline we're building today solves all of these problems.

Application | Input (the "document") | Output (the "class")
Sentiment analysis | Movie review | Positive / Negative
Spam detection | Email message | Spam / Not spam
Email triage | Customer message | Sales / Support / Complaint
Legal discovery | Legal document | In-scope / Out-of-scope
Phrase matching ← Today | Anchor + target + context | Similarity score 0–1
16 / 40
Building the Pipeline

Load the Data and Have a Look

Always start by exploring your data. Use pandas to load the CSV and check what's inside before writing any model code.

1. Load the CSV file
   pandas read_csv() gives you a table (called a DataFrame) of all 36k rows.
2. Look at the first few rows
   df.head(): are the column names what you expect? Do the values make sense?
3. Check for patterns
   How many unique anchors? Any missing values? What's the score distribution?

Rule of thumb: Spend at least as long exploring your data as building your model. Surprises in the data are the most common cause of poor results.
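A minimal sketch of these exploration steps with pandas is shown below. To keep it self-contained it reads a four-row in-memory stand-in rather than the competition file; the rows and scores are illustrative only, and with the real data you would pass the path of the downloaded CSV to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the competition CSV (illustrative rows, not real data).
SAMPLE_CSV = io.StringIO(
    "anchor,target,context,score\n"
    "abatement,abatement of pollution,A47,0.5\n"
    "abatement,act of abating,A47,0.5\n"
    "abatement,active catalyst,A47,0.25\n"
    "table leg,supporting member,A47,0.5\n"
)

df = pd.read_csv(SAMPLE_CSV)

print(df.head())                                   # do columns and values look right?
n_anchors = df["anchor"].nunique()                 # how many unique anchors?
n_missing = int(df.isna().sum().sum())             # any missing values?
score_counts = df["score"].value_counts().sort_index()  # score distribution

print(n_anchors, n_missing)
print(score_counts)
```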

17 / 40
Building the Pipeline

Build the Input String

We combine anchor, target, and context into a single text column so BERT can read the whole thing at once.

What we create

df['input'] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; ANC1: " + df.anchor

First 3 rows of the new input column

TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement
TEXT1: A47; TEXT2: act of abating; ANC1: abatement
TEXT1: A47; TEXT2: active catalyst; ANC1: abatement

The artificial labels (TEXT1:, TEXT2:, ANC1:) help the model tell the three fields apart without extra code. This is prompt engineering.

18 / 40
Building the Pipeline

Convert to a Hugging Face Dataset

Hugging Face has its own Dataset format that's optimised for NLP tasks — it's faster and supports efficient batched transformations.

pandas DataFrame

  • Great for exploration and cleaning
  • Loaded entirely into memory
  • Use for the early data wrangling steps

HF Dataset ← Switch here

  • Memory-mapped — handles large files
  • Vectorised transforms — tokenise 36k rows in 6 seconds
  • Required by the Trainer class

One line converts between them: Dataset.from_pandas(df)

19 / 40
Building the Pipeline

Choose a Pretrained Model First

This decision must come before tokenisation — because the tokeniser must match the vocabulary of the pretrained model you choose.

⚠️ Common mistake: Tokenising with the wrong vocabulary. Always load the tokeniser that belongs to your chosen checkpoint.

1. Search Hugging Face Hub
   There are ~44,000 pretrained checkpoints. Search "patent" for domain-specific models.
2. Good general-purpose default for this task
   microsoft/deberta-v3-small
3. For production: search for domain-specific models
   A model pretrained on patent text will outperform a general one.
20 / 40
Building the Pipeline

Load the Matching Tokeniser

AutoTokenizer is Hugging Face's smart loader — you give it the checkpoint name and it downloads the correct vocabulary and tokenisation rules automatically.

Loading the tokeniser

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "microsoft/deberta-v3-small"
)

This downloads the vocabulary + rules used when the original model was pretrained. Using a different tokeniser would be like giving the model a text in a different alphabet.

21 / 40
Building the Pipeline

Tokenise the Whole Dataset

We apply our tokeniser to every row using .map() — Hugging Face's vectorised transform that processes all 36k rows in ~6 seconds.

What the tok_func does to each row:

  • Takes the input string we built in Step 2
  • Splits it into tokens, looks up each ID in the vocabulary
  • Pads short sequences to length 128 with a [PAD] token
  • Truncates long sequences at 128 tokens
  • Returns input_ids and attention_mask

attention_mask: A list of 1s and 0s telling the model which tokens are real words (1) and which are padding (0). Padding tokens should be ignored during learning.
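The slides do not show the body of tok_func, so the following is only a plain-Python sketch of what padding, truncation, and the attention mask look like for a small batch. The token IDs are arbitrary, `PAD_ID = 0` is an assumption, and a real tokeniser produces these fields for you.

```python
PAD_ID = 0  # assumption: the real padding ID depends on the checkpoint's vocabulary


def pad_batch(batch_ids, max_len, pad_id=PAD_ID):
    """Pad or truncate each sequence and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_len]                       # truncate long sequences
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)  # pad short sequences
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch(
    [[1, 54453, 435, 2], [1, 294, 336, 5753, 346, 2, 7, 8, 9]],
    max_len=8,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Every row ends up the same length, and the mask records which positions the model should attend to.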

22 / 40
Building the Pipeline

Tokenised Row

What One Tokenised Example Looks Like

Let's inspect row 0 after tokenisation to see exactly what the model receives.

Original text

'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement'

input_ids (first 8 of 128)

1, 54453, 435, 294, 336, 5753, 346, 54453, … (128 total)

Token ID 1 = [CLS] (start of sequence). Token ID 2 = [SEP] (separator). These special tokens tell the model where the input begins and ends.

23 / 40
Building the Pipeline

Set Up the Labels

Hugging Face's Trainer expects the target column to be literally named "labels". Our column is named "score" — so we rename it.

  • The score column contains values like 0, 0.25, 0.5, 0.75, 1.0
  • After renaming it becomes the labels column
  • The Trainer reads labels automatically and uses them to compute loss

Binary / multi-class / regression all use the same "labels" key. The model's output size (num_labels) determines whether it's treated as classification or regression.

Since our scores are continuous (0.0 to 1.0), this is a regression problem. We set num_labels=1 when building the model.

24 / 40
Conceptual Interlude

Before we run any training, we need to understand how to measure whether our model is actually learning — or just memorising.

Why We Need a Validation Set

If you only measure performance on the data the model trained on, you'll always see improvement — even if the model is useless on new data.

The core problem: Training loss almost always goes down. But that doesn't mean the model is learning anything generalisable. It might just be memorising the training data.

Solution: Hold back a portion of data that the model never sees during training. Measure performance on this validation set to detect overfitting early.

25 / 40
Validation Theory

Overfitting vs Underfitting

The Goldilocks Problem

A model can fail in two opposite directions. You need it "just right".

UNDERFITTING: Too simple, misses pattern
JUST RIGHT ✓: Good generalisation
OVERFITTING: Memorised training data
26 / 40
Validation Theory

Data Splits

Three Splits, Three Roles

Training Set (~75%)

Model weights are updated using this data. The model learns here.

Validation Set (~25%)

Checked after each epoch to detect overfitting. Guides hyperparameter tuning.

Test Set (Locked)

Opened once at the end to report final performance. Never used during development.

On Kaggle: The test set is the competition data. The private leaderboard is only revealed after the competition closes — enforcing honest evaluation.

A good model shows: training loss ↓ and validation loss ↓ together. Danger sign: training loss keeps falling while validation loss starts rising.
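A simple random 75/25 split of row indices might look like the sketch below. The function name, the seed, and the use of NumPy are choices made for this example; the row count is the dataset size quoted earlier in the deck.

```python
import numpy as np


def random_split(n_rows, val_fraction=0.25, seed=42):
    """Shuffle row indices once and hold back val_fraction for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_val = int(n_rows * val_fraction)
    val_idx = np.sort(order[:n_val])
    train_idx = np.sort(order[n_val:])
    return train_idx, val_idx


train_idx, val_idx = random_split(36473)
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the validation set identical between runs, so changes in the metric reflect changes to the model rather than to the split.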

27 / 40
Validation Theory

Splitting Strategy

Not All Splits Are Equal

A random split is fine for most tasks — but some datasets need more care.

Dataset type | ❌ Bad split | ✓ Good split
Time-series (sales) | Random rows → future leaks into training | Last N weeks as validation
Face recognition | Random photos → same person in train & val | Group by person ID
Recommender systems | Random rows → same user in both sets | Group by user ID
Patent phrases ← Us | (none listed) | Random OK, but check anchor duplicates

Rule: Ask "what would the model see in the real world that it never saw during training?" Design your split to simulate that.
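One way to act on the "check anchor duplicates" note is to split by group, so that every row sharing an anchor lands on the same side. The sketch below uses plain Python; the function name is hypothetical and two of the anchor phrases are invented to give the example four groups.

```python
import random


def group_split(groups, val_fraction=0.25, seed=0):
    """Assign whole groups to validation so no group appears in both sets."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    val_groups = set(unique[:n_val])
    train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
    val_idx = [i for i, g in enumerate(groups) if g in val_groups]
    return train_idx, val_idx


# One anchor per row; "wire mesh" and "gear ratio" are made-up examples.
anchors = ["abatement", "abatement", "table leg", "table leg",
           "wire mesh", "wire mesh", "gear ratio", "gear ratio"]
train_idx, val_idx = group_split(anchors)

train_anchors = {anchors[i] for i in train_idx}
val_anchors = {anchors[i] for i in val_idx}
print(train_anchors, val_anchors)
```

The same pattern covers the person-ID and user-ID rows of the table: pass that ID column as `groups`.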

28 / 40
Validation Theory

Cross-validation

Cross-validation vs a Fixed Hold-out Set

k-fold cross-validation

  • Split data into k equal parts
  • Train k times, each with a different fold as validation
  • Average the k scores for a stable estimate
  • Use for: comparing models / benchmarking
  • Downside: k× slower

Fixed hold-out set ← What we use

  • Split once, keep the same val set throughout
  • Tune hyperparameters against it
  • Use for: iterative development and competitions
  • Downside: risk of overfitting to the specific split

⚠️ If you tune too many hyperparameters against the same validation set, you will slowly overfit to it. This is "leakage through hyperparameter search".
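For comparison with the fixed hold-out set, here is a minimal sketch of how k-fold index sets can be generated. It uses contiguous folds without shuffling to stay short; in practice you would usually shuffle the rows first. The function name is an assumption for this example.

```python
def k_fold_indices(n_rows, k=5):
    """Return k (train_indices, val_indices) pairs covering every row once."""
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        folds.append((train, val))
        start += size
    return folds


folds = k_fold_indices(10, k=3)
for train, val in folds:
    print(len(train), val)
```

Each row is used for validation exactly once, which is where the k× training cost comes from.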

29 / 40
Metrics

Two Different Things

Metric vs Loss — Don't Confuse Them

Loss function

  • Used internally during training to compute gradients
  • Must be smooth and differentiable
  • Examples: MSE, MAE, Cross-entropy
  • Optimiser minimises this every step

Metric

  • What stakeholders actually care about
  • Can be non-smooth or non-differentiable
  • Examples: Accuracy, F1 score, Pearson r
  • You report this on validation and test sets

⚠️ A model can lower its loss every epoch while the metric stays flat. Always track and report the metric — not just the loss — when evaluating progress.

30 / 40
Metrics

Our Competition Metric

Pearson Correlation Coefficient (r)

Kaggle measures our model using Pearson r — a number between −1 and +1 that measures how well our predicted scores move in the same direction as the true scores.

r value | What it looks like in a scatter plot
+1.0 | Perfect: points fall exactly on an upward straight line
+0.68 | Good: clear upward trend visible
+0.43 | Moderate: cloud with slight slope
+0.20 | Weak: slope barely visible
0.0 | None: no linear relationship between predictions and truth

Pearson r measures linear correlation. It doesn't care about the absolute scale of predictions — only whether high predictions correspond to high truth values.
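For reference, the standard definition of Pearson r is written out below, with $x_i$ standing for the true scores and $y_i$ for the predictions (this notation is introduced here for illustration).

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

Replacing every prediction $y_i$ with $a\,y_i + b$ for any $a > 0$ scales the numerator and the denominator by the same factor $a$, so r is unchanged. That is the sense in which the absolute scale of the predictions does not matter.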

31 / 40
Metrics

Worked Example

Calculating Pearson r by Hand

Let's see what r looks like with 5 prediction/truth pairs, and what happens when we add an outlier.

Pair | True score | Predicted | Difference
1 | 0.25 | 0.30 | +0.05 ✓
2 | 0.50 | 0.45 | −0.05 ✓
3 | 0.75 | 0.80 | +0.05 ✓
4 | 1.00 | 0.95 | −0.05 ✓
5 | 0.00 | 0.05 | +0.05 ✓
+Outlier | 0.50 | 0.00 | −0.50 ✗✗✗

Without outlier: r ≈ 0.993, near perfect

With outlier: r ≈ 0.84, so one bad prediction out of six pulls r down noticeably
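The correlation for the table above can be recomputed with NumPy, which is a useful check on any hand calculation. The helper name `pearson_r` is chosen for this sketch; `np.corrcoef` returns the full correlation matrix, so the off-diagonal entry is the value of interest.

```python
import numpy as np

# The five well-behaved pairs from the table.
true_scores = np.array([0.25, 0.50, 0.75, 1.00, 0.00])
predicted = np.array([0.30, 0.45, 0.80, 0.95, 0.05])


def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


r_clean = pearson_r(true_scores, predicted)

# Add the outlier row: true score 0.50, predicted 0.00.
r_outlier = pearson_r(np.append(true_scores, 0.50), np.append(predicted, 0.00))

print(round(r_clean, 3), round(r_outlier, 3))
```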

32 / 40
Metrics

Practical Implication

Outliers Can Wreck Your Kaggle Score

Because Pearson r is sensitive to large errors on individual rows, you need to make sure no single prediction is wildly wrong.

1. After training: inspect the raw model outputs
   The model's regression head has no bounds, so it can produce values outside [0, 1].
2. Clip predictions to the valid range
   Force all predictions to stay between 0.0 and 1.0.
3. This simple step can improve your leaderboard score
   Without clipping, a few extreme values can drag your Pearson r down sharply.

📌 Lesson: When Pearson r is your metric, eliminating large errors on any single row is more important than improving average accuracy.
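A small sketch of the effect of clipping is shown below, using NumPy's `np.clip`. The true scores and raw outputs are made-up numbers chosen so that one prediction overshoots badly; they are not results from the actual model.

```python
import numpy as np

true_scores = np.array([0.00, 0.25, 0.50, 0.75, 1.00])
raw_preds = np.array([0.05, 0.20, 0.55, 0.80, 2.50])   # last value overshoots (made up)

# Force every prediction into the valid score range.
clipped_preds = np.clip(raw_preds, 0.0, 1.0)

r_raw = float(np.corrcoef(true_scores, raw_preds)[0, 1])
r_clipped = float(np.corrcoef(true_scores, clipped_preds)[0, 1])

print(clipped_preds)
print(round(r_raw, 3), round(r_clipped, 3))
```

In this toy case the overshoot bends the relationship away from a straight line, and clipping restores most of the lost correlation. How much clipping helps on real predictions depends on how many outputs fall outside the range.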

33 / 40
Building the Pipeline

Configure the Training Settings

TrainingArguments is a config object — it holds all your training hyperparameters in one place. Think of it as the equivalent of fastai's fit_one_cycle() options.

Setting | What it does | Our value
learning_rate | How big each gradient step is | 8e-5
num_train_epochs | How many times to go through all training data | 4
warmup_ratio | Fraction of steps to ramp LR up slowly at start | 0.1
lr_scheduler_type | How LR changes over training (cosine = smooth decay) | cosine
weight_decay | Regularisation: prevents individual weights getting too large | 0.01
evaluation_strategy (renamed eval_strategy in recent transformers releases) | When to check validation performance | "epoch" (once per epoch)
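To see what warmup_ratio and a cosine schedule do together, the learning rate at each step can be sketched in plain Python. This is an illustration of the general shape (linear ramp, then half-cosine decay to zero), written from scratch for this example rather than copied from the library; the function name and the 1,000-step total are assumptions.

```python
import math


def lr_at_step(step, total_steps, peak_lr=8e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup steps.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

The slow start protects the pretrained weights from large early updates while the new head is still random, and the smooth decay lets training settle at the end.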
34 / 40
Building the Pipeline

Build the Model Object

We load the pretrained model and attach a new randomly-initialised output head sized to our task (1 output = 1 similarity score).

Pretrained DeBERTa-v3-small: weights already trained on 100GB+ of text
+ New Linear Head: 768 inputs → 1 output (random init)

The pretrained layers hold valuable knowledge. We keep them and update them slowly.

The new head starts random and must learn quickly from task-specific data.

AutoModelForSequenceClassification handles all of this automatically — just pass the checkpoint name and num_labels=1.

35 / 40
Building the Pipeline

Create the Trainer

The Trainer is the Hugging Face equivalent of fastai's Learner. It bundles everything together so training is one function call.

The Trainer needs:

  • model — the DeBERTa model with classification head
  • args — the TrainingArguments config from Step 8
  • train_dataset — tokenised training rows
  • eval_dataset — tokenised validation rows
  • compute_metrics — our Pearson r function
  • data_collator — handles dynamic padding for each batch

Once created, training starts with trainer.train(). That's it. The Trainer handles batching, GPU transfer, gradient accumulation, logging, and checkpointing.
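The slides do not show the Pearson r function passed as compute_metrics, so here is one possible sketch. It assumes the function receives a pair of (predictions, labels) arrays and returns a dictionary of named metrics; the name `compute_pearson` and the key `pearson_r` are choices made for this example. Predictions from a one-output head typically arrive with shape (N, 1), so both arrays are flattened first.

```python
import numpy as np


def compute_pearson(eval_pred):
    """Return Pearson r between flattened predictions and labels."""
    predictions, labels = eval_pred
    predictions = np.asarray(predictions, dtype=float).reshape(-1)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    return {"pearson_r": float(np.corrcoef(predictions, labels)[0, 1])}


# Quick check with made-up values shaped like a regression head's output.
labels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
predictions = (0.5 * labels + 0.1).reshape(-1, 1)   # perfectly linear in the labels
metrics = compute_pearson((predictions, labels))
print(metrics)
```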

36 / 40
Results

Training Snapshot

Running the Training Loop

One epoch through about 27,000 training rows takes about 5 minutes on a free Kaggle GPU. During training, you see a live log like this:

Epoch 1/4
train_loss: 0.241  |  eval_loss: 0.198  |  pearson_r: 0.783
Epoch 2/4
train_loss: 0.187  |  eval_loss: 0.164  |  pearson_r: 0.821
Epoch 3/4
train_loss: 0.152  |  eval_loss: 0.148  |  pearson_r: 0.834

📈 Both training loss and validation loss are decreasing together — the model is generalising, not overfitting. This is exactly what we want to see.

37 / 40
Results

After 3 Epochs

r = 0.834: Why Does It Work So Fast?

0.834 Pearson r on validation set after just 3 epochs of fine-tuning

This result is remarkable. We trained for roughly 15 minutes (three epochs at about 5 minutes each) and achieved strong performance. The reason is transfer learning.

1. DeBERTa already knows English
   Pretraining on billions of sentences gives the model rich semantic knowledge.
2. Fine-tuning only needs small adjustments
   The model barely needs to change because it already understands meaning and similarity.
3. This is the power of pretraining + fine-tuning
   A very large pretraining compute budget, distilled into a free downloadable checkpoint.
38 / 40
Results

Submission

Post-processing Predictions and Submitting

1. Run predictions on the test set
   trainer.predict() returns raw float scores, which can fall outside [0, 1].
2. Always inspect outputs before submitting
   Are any values negative? Greater than 1? The regression head is unbounded, so small overshoots are expected; large ones are a sign something went wrong.
3. Clip to [0, 1] with torch.clamp() (or np.clip() on NumPy arrays)
   This simple step can improve your correlation score on the leaderboard.
4. Save to CSV and upload to Kaggle
   The submission file needs two columns: id and score.
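The inspect, clip, and save steps could be sketched as below. The row identifiers and raw scores are placeholders, and the ID column name is held in a variable because it should be copied from the competition's sample submission file rather than trusted from this sketch. The CSV is written to an in-memory buffer here; with a real submission you would pass a file path to `to_csv`.

```python
import io

import numpy as np
import pandas as pd

ID_COLUMN = "id"   # assumption: confirm against the competition's sample submission

row_ids = ["row_a", "row_b", "row_c"]         # placeholder identifiers
raw_scores = np.array([-0.03, 0.47, 1.12])    # made-up raw model outputs

# Inspect before clipping: how many predictions are out of range?
n_out_of_range = int(((raw_scores < 0.0) | (raw_scores > 1.0)).sum())
print("out of range:", n_out_of_range)

submission = pd.DataFrame({
    ID_COLUMN: row_ids,
    "score": np.clip(raw_scores, 0.0, 1.0),
})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)        # index=False keeps exactly two columns
csv_text = buffer.getvalue()
print(csv_text)
```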
39 / 40
Ethics

SLO4 — Responsible AI

Ethics Check: NLP Models

Before deploying any NLP model, you must ask these questions about your data and use case.

Potential misuse risks

  • Plagiarism detection systems
  • Surveillance of communications
  • IP theft / competitive intelligence
  • Automated rejection without explanation

Required practices

  • Document data provenance
  • Record model limitations
  • Declare intended use
  • Test for domain bias

Patent data question: Who owns the text in a patent corpus? Training on publicly-filed patents is generally allowed, but check jurisdiction-specific rules before deploying commercially.

Model card: When you submit your project, document what your model can and cannot do, where it might fail, and how it should — and shouldn't — be used.

40 / 40