CP3501 Deep Learning · Week 5

Natural Language Processing & Transformers

From text to predictions — using pretrained models to solve real-world NLP tasks

FastAI Lesson 4 · Hugging Face Transformers · JCU Brisbane

1 / 40

Recap

Where We've Been

The Deep Learning Recipe

Every deep learning task we've seen shares the same core loop — only the input data changes.

Raw Data

images, tables, or text

→

Preprocess

turn data into numbers

→

Model

learns patterns from numbers

→

Prediction

a useful output

The key insight: The training loop — forward pass, compute loss, backpropagate, update weights — is identical whether you're classifying cat photos or analysing legal documents.
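The four steps named above can be written out for the smallest possible model. This is a minimal sketch in plain Python, assuming a one-weight model y = w * x, a mean-squared-error loss, and made-up toy data; none of these specifics come from the lecture, and a real framework computes the gradient automatically.

```python
# Toy data generated from y = 3x (an assumption for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0    # the single weight we want to learn
lr = 0.01  # learning rate

for epoch in range(200):
    # 1. Forward pass: compute predictions with the current weight.
    preds = [w * x for x in xs]
    # 2. Compute loss: mean squared error between predictions and targets.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backpropagate: d(loss)/dw, derived by hand for this tiny model.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Update weights: take a small step against the gradient.
    w -= lr * grad

print(round(w, 3))  # 3.0
```

Swapping the toy lists for image pixels or token IDs changes the data and the model, not the shape of this loop.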

2 / 40

Recap

Data Types

Three Modalities, Same Principles

Data type	Raw form	Preprocessing	Model type
Vision	Image files	Resize → pixel values	CNN
Tabular	CSV rows	Normalise, one-hot encode	Linear + ReLU layers
Text ← Today	Sentences	Tokenise → integer IDs	Transformer

Same training loop: a loss function, gradient descent, and an optimiser. Text just needs a different kind of preprocessing before the numbers go in.

3 / 40

Library Switch

Tools

Why We're Switching to Hugging Face

fastai (Lessons 1–3)

Very high-level — one function does a lot
Great for vision and tabular data
Ideal for building intuition quickly

🤗 Hugging Face (Today)

Mid-level — you see each step clearly
Best-in-class NLP models
The industry standard for text AI

Why both matter: Seeing the same ideas expressed in two different libraries deepens understanding. fastai integrates Hugging Face models anyway — skills transfer both ways.

4 / 40

Library Switch

Translation Guide

fastai → Hugging Face: What Maps to What

Everything you learned still applies. The concepts have new names.

fastai concept	Hugging Face equivalent	What it does
DataBlock	Dataset + .map()	Prepares and transforms data
Learner	Trainer	Bundles model, data, loss, metrics
fit_one_cycle()	TrainingArguments (cosine + warmup)	Configures the learning-rate schedule used during training
learn.predict()	trainer.predict()	Gets predictions from a trained model
fine_tune()	trainer.train()	Adapts pretrained weights to your task

5 / 40

NLP Foundations

What Is NLP?

Teaching Computers to Work with Language

Natural Language Processing (NLP) is the field of making computers understand, generate, and work with human text and speech.

Tasks NLP can solve

Is this review positive or negative?
Is this email spam?
Do these two sentences mean the same thing?
Who wrote this document?

Why it's hard

The same word means different things in different contexts
Sarcasm, idioms, abbreviations
Computers need numbers — not words

6 / 40

NLP Foundations

The Pipeline

How Text Becomes a Prediction

Every NLP model follows this sequence. We'll build each step today.

The numbers in brackets show the shape of the data as it flows through. 128 = sequence length. 768 = hidden size of BERT-base.

7 / 40

NLP Foundations

Preprocessing

Why Text Needs Tokenisation

Neural networks only work with numbers. We need a reliable way to convert any string of text into a fixed list of integers. This process is called tokenisation.

1

Split the text into pieces called tokens. A token is roughly a word, part of a word, or a punctuation mark.

2

Look up each token's ID in the vocabulary. Every model has a vocabulary: a dictionary mapping token → number.

3

Pad or truncate to a fixed length. All inputs must be the same length so they can be processed in batches.
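The three steps above can be sketched in a few lines. This is a toy illustration, assuming a hypothetical six-entry vocabulary and a naive whitespace split; real tokenisers use sub-word rules and vocabularies with tens of thousands of entries.

```python
# Hypothetical toy vocabulary (token -> ID).
VOCAB = {"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4, ".": 5}

def encode(text, max_len=6):
    # Step 1: split into tokens (lowercase words, with "." split off).
    tokens = text.lower().replace(".", " .").split()
    # Step 2: look up each token's ID; unknown tokens map to [UNK].
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: truncate to max_len, then pad with [PAD] up to max_len.
    ids = ids[:max_len]
    return ids + [VOCAB["[PAD]"]] * (max_len - len(ids))

print(encode("The cat sat."))  # [2, 3, 4, 5, 0, 0]
print(encode("The dog sat."))  # [2, 1, 4, 5, 0, 0]
```

The second call shows the weakness of a whole-word vocabulary: "dog" has no entry, so it collapses to the unknown ID.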

8 / 40

NLP Foundations

Vocabulary Design

The Problem with Using Whole Words

There are over a million English words. If each word gets its own ID, the vocabulary becomes impossibly large.

Character-level

Split into individual letters.
Tiny vocabulary (~100) but loses meaning — 'c','a','t' doesn't tell you it's an animal.

Word-level

Keep whole words. "Unbelievable" gets its own slot. Problem: new or rare words have no ID — out-of-vocabulary.

Sub-word ✓ Best

Split at meaningful boundaries. "unbelievable" → "un" + "believ" + "able". Handles new words, manageable vocabulary (~30k).

Modern NLP uses sub-word tokenisation. BERT uses WordPiece. GPT uses BPE (Byte-Pair Encoding). Both give small vocabularies that handle any text.
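To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match splitting, the idea behind WordPiece-style tokenisers. The four-piece vocabulary is hypothetical and chosen so the "unbelievable" example works; real tokenisers also mark word boundaries, and BPE builds its vocabulary differently (by merging frequent pairs).

```python
# Hypothetical sub-word vocabulary for illustration only.
SUBWORDS = {"un", "believ", "able", "break"}

def split_subwords(word, vocab=SUBWORDS):
    """Greedy longest-match-first split of one word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return ["[UNK]"]  # no piece matched: the word cannot be represented
    return pieces

print(split_subwords("unbelievable"))  # ['un', 'believ', 'able']
print(split_subwords("unbreakable"))   # ['un', 'break', 'able']
```

"unbreakable" never appears in the vocabulary as a whole word, yet it is still covered by pieces the vocabulary already contains.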

9 / 40

NLP Foundations

Sub-word Demo

Watching the Tokeniser Work

The underscore ( _ ) marks the start of a new word. Everything without an underscore is a continuation.

Input → "Good AI from fast.ai"

Tokens → _Good | _A | I | _from | _fast | . | ai

Input → "A platypus is an ornithorhynchus"

Tokens → _A | _platypus | _is | _an | _or | nith | or | hyn | chus

Those tokens become integer IDs

1 | 54453 | 435 | 294 | 336 | 5753 | 346 | 2

10 / 40

Pretraining

How BERT Learned Language

The Masked Language Model Trick

BERT was trained on hundreds of millions of sentences from Wikipedia and books — with zero human labels. Here's how:

1

Take a normal sentence: "The patient was admitted to the hospital with chest pain."

2

Randomly hide about 15% of the tokens: "The patient was admitted to the [MASK] with chest pain."

3

Train the model to predict the hidden tokens. To predict "hospital", the model must learn grammar, facts, and context.

Result: After seeing billions of sentences, BERT's weights encode rich knowledge about language, facts, and meaning — ready to be fine-tuned for your specific task.
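The masking step can be sketched as follows. This is a simplified illustration, assuming whole-word tokens and a fixed random seed; BERT's actual recipe works on sub-word tokens and sometimes substitutes a random token or leaves the original in place instead of always writing [MASK].

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the hidden token or None."""
    rng = random.Random(seed)
    # Hide at least one token so short sentences still give a training signal.
    n_mask = max(1, round(len(tokens) * mask_rate))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in hidden else t for i, t in enumerate(tokens)]
    labels = [t if i in hidden else None for i, t in enumerate(tokens)]
    return masked, labels

sentence = "The patient was admitted to the hospital with chest pain .".split()
masked, labels = mask_tokens(sentence)
print(" ".join(masked))                        # the model's input
print([l for l in labels if l is not None])    # what it must predict
```

No human wrote the labels: they are simply the tokens that were hidden, which is why pretraining needs no annotation.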

11 / 40

Pretraining

Fine-tuning

Small, Focused Edits — Not Starting from Scratch

Fine-tuning takes a pretrained model and gently adjusts its weights for your specific task. Most weights barely change.

Training from scratch

All weights start random
Needs millions of labelled examples
Takes days or weeks to train
Requires huge compute budget

Fine-tuning ✓ What we do

Weights start already good from pretraining
Works with a few thousand examples
Takes minutes on a free GPU
Often reaches near state-of-the-art results

12 / 40

Today's Task

The Problem

Kaggle: US Patent Phrase-to-Phrase Matching

A real competition where a company wants to automatically measure how similar two patent phrases are. This is exactly the kind of real-world NLP task you'll encounter in industry.

Anchor phrase: "table leg"

Target phrase: "supporting member"

Context (patent section): A47B (Furniture)

Goal — predict similarity score: 0 = different, 0.5 = related, 1.0 = identical meaning

13 / 40

Today's Task

The Data

What's in the Dataset?

Column	Example	Meaning
anchor	abatement	The first phrase to compare
target	act of abating	The second phrase
context	A47	Patent classification section
score	0.5	Human-labelled similarity (0–1)

36,473 rows — but only 733 unique anchor phrases. Many targets paired with the same anchor.

106 unique contexts — the same phrase might mean different things in different patent sections.

14 / 40

Today's Task

Problem Framing

Turning Similarity into Classification

BERT is excellent at reading a single input and producing a score. We need to turn our three-column problem (anchor, target, context) into one input string.

1

Concatenate all three fields into one string. Add artificial labels so the model can tell the fields apart.

      TEXT1: A47; TEXT2: act of abating; ANC1: abatement

2

The model reads this as a standard text classification problem. Output: a single number representing similarity, ideally between 0 and 1 (the raw output is not bounded).

This is a common NLP pattern: Creative problem framing lets you apply a general-purpose model to any specific task.

15 / 40

Today's Task

Real-World Applications

Why Text Classification Powers Modern NLP

The same pipeline we're building today solves all of these problems.

Application	Input (the "document")	Output (the "class")
Sentiment analysis	Movie review	Positive / Negative
Spam detection	Email message	Spam / Not spam
Email triage	Customer message	Sales / Support / Complaint
Legal discovery	Legal document	In-scope / Out-of-scope
Phrase matching ← Today	Anchor + target + context	Similarity score 0–1

16 / 40

Building the Pipeline

Step 1 of 10

Load the Data and Have a Look

Always start by exploring your data. Use pandas to load the CSV and check what's inside before writing any model code.

1

Load the CSV file. pandas read_csv() gives you a table (called a DataFrame) of all 36k rows.

2

Look at the first few rows with df.head(). Are the column names what you expect? Do the values make sense?

3

Check for patterns. How many unique anchors? Any missing values? What's the score distribution?

Rule of thumb: Spend at least as long exploring your data as building your model. Surprises in the data are the most common cause of poor results.
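The three exploration checks can be tried without the competition file. This sketch uses the standard-library csv module on a tiny in-memory sample so it runs anywhere; the rows are a stand-in and the scores are illustrative, not taken from the real dataset. With pandas, the same checks would typically be df.head(), df['anchor'].nunique(), df.isna().sum(), and df['score'].value_counts().

```python
import csv
import io
from collections import Counter

# Toy stand-in for the competition CSV; rows and scores are illustrative.
SAMPLE_CSV = """anchor,target,context,score
abatement,abatement of pollution,A47,0.75
abatement,act of abating,A47,0.5
abatement,active catalyst,A47,0.25
table leg,supporting member,A47,0.5
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

print(rows[:2])                                   # look at the first few rows
unique_anchors = {r["anchor"] for r in rows}      # how many distinct anchors?
missing = sum(1 for r in rows for v in r.values() if v in ("", None))
score_counts = Counter(r["score"] for r in rows)  # score distribution

print(len(unique_anchors), missing, dict(score_counts))
```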

17 / 40

Building the Pipeline

Step 2 of 10

Build the Input String

We combine anchor, target, and context into a single text column so BERT can read the whole thing at once.

What we create

        df['input'] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; ANC1: " + df.anchor
      

First 3 rows of the new input column

        TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement

        TEXT1: A47; TEXT2: act of abating; ANC1: abatement

        TEXT1: A47; TEXT2: active catalyst; ANC1: abatement

The artificial labels (TEXT1:, TEXT2:, ANC1:) help the model tell the three fields apart without extra code. This is a simple form of prompt engineering.

18 / 40

Building the Pipeline

Step 3 of 10

Convert to a Hugging Face Dataset

Hugging Face has its own Dataset format that's optimised for NLP tasks — it's faster and supports efficient batched transformations.

pandas DataFrame

Great for exploration and cleaning
Loaded entirely into memory
Use for the early data wrangling steps

HF Dataset ← Switch here

Memory-mapped — handles large files
Vectorised transforms — tokenise 36k rows in 6 seconds
The format the Trainer class works with directly

One line converts between them: Dataset.from_pandas(df)

19 / 40

Building the Pipeline

Step 4 of 10

Choose a Pretrained Model First

This decision must come before tokenisation — because the tokeniser must match the vocabulary of the pretrained model you choose.

⚠️ Common mistake: Tokenising with the wrong vocabulary. Always load the tokeniser that belongs to your chosen checkpoint.

1

Search the Hugging Face Hub. There are ~44,000 pretrained checkpoints. Search "patent" for domain-specific models.

2

Good general-purpose default for this task: microsoft/deberta-v3-small

3

For production: search for domain-specific models. A model pretrained on patent text will often outperform a general one.

20 / 40

Building the Pipeline

Step 5 of 10

Load the Matching Tokeniser

AutoTokenizer is Hugging Face's smart loader — you give it the checkpoint name and it downloads the correct vocabulary and tokenisation rules automatically.

Loading the tokeniser

        from transformers import AutoTokenizer

        # Load the tokeniser that matches the chosen checkpoint.
        tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")

This downloads the vocabulary + rules used when the original model was pretrained. Using a different tokeniser would be like giving the model a text in a different alphabet.

21 / 40

Building the Pipeline

Step 6 of 10

Tokenise the Whole Dataset

We apply our tokeniser to every row using .map() — Hugging Face's vectorised transform that processes all 36k rows in ~6 seconds.

What the tok_func does to each row:

Takes the input string we built in Step 2
Splits it into tokens, looks up each ID in the vocabulary
Pads short sequences to length 128 with a [PAD] token
Truncates long sequences at 128 tokens
Returns input_ids and attention_mask

attention_mask: A list of 1s and 0s telling the model which tokens are real words (1) and which are padding (0). Padding tokens should be ignored during learning.
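The relationship between padding and the attention mask can be shown with plain lists. This is a sketch under assumptions: the padding ID of 0 and the short max_len of 8 are chosen for readability, whereas the slides pad to 128.

```python
PAD_ID = 0  # assumed padding ID for this toy example

def pad_and_mask(ids, max_len=8, pad_id=PAD_ID):
    """Truncate or pad `ids` to max_len and build the matching attention mask."""
    ids = ids[:max_len]
    n_pad = max_len - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

print(pad_and_mask([1, 54453, 435, 294, 2]))
# {'input_ids': [1, 54453, 435, 294, 2, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0]}
```

A Hugging Face tokeniser returns the same two keys when called with padding="max_length", truncation=True, and max_length=128, so a tok_func can simply return that call's result.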

22 / 40

Building the Pipeline

Tokenised Row

What One Tokenised Example Looks Like

Let's inspect row 0 after tokenisation to see exactly what the model receives.

Original text

        'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement'
      

input_ids (first 8 of 128)

1 | 54453 | 435 | 294 | 336 | 5753 | 346 | 54453 | … (128 total)

Token ID 1 = [CLS] (start of sequence). Token ID 2 = [SEP] (separator). These special tokens help the model know where the input begins and ends.

23 / 40

Building the Pipeline

Step 7 of 10

Set Up the Labels

Hugging Face's Trainer expects the target column to be literally named "labels". Our column is named "score" — so we rename it.

The score column contains values like 0, 0.25, 0.5, 0.75, 1.0
After renaming it becomes the labels column
The Trainer reads labels automatically and uses them to compute loss

Binary / multi-class / regression all use the same "labels" key. The model's output size (num_labels) determines whether it's treated as classification or regression.

Since our scores are ordered numeric values between 0.0 and 1.0, we treat this as a regression problem. We set num_labels=1 when building the model.

24 / 40

⏸

Conceptual Interlude

Before we run any training, we need to understand how to measure whether our model is actually learning — or just memorising.

Why We Need a Validation Set

If you only measure performance on the data the model trained on, you'll always see improvement — even if the model is useless on new data.

The core problem: Training loss almost always goes down. But that doesn't mean the model is learning anything generalisable. It might just be memorising the training data.

Solution: Hold back a portion of data that the model never sees during training. Measure performance on this validation set to detect overfitting early.

25 / 40

Validation Theory

Overfitting vs Underfitting

The Goldilocks Problem

A model can fail in two opposite directions. You need it "just right".

26 / 40

Validation Theory

Data Splits

Three Splits, Three Roles

Training Set (~75%)

Model weights are updated using this data. The model learns here.

Validation Set (~25%)

Checked after each epoch to detect overfitting. Guides hyperparameter tuning.

Test Set (Locked)

Opened once at the end to report final performance. Never used during development.

On Kaggle: The test set is the competition data. The private leaderboard is only revealed after the competition closes — enforcing honest evaluation.

A good model shows: training loss ↓ and validation loss ↓ together. Danger sign: training loss keeps falling while validation loss starts rising.
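A random 75/25 split and the danger sign described above can both be expressed in a few lines. This is a minimal sketch: the seed, the helper names, and the loss values in the usage lines are assumptions for illustration.

```python
import random

def train_val_split(n_rows, val_frac=0.25, seed=42):
    """Shuffle row indices once and hold back val_frac of them for validation."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_val = int(n_rows * val_frac)
    return idx[n_val:], idx[:n_val]

def is_overfitting(train_losses, val_losses):
    """Danger sign: training loss still falling while validation loss rises."""
    return train_losses[-1] < train_losses[-2] and val_losses[-1] > val_losses[-2]

train_idx, val_idx = train_val_split(36_473)
print(len(train_idx), len(val_idx))                            # 27355 9118
print(is_overfitting([0.24, 0.19, 0.15], [0.20, 0.16, 0.15]))  # False
print(is_overfitting([0.24, 0.19, 0.15], [0.20, 0.16, 0.18]))  # True
```

Fixing the seed keeps the validation set identical between runs, so scores from different experiments stay comparable.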

27 / 40

Validation Theory

Splitting Strategy

Not All Splits Are Equal

A random split is fine for most tasks — but some datasets need more care.

Dataset type	❌ Bad split	✓ Good split
Time-series (sales)	Random rows → future leaks into training	Last N weeks as validation
Face recognition	Random photos → same person in train & val	Group by person ID
Recommender systems	Random rows → same user in both sets	Group by user ID
Patent phrases ← Us	—	Random OK, but check anchor duplicates

Rule: Ask "what would the model see in the real world that it never saw during training?" Design your split to simulate that.
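The "group by" splits in the table share one mechanism: split the group keys, not the rows. A minimal sketch, assuming rows are dictionaries and using made-up anchors; the function name and defaults are hypothetical.

```python
import random

def group_split(rows, key, val_frac=0.25, seed=42):
    """Split rows so every value of `key` lands entirely in train or validation."""
    groups = sorted({r[key] for r in rows})
    random.Random(seed).shuffle(groups)
    n_val = max(1, int(len(groups) * val_frac))
    val_groups = set(groups[:n_val])
    train = [r for r in rows if r[key] not in val_groups]
    val = [r for r in rows if r[key] in val_groups]
    return train, val

# Made-up rows: four anchors, two targets each.
rows = [{"anchor": a, "target": f"{a} variant {i}"}
        for a in ("abatement", "table leg", "gear box", "heat sink")
        for i in range(2)]
train, val = group_split(rows, key="anchor")
print({r["anchor"] for r in train} & {r["anchor"] for r in val})  # set()
```

For the patent data, grouping by anchor would test whether the model copes with anchors it has never seen, which is a stricter check than a purely random split.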

28 / 40

Validation Theory

Cross-validation

Cross-validation vs a Fixed Hold-out Set

k-fold cross-validation

Split data into k equal parts
Train k times, each with a different fold as validation
Average the k scores for a stable estimate
Use for: comparing models / benchmarking
Downside: k× slower

Fixed hold-out set ← What we use

Split once, keep the same val set throughout
Tune hyperparameters against it
Use for: iterative development and competitions
Downside: risk of overfitting to the specific split

⚠️ If you tune too many hyperparameters against the same validation set, you will slowly overfit to it. This is "leakage through hyperparameter search".
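For comparison with the fixed hold-out approach, k-fold index generation can be sketched like this. It is a bare-bones illustration that assumes the rows have already been shuffled; the function name is hypothetical.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, val_idx) pairs; each row is in validation exactly once."""
    idx = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = idx[start:start + size]
        train_idx = idx[:start] + idx[start + size:]
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in k_fold_indices(10, k=5):
    print(val_idx)
```

Each pass would train a fresh model on train_idx and score it on val_idx; averaging the k scores gives the stable estimate, at k times the training cost.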

29 / 40

Metrics

Two Different Things

Metric vs Loss — Don't Confuse Them

Loss function

Used internally during training to compute gradients
Must be differentiable (at least almost everywhere) so gradients can be computed
Examples: MSE, MAE, Cross-entropy
Optimiser minimises this every step

Metric

What stakeholders actually care about
Can be non-smooth or non-differentiable
Examples: Accuracy, F1 score, Pearson r
You report this on validation and test sets

⚠️ A model can lower its loss every epoch while the metric stays flat. Always track and report the metric — not just the loss — when evaluating progress.

30 / 40

Metrics

Our Competition Metric

Pearson Correlation Coefficient (r)

Kaggle measures our model using Pearson r — a number between −1 and +1 that measures how well our predicted scores move in the same direction as the true scores.

r value	What it looks like in a scatter plot
+1.0	Perfect linear fit: predictions rise in an exact straight line with the truth
+0.68	Good: clear upward trend visible
+0.43	Moderate: cloud with slight slope
+0.20	Weak: slope barely visible
0.0	None: no linear relationship between predictions and truth

Pearson r measures linear correlation. It doesn't care about the absolute scale of predictions — only whether high predictions correspond to high truth values.
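For reference, the standard definition of Pearson r is written out below, with y_i as the true scores and ŷ_i as the predictions (this notation is an assumption; the slides do not fix symbols).

```latex
r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})\,(\hat{y}_i - \bar{\hat{y}})}
         {\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\;
          \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}
```

Replacing every prediction ŷ_i with aŷ_i + b for any a > 0 scales the numerator and the denominator by the same factor a, so r is unchanged. That is the precise sense in which the metric ignores absolute scale.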

31 / 40

Metrics

Worked Example

Calculating Pearson r by Hand

Let's see what r looks like with 5 prediction/truth pairs, and what happens when we add an outlier.

Pair	True score	Predicted	Difference
1	0.25	0.30	+0.05 ✓
2	0.50	0.45	−0.05 ✓
3	0.75	0.80	+0.05 ✓
4	1.00	0.95	−0.05 ✓
5	0.00	0.05	+0.05 ✓
+Outlier	0.50	0.00	−0.50 ✗✗✗

Without outlier: r ≈ 0.993 (near perfect)

With outlier: r ≈ 0.84 (one bad prediction out of six pulls r down noticeably)
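The calculation can be checked directly from the table. This sketch recomputes r for the five pairs and again with the outlier row appended, using only the standard library; the function name is an assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

true = [0.25, 0.50, 0.75, 1.00, 0.00]
pred = [0.30, 0.45, 0.80, 0.95, 0.05]
print(round(pearson_r(true, pred), 3))                    # 0.993
print(round(pearson_r(true + [0.50], pred + [0.00]), 3))  # 0.838
```

The outlier leaves the numerator unchanged (its true score sits exactly at the mean) but inflates the spread of the predictions, which is what drags r down.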

32 / 40

Metrics

Practical Implication

Outliers Can Wreck Your Kaggle Score

Because Pearson r is sensitive to large errors on individual rows, you need to make sure no single prediction is wildly wrong.

1

After training: inspect the raw model outputs. The model's regression head has no bounds, so it can produce values outside [0, 1].

2

Clip predictions to the valid range. Force all predictions to stay between 0.0 and 1.0.

3

This simple step can improve your leaderboard score. Without clipping, a few extreme values can badly hurt your Pearson r.

📌 Lesson: When Pearson r is your metric, eliminating large errors on any single row is more important than improving average accuracy.
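Clipping itself is a one-liner. The sketch below uses plain Python lists and made-up raw outputs; the helper name is hypothetical.

```python
def clip01(preds):
    """Clamp every prediction into the valid score range [0.0, 1.0]."""
    return [min(1.0, max(0.0, p)) for p in preds]

raw = [-0.12, 0.31, 0.78, 1.07]  # made-up raw outputs from a regression head
print(clip01(raw))               # [0.0, 0.31, 0.78, 1.0]
```

On arrays or tensors the same operation is usually done in one call, for example np.clip(preds, 0, 1) in NumPy or torch.clamp(preds, 0, 1) in PyTorch.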

33 / 40

Building the Pipeline

Step 8 of 10

Configure the Training Settings

TrainingArguments is a config object — it holds all your training hyperparameters in one place. Think of it as the equivalent of fastai's fit_one_cycle() options.

Setting	What it does	Our value
learning_rate	How big each gradient step is	8e-5
num_train_epochs	How many times to go through all training data	4
warmup_ratio	Fraction of steps to ramp LR up slowly at start	0.1
lr_scheduler_type	How LR changes over training (cosine = smooth decay)	cosine
weight_decay	Regularisation — prevents individual weights getting too large	0.01
evaluation_strategy	When to check validation performance	"epoch" (once per epoch)
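To see what warmup_ratio and the cosine scheduler do to the learning rate, the schedule's shape can be sketched in plain Python. The settings dictionary copies the values from the table; the lr_at function is an illustrative approximation of linear warmup followed by cosine decay, not the library's own code, and the step count is hypothetical.

```python
import math

# The table's values as keyword-style settings.
settings = {
    "learning_rate": 8e-5,
    "num_train_epochs": 4,
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "cosine",
    "weight_decay": 0.01,
}

def lr_at(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimiser steps
for step in (0, 50, 100, 550, 1000):
    lr = lr_at(step, total, settings["learning_rate"], settings["warmup_ratio"])
    print(step, f"{lr:.2e}")
```

The rate climbs for the first 10% of steps, peaks at 8e-5, and then glides down, which protects the pretrained weights from large early updates.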

34 / 40

Building the Pipeline

Step 9 of 10

Build the Model Object

We load the pretrained model and attach a new randomly-initialised output head sized to our task (1 output = 1 similarity score).

The pretrained layers hold valuable knowledge. We keep them and update them slowly.

The new head starts random and must learn quickly from task-specific data.

AutoModelForSequenceClassification handles all of this automatically — just pass the checkpoint name and num_labels=1.

35 / 40

Building the Pipeline

Step 10 of 10

Create the Trainer

The Trainer is the Hugging Face equivalent of fastai's Learner. It bundles everything together so training is one function call.

The Trainer needs:

model — the DeBERTa model with classification head
args — the TrainingArguments config from Step 8
train_dataset — tokenised training rows
eval_dataset — tokenised validation rows
compute_metrics — our Pearson r function
data_collator — handles dynamic padding for each batch

Once created, training starts with trainer.train(). That's it. The Trainer handles batching, GPU transfer, gradient accumulation, logging, and checkpointing.

36 / 40

Results

Training Snapshot

Running the Training Loop

One epoch through 27,000 training rows takes about 5 minutes on a free Kaggle GPU. During training, you see a live log like this:

      Epoch 1/4

      train_loss: 0.241  |  eval_loss: 0.198  |  pearson_r: 0.783

      Epoch 2/4

      train_loss: 0.187  |  eval_loss: 0.164  |  pearson_r: 0.821

      Epoch 3/4

      train_loss: 0.152  |  eval_loss: 0.148  |  pearson_r: 0.834

📈 Both training loss and validation loss are decreasing together — the model is generalising, not overfitting. This is exactly what we want to see.

37 / 40

Results

After 3 Epochs

r = 0.834: Why Does It Work So Fast?

0.834 Pearson r on the validation set after just 3 epochs of fine-tuning

This result is remarkable. We trained for roughly 15 minutes and achieved strong performance. The reason is transfer learning.

1

DeBERTa already knows English. Pretraining on billions of sentences gives the model rich semantic knowledge.

2

Fine-tuning only needs small adjustments. The model barely needs to change: it already understands meaning and similarity.

3

This is the power of pretraining + fine-tuning. An enormous amount of compute, distilled into a free downloadable checkpoint.

38 / 40

Results

Submission

Post-processing Predictions and Submitting

1

Run predictions on the test set. trainer.predict() returns raw float scores, and these can go outside [0, 1].

2

Always inspect outputs before submitting. Are any values negative? Greater than 1? Small overshoots are expected from an unbounded output; large ones are a sign something went wrong.

3

Clip to [0, 1] with torch.clamp(). This simple step can improve your correlation score on the leaderboard.

4

Save to CSV and upload to Kaggle. The submission file needs two columns: id and score.

39 / 40

Ethics

SLO4 — Responsible AI

Ethics Check: NLP Models

Before deploying any NLP model, you must ask these questions about your data and use case.

Potential misuse risks

Plagiarism detection systems

Surveillance of communications

IP theft / competitive intelligence

Automated rejection without explanation

Required practices

Document data provenance

Record model limitations

Declare intended use

Test for domain bias

Patent data question: Who owns the text in a patent corpus? Training on publicly-filed patents is generally allowed, but check jurisdiction-specific rules before deploying commercially.

Model card: When you submit your project, document what your model can and cannot do, where it might fail, and how it should — and shouldn't — be used.

40 / 40
