From text to predictions — using pretrained models to solve real-world NLP tasks
Every deep learning task we've seen shares the same core loop — only the input data changes.
The key insight: The training loop — forward pass, compute loss, backpropagate, update weights — is identical whether you're classifying cat photos or analysing legal documents.
| Data type | Raw form | Preprocessing | Model type |
|---|---|---|---|
| Vision | Image files | Resize → pixel values | CNN |
| Tabular | CSV rows | Normalise, one-hot encode | Linear + ReLU layers |
| Text ← Today | Sentences | Tokenise → integer IDs | Transformer |
The loss function, gradient descent, and optimiser all work the same way. Text just needs a different kind of preprocessing before the numbers go in.
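To make the four steps concrete, here is a minimal sketch of the loop on a toy problem (learning y = 2x with a single weight). The data, learning rate, and epoch count are illustrative assumptions; a real model has many weights and uses a library to compute gradients.

```python
# Toy data: y = 2x, so the ideal weight is 2.0 (hypothetical example).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

weight = 0.0          # the single parameter the model learns
learning_rate = 0.01

for epoch in range(200):
    # 1. Forward pass: compute predictions from the current weight.
    preds = [weight * x for x in xs]
    # 2. Loss: mean squared error between predictions and targets.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: gradient of the loss with respect to the weight.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Update: take a small step against the gradient.
    weight -= learning_rate * grad

print(f"learned weight: {weight:.3f}, final loss: {loss:.6f}")
```

Swap the lists of numbers for pixel values, table rows, or token IDs and the four steps stay the same.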
Why both matter: Seeing the same ideas expressed in two different libraries (fastai and Hugging Face) deepens understanding. fastai can work with Hugging Face models anyway, so skills transfer both ways.
Everything you learned still applies. The concepts have new names.
| fastai concept | Hugging Face equivalent | What it does |
|---|---|---|
| DataBlock | Dataset + .map() | Prepares and transforms data |
| Learner | Trainer | Bundles model, data, loss, metrics |
| fit_one_cycle() | TrainingArguments (cosine + warmup) | Configures the learning-rate schedule for the training loop |
| learn.predict() | trainer.predict() | Gets predictions from a trained model |
| fine_tune() | trainer.train() | Adapts pretrained weights to your task |
Natural Language Processing (NLP) is the field of making computers understand, generate, and work with human text and speech.
Every NLP model follows this sequence. We'll build each step today.
The numbers in brackets show the shape of the data as it flows through. 128 = sequence length. 768 = hidden size of BERT-base.
Neural networks only work with numbers. We need a reliable way to convert any string of text into a list of integers. This process is called tokenisation.
By some counts, English has over a million words. If each word gets its own ID, the vocabulary becomes unmanageably large.
Character-level: Split into individual letters. Tiny vocabulary (~100), but single letters carry little meaning: 'c', 'a', 't' doesn't tell you it's an animal.
Word-level: Keep whole words, so "unbelievable" gets its own slot. Problem: new or rare words have no ID (they are out-of-vocabulary).
Sub-word: Split at meaningful boundaries. "unbelievable" → "un" + "believ" + "able". Handles new words with a manageable vocabulary (~30k).
Modern NLP uses sub-word tokenisation. BERT uses WordPiece. GPT uses BPE (Byte-Pair Encoding). Both keep the vocabulary manageable while still covering almost any text.
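A rough sketch of how sub-word splitting can work: greedy longest-match against a vocabulary, in the style of WordPiece. The six-entry vocabulary and the function name `tokenise_word` are hypothetical, chosen only so the "unbelievable" example splits as described; the `##` prefix is WordPiece's marker for a piece that continues a word.

```python
# Hypothetical toy vocabulary; a real one has roughly 30k entries.
# "##" marks a piece that continues the current word (WordPiece convention).
VOCAB = {"[UNK]": 0, "un": 1, "##believ": 2, "##able": 3, "cat": 4, "##s": 5}

def tokenise_word(word, vocab=VOCAB):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

tokens = tokenise_word("unbelievable")
ids = [VOCAB[t] for t in tokens]   # tokens become integer IDs via lookup
print(tokens, ids)
```

A word with no matching pieces, such as "dog" in this toy vocabulary, falls back to the unknown token. With a realistic vocabulary that fallback is rare, because single characters are usually included as pieces.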
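The underscore ( _ ) marks the start of a new word. Everything without an underscore is a continuation. (This is the SentencePiece convention; WordPiece instead marks continuation pieces with a ## prefix.)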
Input → "Good AI from fast.ai"
Input → "A platypus is an ornithorhynchus"
Those tokens become integer IDs
BERT was trained on hundreds of millions of sentences from Wikipedia and books — with zero human labels. Here's how:
Result: After reading billions of words, BERT's weights encode rich knowledge about language, facts, and meaning, ready to be fine-tuned for your specific task.
Fine-tuning takes a pretrained model and gently adjusts its weights for your specific task. Most weights barely change.
This is a real Kaggle competition in which the host wants to automatically measure how similar two patent phrases are. It is exactly the kind of real-world NLP task you'll encounter in industry.
Anchor phrase: "table leg"
Target phrase: "supporting member"
Context (patent section): A47B (Furniture)
Goal — predict similarity score: 0 = different, 0.5 = related, 1.0 = identical meaning
| Column | Example | Meaning |
|---|---|---|
| anchor | abatement | The first phrase to compare |
| target | act of abating | The second phrase |
| context | A47 | Patent classification section |
| score | 0.5 | Human-labelled similarity (0–1) |
36,473 rows — but only 733 unique anchor phrases. Many targets paired with the same anchor.
106 unique contexts — the same phrase might mean different things in different patent sections.
BERT is excellent at reading a single input and producing a score. We need to turn our three-column problem (anchor, target, context) into one input string.
This is a common NLP pattern: Creative problem framing lets you apply a general-purpose model to any specific task.
The same pipeline we're building today solves all of these problems.
| Application | Input (the "document") | Output (the "class") |
|---|---|---|
| Sentiment analysis | Movie review | Positive / Negative |
| Spam detection | Email message | Spam / Not spam |
| Email triage | Customer message | Sales / Support / Complaint |
| Legal discovery | Legal document | In-scope / Out-of-scope |
| Phrase matching ← Today | Anchor + target + context | Similarity score 0–1 |
Always start by exploring your data. Use pandas to load the CSV and check what's inside before writing any model code.
Rule of thumb: Spend at least as long exploring your data as building your model. Surprises in the data are the most common cause of poor results.
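A first look at the data might go something like the sketch below. The three-row DataFrame is a hypothetical stand-in (its scores are invented for illustration); in practice you would load the real training CSV with `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical miniature stand-in for the competition CSV; in practice
# you would call pd.read_csv(...) on the real training file instead.
df = pd.DataFrame({
    "anchor":  ["abatement", "abatement", "table leg"],
    "target":  ["act of abating", "active catalyst", "supporting member"],
    "context": ["A47", "A47", "A47"],
    "score":   [0.5, 0.0, 0.5],   # invented scores, for illustration only
})

print(df.head())                    # eyeball a few rows
print(df.nunique())                 # unique values per column
print(df["score"].value_counts())   # distribution of the labels
print(df.isna().sum())              # missing values per column
```

On the real data, `df.nunique()` is the call that reveals how few unique anchors there are compared with the number of rows.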
We combine anchor, target, and context into a single text column so BERT can read the whole thing at once.
What we create
First 3 rows of the new input column
The artificial labels (TEXT1:, TEXT2:, ANC1:) help the model tell the three fields apart without extra code. This is prompt engineering.
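One way to build the combined column is plain string concatenation in pandas. The field order and the "; " separators below are assumptions; the only thing fixed by the description above is the three prefixes TEXT1:, TEXT2: and ANC1:.

```python
import pandas as pd

df = pd.DataFrame({
    "anchor":  ["abatement"],
    "target":  ["act of abating"],
    "context": ["A47"],
})

# Field order and separators are assumptions; only the three prefixes
# TEXT1:, TEXT2: and ANC1: come from the description above.
df["input"] = (
    "TEXT1: " + df["context"]
    + "; TEXT2: " + df["target"]
    + "; ANC1: " + df["anchor"]
)
print(df["input"].iloc[0])
```

Because the operation is vectorised, the same three lines work unchanged on all 36k rows.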
Hugging Face has its own Dataset format that's optimised for NLP tasks — it's faster and supports efficient batched transformations.
One line converts between them: Dataset.from_pandas(df)
This decision must come before tokenisation — because the tokeniser must match the vocabulary of the pretrained model you choose.
⚠️ Common mistake: Tokenising with the wrong vocabulary. Always load the tokeniser that belongs to your chosen checkpoint.
AutoTokenizer is Hugging Face's smart loader — you give it the checkpoint name and it downloads the correct vocabulary and tokenisation rules automatically.
Loading the tokeniser
This downloads the vocabulary + rules used when the original model was pretrained. Using a different tokeniser would be like giving the model a text in a different alphabet.
We apply our tokeniser to every row using .map() with batched=True, Hugging Face's batched transform, which processes all 36k rows in about 6 seconds.
What the tok_func does to each row:
attention_mask: A list of 1s and 0s telling the model which tokens are real words (1) and which are padding (0). Padding tokens should be ignored during learning.
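To see where the attention mask comes from, here is a small pure-Python sketch of padding a batch. The helper `pad_batch`, the token IDs, and the padding ID of 0 are all hypothetical; a real tokeniser does this for you and uses the padding ID defined by its checkpoint.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad (or truncate) ID lists to one length and build matching masks."""
    max_length = max_length or max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                  # truncate overly long rows
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[1, 54, 87, 2], [1, 99, 2]])   # made-up token IDs
print(batch)
```

The shorter row gains one padding ID and a trailing 0 in its mask, so both rows end up the same length and can be stacked into one tensor.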
Let's inspect row 0 after tokenisation to see exactly what the model receives.
Original text
input_ids (first 8 of 128)
In this tokeniser's vocabulary, token ID 1 = [CLS] (start of sequence) and token ID 2 = [SEP] (separator). The exact IDs depend on the checkpoint's vocabulary. These special tokens help BERT know where the input begins and ends.
Hugging Face's Trainer expects the target column to be literally named "labels". Our column is named "score" — so we rename it.
The score column contains values like 0, 0.25, 0.5, 0.75, 1.0
After renaming it becomes the labels column
The Trainer reads labels automatically and uses them to compute loss
Binary / multi-class / regression all use the same "labels" key. The model's output size (num_labels) determines whether it's treated as classification or regression.
Since our scores are continuous (0.0 to 1.0), this is a regression problem. We set num_labels=1 when building the model.
Before we run any training, we need to understand how to measure whether our model is actually learning — or just memorising.
If you only measure performance on the data the model trained on, you'll almost always see improvement, even if the model is useless on new data.
The core problem: Training loss almost always goes down. But that doesn't mean the model is learning anything generalisable. It might just be memorising the training data.
Solution: Hold back a portion of data that the model never sees during training. Measure performance on this validation set to detect overfitting early.
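A minimal sketch of holding data back, using only the standard library. The helper name `train_val_split`, the 25% validation fraction, and the seed are illustrative assumptions, not values prescribed here.

```python
import random

def train_val_split(rows, val_fraction=0.25, seed=42):
    """Shuffle row indices once and hold back a fraction for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # fixed seed: reproducible split
    n_val = int(len(rows) * val_fraction)
    val = [rows[i] for i in indices[:n_val]]
    train = [rows[i] for i in indices[n_val:]]
    return train, val

rows = list(range(100))   # stand-in for 100 dataset rows
train, val = train_val_split(rows)
print(len(train), len(val))
```

Fixing the seed matters: if the split changes between runs, validation scores from different experiments are not comparable.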
A model can fail in two opposite directions: underfitting (too simple to capture the pattern) or overfitting (memorising the training data). You need it "just right".
Training set: Model weights are updated using this data. The model learns here.
Validation set: Checked after each epoch to detect overfitting. Guides hyperparameter tuning.
Test set: Opened once at the end to report final performance. Never used during development.
On Kaggle: The test set is the competition data. The private leaderboard is only revealed after the competition closes — enforcing honest evaluation.
A good model shows: training loss ↓ and validation loss ↓ together. Danger sign: training loss keeps falling while validation loss starts rising.
A random split is fine for most tasks — but some datasets need more care.
| Dataset type | ❌ Bad split | ✓ Good split |
|---|---|---|
| Time-series (sales) | Random rows → future leaks into training | Last N weeks as validation |
| Face recognition | Random photos → same person in train & val | Group by person ID |
| Recommender systems | Random rows → same user in both sets | Group by user ID |
| Patent phrases ← Us | — | Random OK, but check anchor duplicates |
Rule: Ask "what would the model see in the real world that it never saw during training?" Design your split to simulate that.
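When the same entity appears in many rows (a person, a user, or an anchor phrase), one option is to split by group rather than by row. The sketch below assumes a hypothetical helper `group_split` and a handful of invented rows; only the first three anchor/target pairs echo examples used elsewhere in this deck.

```python
import random

def group_split(rows, key, val_fraction=0.25, seed=42):
    """Split so that every group (e.g. every anchor) lands in exactly one set."""
    groups = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_val = max(1, int(len(groups) * val_fraction))
    val_groups = set(groups[:n_val])
    train = [r for r in rows if r[key] not in val_groups]
    val = [r for r in rows if r[key] in val_groups]
    return train, val

rows = [  # mostly invented rows, for illustration only
    {"anchor": "abatement", "target": "act of abating"},
    {"anchor": "abatement", "target": "active catalyst"},
    {"anchor": "table leg", "target": "supporting member"},
    {"anchor": "table leg", "target": "wooden post"},
    {"anchor": "hinge pin", "target": "pivot shaft"},
    {"anchor": "hinge pin", "target": "door frame"},
    {"anchor": "drive belt", "target": "conveyor strap"},
    {"anchor": "drive belt", "target": "gear wheel"},
]
train, val = group_split(rows, key="anchor")
shared = {r["anchor"] for r in train} & {r["anchor"] for r in val}
print(len(train), len(val), shared)
```

An empty `shared` set confirms that no anchor leaks from training into validation, which simulates meeting unseen anchors at test time.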
⚠️ If you tune too many hyperparameters against the same validation set, you will slowly overfit to it. This is "leakage through hyperparameter search".
⚠️ A model can lower its loss every epoch while the metric stays flat. Always track and report the metric — not just the loss — when evaluating progress.
Kaggle measures our model using Pearson r — a number between −1 and +1 that measures how well our predicted scores move in the same direction as the true scores.
| r value | What it looks like in a scatter plot |
|---|---|
| +1.0 | Perfect — predictions match truth exactly |
| +0.68 | Good — clear upward trend visible |
| +0.43 | Moderate — cloud with slight slope |
| +0.20 | Weak — slope barely visible |
| 0.0 | None: no linear relationship between predictions and truth |
Pearson r measures linear correlation. It doesn't care about the absolute scale of predictions — only whether high predictions correspond to high truth values.
Let's see what r looks like with 5 prediction/truth pairs, and what happens when we add an outlier.
| Pair | True score | Predicted | Difference |
|---|---|---|---|
| 1 | 0.25 | 0.30 | +0.05 ✓ |
| 2 | 0.50 | 0.45 | −0.05 ✓ |
| 3 | 0.75 | 0.80 | +0.05 ✓ |
| 4 | 1.00 | 0.95 | −0.05 ✓ |
| 5 | 0.00 | 0.05 | +0.05 ✓ |
| +Outlier | 0.50 | 0.00 | −0.50 ✗✗✗ |
Without outlier: r ≈ 0.99 (near perfect)
With outlier: r ≈ 0.84. One bad prediction out of six pulls r down sharply
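The r values for the table can be recomputed directly from the definition of Pearson correlation. This sketch uses only the standard library; the function name `pearson_r` is our own, and the last line illustrates that rescaling the predictions leaves r unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

true = [0.25, 0.50, 0.75, 1.00, 0.00]   # pairs 1-5 from the table
pred = [0.30, 0.45, 0.80, 0.95, 0.05]

r_clean = pearson_r(true, pred)
r_outlier = pearson_r(true + [0.50], pred + [0.00])   # add the outlier row
# Rescaling predictions leaves r unchanged: only the linear trend matters.
r_rescaled = pearson_r(true, [2 * p + 1 for p in pred])

print(f"without outlier: {r_clean:.3f}")
print(f"with outlier:    {r_outlier:.3f}")
print(f"rescaled preds:  {r_rescaled:.3f}")
```

Running it on your own validation predictions is a quick way to check how much a handful of bad rows costs you.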
Because Pearson r is sensitive to large errors on individual rows, you need to make sure no single prediction is wildly wrong.
📌 Lesson: When Pearson r is your metric, eliminating large errors on any single row is more important than improving average accuracy.
TrainingArguments is a config object — it holds all your training hyperparameters in one place. Think of it as the equivalent of fastai's fit_one_cycle() options.
| Setting | What it does | Our value |
|---|---|---|
| learning_rate | How big each gradient step is | 8e-5 |
| num_train_epochs | How many times to go through all training data | 4 |
| warmup_ratio | Fraction of steps to ramp LR up slowly at start | 0.1 |
| lr_scheduler_type | How LR changes over training (cosine = smooth decay) | cosine |
| weight_decay | Regularisation — prevents individual weights getting too large | 0.01 |
| evaluation_strategy (named eval_strategy in newer transformers releases) | When to check validation performance | "epoch" (once per epoch) |
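To get a feel for what warmup plus cosine decay does to the learning rate, here is a rough pure-Python sketch using the settings from the table. The batch size of 128 and the function `lr_at` are assumptions added to make the step counts concrete, and the library's own scheduler may differ in small details.

```python
import math

# Hyperparameters from the table; batch_size is an assumption added
# only to make the step counts concrete.
hparams = {
    "learning_rate": 8e-5,
    "num_train_epochs": 4,
    "warmup_ratio": 0.1,
    "lr_scheduler_type": "cosine",
    "weight_decay": 0.01,
}
train_rows, batch_size = 27_000, 128

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * hparams["num_train_epochs"]
warmup_steps = math.ceil(total_steps * hparams["warmup_ratio"])

def lr_at(step):
    """Linear warmup to the peak rate, then cosine decay towards zero."""
    peak = hparams["learning_rate"]
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

The rate starts at zero, reaches its 8e-5 peak once warmup ends, and then glides back down. The slow start protects the pretrained weights from large early updates.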
We load the pretrained model and attach a new randomly-initialised output head sized to our task (1 output = 1 similarity score).
The pretrained layers hold valuable knowledge. We keep them and update them slowly.
The new head starts random and must learn quickly from task-specific data.
AutoModelForSequenceClassification handles all of this automatically — just pass the checkpoint name and num_labels=1.
The Trainer is the Hugging Face equivalent of fastai's Learner. It bundles everything together so training is one function call.
The Trainer needs:
Once created, training starts with trainer.train(). That's it. The Trainer handles batching, GPU transfer, gradient accumulation, logging, and checkpointing.
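Putting the pieces together might look roughly like the sketch below. The function names `build_trainer` and `pearson_metric`, the `output_dir` value, and the choice of `DataCollatorWithPadding` for batch padding are our assumptions; `checkpoint`, `train_ds`, and `eval_ds` are placeholders for your own checkpoint name and tokenised datasets. The evaluation-schedule argument is left out because its name differs between transformers releases.

```python
import numpy as np

def pearson_metric(eval_pred):
    """Metric for the Trainer: Pearson r between predictions and labels."""
    predictions, labels = eval_pred
    r = np.corrcoef(np.ravel(predictions), np.ravel(labels))[0, 1]
    return {"pearson": float(r)}

def build_trainer(checkpoint, train_ds, eval_ds, output_dir="outputs"):
    """Assemble model, arguments, data and metric into one Trainer."""
    # Imported here so that defining this function does not require the
    # (heavy) transformers library to be installed.
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, DataCollatorWithPadding,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # num_labels=1 gives a single regression output (the similarity score).
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=1)
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=8e-5,
        num_train_epochs=4,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        weight_decay=0.01,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=eval_ds,
        data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch
        compute_metrics=pearson_metric,
    )
```

Usage would then be `trainer = build_trainer(...)` followed by `trainer.train()`, with `trainer.evaluate()` reporting the validation loss and Pearson r.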
One epoch through roughly 27,000 training rows takes about 5 minutes on a free Kaggle GPU. During training, you see a live log like this:
📈 Both training loss and validation loss are decreasing together — the model is generalising, not overfitting. This is exactly what we want to see.
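This result is remarkable. After only a few minutes of training per epoch, the model achieves strong performance. The reason is transfer learning: the pretrained weights already encode most of what the model needs to know about language.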
Before deploying any NLP model, you must ask these questions about your data and use case.
Potential misuse risks
Required practices
Patent data question: Who owns the text in a patent corpus? Training on publicly-filed patents is generally allowed, but check jurisdiction-specific rules before deploying commercially.
Model card: When you submit your project, document what your model can and cannot do, where it might fail, and how it should — and shouldn't — be used.