Week 5 · Assessment Quiz

NLP, Transformers & Hugging Face

25 multiple-choice questions on tokenisation, Transformers, BERT, GPT and the Hugging Face ecosystem, plus 5 short-answer questions.

📋 30 questions total ⭐ 30 marks 🕐 No time limit 🔒 Answers not revealed

Q1

Tokenisation in NLP is the process of:

A. Compressing text into a binary format for efficient storage
B. Splitting raw text into smaller units (tokens) that the model can process numerically
C. Translating text from one language to another
D. Removing stopwords and punctuation from input text

Q2

A vocabulary in the context of an NLP model refers to:

A. The total number of unique sentences in the training corpus
B. The linguistic rules the model uses to parse grammar
C. The fixed set of tokens (words or subwords) the model knows, each assigned a unique integer ID
D. The list of all named entities recognised by the model
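As background for the terms used in this question, a minimal sketch of a token-to-ID lookup is given below. The toy vocabulary, the `[UNK]` fallback, and the `encode` helper are assumptions made for illustration; they are not the API of any particular library.

```python
# Toy vocabulary: a fixed set of tokens, each with a unique integer ID.
# The tokens and IDs are made up for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3, "!": 4}


def encode(tokens, vocab):
    """Map each token to its ID, falling back to [UNK] for unknown tokens."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in tokens]


ids = encode(["hello", "world", "again"], vocab)
print(ids)  # [2, 3, 1]
```

The word "again" is not in the toy vocabulary, so it maps to the `[UNK]` ID.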

Q3

A word embedding represents a word as:

A. A one-hot binary vector with a 1 at the word's vocabulary index
B. A dense real-valued vector that encodes semantic meaning in a continuous space
C. The ASCII codes of each character in the word
D. A probability distribution over all possible next words

Q4

The Transformer architecture is primarily built around:

A. Self-attention mechanisms that allow each token to attend to all other tokens in the sequence
B. Recurrent connections that process tokens one at a time from left to right
C. Convolutional filters that detect local n-gram patterns in text
D. Hopfield networks with associative memory retrieval

Q5

BERT stands for:

A. Bidirectional Encoder Representations from Text
B. Binary Encoded Recurrent Transformer
C. Batch-Enhanced Representation Training
D. Bidirectional Encoder Representations from Transformers

Q6

In Hugging Face, pipeline("sentiment-analysis"):

A. Returns the raw token IDs for the input text
B. Loads a pre-trained sentiment model and tokenizer and runs end-to-end inference in one call
C. Fine-tunes a language model on a sentiment dataset you provide
D. Generates text continuations expressing a given sentiment

Q7

Fine-tuning a pre-trained language model means:

A. Continuing to train the pre-trained model on a smaller task-specific dataset with a lower learning rate
B. Training a new model from random weights on your domain data
C. Freezing the entire model and only changing its tokenizer for your dataset
D. Evaluating the model without any gradient updates

Q8

Subword tokenisation (e.g. Byte-Pair Encoding) is preferred over word-level tokenisation because:

A. It produces shorter sequences that are faster to process
B. It avoids the need for positional encoding
C. It handles rare and out-of-vocabulary words by splitting them into known subword pieces
D. It requires a smaller vocabulary, so embeddings take less memory
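To make the idea of subword pieces concrete, here is a small sketch of splitting a word against a fixed subword vocabulary. It uses greedy longest-match-first splitting, which is a simplification chosen for brevity and is not the Byte-Pair Encoding merge procedure itself. The vocabulary and the `split_into_subwords` helper are hypothetical.

```python
def split_into_subwords(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of `word` into pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts at this position: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical subword vocabulary, for illustration only.
subword_vocab = {"token", "is", "ation", "un", "believ", "able"}

print(split_into_subwords("tokenisation", subword_vocab))  # ['token', 'is', 'ation']
print(split_into_subwords("unbelievable", subword_vocab))  # ['un', 'believ', 'able']
print(split_into_subwords("xyz", subword_vocab))           # ['[UNK]']
```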

Q9

In BERT, the [CLS] token is used as:

A. A separator between two sentence segments
B. An aggregate representation of the entire input, often used for classification tasks
C. A padding token to fill sequences to a fixed length
D. A mask indicator for the masked language modelling pre-training objective

Q10

The attention mechanism allows a Transformer to:

A. Dynamically weight the importance of every other token in the sequence when computing a representation for each token
B. Process sequences of unlimited length without any positional encoding
C. Replace recurrent connections with convolutional filters
D. Apply dropout specifically to the attention weights during training
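For reference, scaled dot-product attention can be sketched in plain Python as below. This is a minimal illustration under simplifying assumptions: a single head, no learned query/key/value projections, and no masking. The `softmax` and `attention` helpers and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors of dimension 2, used directly as Q, K and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and gives one token's weighting over all tokens in the sequence.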

Q11

Positional encoding is added to token embeddings in Transformers to:

A. Reduce the dimensionality of the embeddings before the attention computation
B. Ensure all token embeddings have unit length
C. Give the model information about the position of each token, since self-attention is order-agnostic
D. Initialise the query, key, and value weight matrices
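One common form of positional encoding is the fixed sinusoidal scheme, sketched below in plain Python. This is an illustrative assumption about which scheme is meant; some models (BERT among them) learn their position embeddings instead. The helper name and toy embeddings are hypothetical.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table: sin on even indices, cos on odd indices."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# The same toy token embedding at two different positions.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
pe = sinusoidal_positional_encoding(seq_len=2, d_model=4)

# The encoding is added element-wise to the token embeddings.
inputs = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
print(inputs[0] != inputs[1])  # True
```

After the addition, the two identical embeddings are no longer identical, so the position is recoverable from the input vectors.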

Q12

AutoTokenizer.from_pretrained("bert-base-uncased") loads:

A. A pre-trained BERT model with its weights
B. The tokenizer (vocabulary and tokenisation rules) associated with bert-base-uncased
C. A configuration file describing the BERT architecture only
D. A fine-tuned BERT model trained on the GLUE benchmark

Q13

Transfer learning in NLP involves:

A. Using a large pre-trained language model (e.g. BERT, GPT) as a starting point and adapting it to a specific task
B. Training a model from scratch on domain-specific text data
C. Translating training data into multiple languages to expand the dataset
D. Using the output of one NLP model as input features for a classical ML classifier

Q14

A language model is a model that:

A. Translates text between languages using parallel corpora
B. Classifies text into categories using labelled examples
C. Assigns probabilities to sequences of tokens, effectively learning the statistical structure of language
D. Detects the language of an input text and routes it to the appropriate classifier

Q15

AutoModelForSequenceClassification is used for:

A. Generating long-form text responses to prompts
B. Loading a Transformer model with a classification head on top for tasks like sentiment analysis
C. Performing named entity recognition with span-level predictions
D. Tokenising text into subword units

Q16

In Hugging Face tokenizers, padding is used to:

A. Increase the vocabulary size to handle rare words
B. Speed up training by reducing the sequence length
C. Make sequences in a batch the same length by adding special padding tokens
D. Apply data augmentation by inserting random tokens

Q17

An encoder-only Transformer model (e.g. BERT) is best suited for:

A. Understanding tasks such as classification, NER, and question answering that require representing the full input
B. Generating open-ended text continuations from a prompt
C. Machine translation where the output language differs from the input
D. Image captioning where visual features are combined with text

Q18

A token ID is:

A. The character-level ASCII code of the first letter of a token
B. The unique integer assigned to a token in the model's vocabulary
C. The position of the token within the input sequence
D. The confidence score the model assigns to that token being correct

Q19

The attention mask in Hugging Face tokenizers tells the model:

A. Which heads in multi-head attention to use for each layer
B. The temperature parameter used in softmax during generation
C. Which tokens are real input (1) and which are padding (0), so padding does not influence the attention computation
D. The positional encoding values for each token in the sequence
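The relationship between padding and the attention mask can be sketched without any library, as below. The `pad_batch` helper, the pad ID of 0, the right-side padding, and the token IDs are all assumptions for illustration; the dictionary keys simply mirror the names `input_ids` and `attention_mask` used elsewhere in this quiz.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest length and build matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of made-up token IDs with different lengths.
batch = pad_batch([[5, 6, 7], [5, 6, 8, 9, 7]])
print(batch["input_ids"])       # [[5, 6, 7, 0, 0], [5, 6, 8, 9, 7]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```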

Q20

GPT-style (decoder-only) models are primarily designed for:

A. Autoregressive text generation: predicting the next token given all previous tokens
B. Bidirectional understanding of full input sequences
C. Sentence-pair classification tasks like natural language inference
D. Image-text contrastive learning

Q21

BERT is pre-trained using:

A. Next sentence prediction only: predicting which sentence follows a given one
B. Masked language modelling (predicting randomly masked tokens) and next sentence prediction
C. Autoregressive language modelling: predicting the next word from left to right
D. Contrastive learning between matched and mismatched sentence pairs

Q22

model.eval() before inference in PyTorch/Hugging Face:

A. Loads the model weights from disk before running predictions
B. Freezes the model so its weights cannot be updated
C. Switches the model to evaluation mode, disabling dropout and using running statistics for batch normalisation
D. Converts the model to half precision (float16) for faster inference

Q23

What does tokenizer("Hello world", return_tensors="pt") return?

A. A plain Python list of token strings
B. A dictionary containing input_ids, attention_mask, and possibly token_type_ids as PyTorch tensors
C. A single integer representing the sentence embedding
D. The model's predicted label for the input text

Q24

Named Entity Recognition (NER) is the task of:

A. Identifying and classifying spans of text as entities such as person names, locations, and organisations
B. Predicting the next word in a sequence of text
C. Ranking documents by relevance to a given query
D. Determining whether two sentences express the same meaning

Q25

The key architectural difference between RNNs and Transformers is that Transformers:

A. Process fewer tokens per second and require larger datasets to train
B. Cannot handle variable-length input sequences
C. Process all tokens in parallel using self-attention rather than sequentially, enabling much better parallelisation
D. Use convolutional layers instead of matrix multiplications

Answer each question in 2–4 sentences. Precise technical language is expected. Code snippets are welcome where relevant.

Q26

Explain what tokenisation is and why subword tokenisation (such as Byte-Pair Encoding) is preferred over splitting on whitespace alone.

Q27

Describe the self-attention mechanism in a Transformer. What are queries, keys, and values, and how are they used to compute the attention output?

Q28

What is the difference between BERT and GPT in terms of architecture (encoder-only vs decoder-only) and the tasks each is best suited for?

Q29

Show how you would use the Hugging Face pipeline API to perform sentiment analysis on a list of sentences. What does the output look like?

Include a brief code sketch in your answer.

Q30

Explain what fine-tuning a pre-trained language model means. What data is needed, what is trained, and why is it more efficient than training from scratch?

Complete all 30 questions, then click Submit. Your MCQ score (out of 25) will be shown. Short answers are marked separately.

Multiple Choice (25 marks)

Short Answer (5 marks — marked by lecturer)