The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" (2017). It revolutionized NLP by enabling parallel processing of sequences and capturing long-range dependencies without recurrence. This visualization shows the complete architecture with both encoder and decoder components.
Tokens are first converted to vector representations (embeddings). Then, positional encodings are added to incorporate information about the position of each token in the sequence.
Token Embedding: Maps each token to a vector (typically 512 to 1024 dimensions)
Positional Encoding: Uses sine and cosine functions of different frequencies
Combined: Token Embedding + Positional Encoding → Final Embedding
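The combination above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the sinusoidal scheme with base 10000 from the original paper; the function names and the toy embedding values are made up for the example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even feature indices, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Each sine/cosine pair uses its own frequency.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(token_embeddings):
    """Final embedding = token embedding + positional encoding (elementwise)."""
    seq_len, d_model = len(token_embeddings), len(token_embeddings[0])
    pe = positional_encoding(seq_len, d_model)
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(token_embeddings, pe)]

# Toy input: 3 tokens with 4-dimensional embeddings (illustrative values).
embeddings = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
final = add_positions(embeddings)
```

The three tokens start with identical embeddings, but each row of `final` differs because the position signal has been added.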
Self-attention allows the model to weigh the importance of different tokens when encoding each token. Multi-head attention performs this process in parallel with different learned projections.
For each token: Create query (Q), key (K), and value (V) vectors
Attention scores: Calculate how much to attend to each token (Q·Kᵀ, scaled by 1/√d_k and normalized with softmax)
Multiple heads: 8-16 parallel attention mechanisms focusing on different aspects
Output: Weighted sum of value vectors, using the softmax-normalized attention scores as weights
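The steps above can be sketched for a single attention head as follows. This is a simplified illustration: the learned Q, K, and V projections are assumed to be the identity, the scores are scaled by the square root of the key dimension as in the original paper, and the helper names are chosen for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key: q . k / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Self-attention: queries, keys, and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
```

A multi-head version would run several such computations in parallel, each with its own learned projections, and concatenate the results.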
Cross-attention allows the decoder to focus on relevant parts of the input sequence. It uses queries from the decoder and keys/values from the encoder.
Queries: From the decoder (current token being processed)
Keys and Values: From the encoder output (input sequence representation)
Mechanism: Similar to self-attention, but connects encoder to decoder
Benefit: Lets each decoding step draw on the input positions most relevant to the token being generated
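As a rough sketch of the mechanism described above, the following computes cross-attention weights with queries taken from decoder states and keys/values taken from encoder outputs. Learned projections are omitted (assumed identity) and all vectors are toy values chosen so the effect is easy to see.

```python
import math

def cross_attention(decoder_states, encoder_outputs):
    """Return (weights, contexts): queries from the decoder, keys/values from the encoder."""
    d_k = len(encoder_outputs[0])
    all_weights, contexts = [], []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_outputs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Context vector: weighted sum of encoder outputs for this decoder position.
        context = [sum(w * v[j] for w, v in zip(weights, encoder_outputs))
                   for j in range(d_k)]
        all_weights.append(weights)
        contexts.append(context)
    return all_weights, contexts

encoder_outputs = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # 3 input positions
decoder_states = [[1.0, 0.0], [0.0, 1.0]]               # 2 output positions
weights, contexts = cross_attention(decoder_states, encoder_outputs)
```

Each row of `weights` has one entry per input position, so the output length (2) and input length (3) do not need to match.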
Each position is processed independently by a two-layer feed-forward network, applying non-linear transformations to the attention outputs.
Structure: Two linear transformations with a ReLU activation in between
First projection: Input dimension → Larger dimension (typically 4x input size)
Second projection: Larger dimension → Original input dimension
Applied: Independently to each position in the sequence
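The structure above might look like this in plain Python. The weights are random and untrained, the sizes are toy values (model dimension 2, hidden dimension 8), and the helper names are illustrative.

```python
import random

def linear(x, weight, bias):
    """Compute x @ weight + bias for one vector; weight has shape [in_dim][out_dim]."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(sequence, w1, b1, w2, b2):
    """Apply the same two-layer network to every position independently."""
    return [linear(relu(linear(x, w1, b1)), w2, b2) for x in sequence]

# Toy sizes: the hidden layer is 4x the model dimension.
rng = random.Random(0)
d_model = 2
d_ff = 4 * d_model
w1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

sequence = [[1.0, -0.5], [0.3, 2.0], [0.0, 0.0]]
output = feed_forward(sequence, w1, b1, w2, b2)
```

Because no position looks at any other, running the network on a single token gives the same vector as running it on that token inside a longer sequence.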
Normalizes the inputs across the feature dimension, helping stabilize and accelerate training by keeping the scale of activations consistent from layer to layer (an effect often described as reducing internal covariate shift).
Function: Normalizes each token's embedding to have mean=0 and variance=1
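Formula: LayerNorm(x) = γ * (x - μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance over the feature dimension
Placement: Applied before the main operation in each sublayer (the pre-norm variant); the original paper applied it after the residual addition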
Benefit: Stabilizes training and reduces dependence on careful initialization
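A minimal sketch of layer normalization for one token vector, assuming the common form in which ε is added to the variance inside the square root. The default `eps` value is an assumption for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to mean 0 and variance 1, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
# gamma = 1 and beta = 0 leave the normalized values unchanged.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

In a trained model, γ and β are learned per feature, so the network can undo or rescale the normalization where that helps.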
Allows information to flow directly from earlier layers to later layers, helping combat vanishing gradients and enabling training of deeper networks.
Formula: Output = Sublayer(x) + x
Benefit: Helps gradient flow during backpropagation
Implementation: Add the input to the output of each sublayer
Usage: Applied around both self-attention and feed-forward networks
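The residual formula can be written as a small wrapper around any sublayer. This is an illustrative sketch; `residual` and the two stand-in sublayers are hypothetical names, not part of any library.

```python
def residual(sublayer, x):
    """Output = Sublayer(x) + x, added elementwise."""
    return [s + xi for s, xi in zip(sublayer(x), x)]

def zero_sublayer(x):
    # Stand-in for a sublayer that contributes nothing.
    return [0.0] * len(x)

def double_sublayer(x):
    # Stand-in for an arbitrary transformation.
    return [2.0 * v for v in x]

x = [1.0, 2.0]
passthrough = residual(zero_sublayer, x)
transformed = residual(double_sublayer, x)
```

Even when the sublayer outputs zeros, the input passes through unchanged, which is the direct path that lets gradients reach earlier layers.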
Projects the decoder output to the vocabulary space, preparing for the prediction of the next token.
Input: Decoder output vectors (dimension = model size)
Output: Logits for each token in vocabulary (dimension = vocabulary size)
Parameters: Weight matrix of size [model_size × vocabulary_size]
Note: Often shares weights with the embedding layer
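A small sketch of the projection with tied weights, assuming the embedding matrix has shape [vocabulary_size × model_size] so that its transpose serves as the output weight matrix. The vocabulary and values are toy data.

```python
def project_to_vocab(hidden, embedding_matrix):
    """Logit for token t = dot(hidden, embedding of t): the embedding matrix
    is reused (transposed) as the output projection, i.e. weight tying."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding_matrix]

# Toy setup: vocabulary of 3 tokens, model size 2.
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, 2.0]  # one decoder output vector
logits = project_to_vocab(hidden, embedding_matrix)
```

The result has one logit per vocabulary entry. Without weight tying, a separate [model_size × vocabulary_size] matrix would be learned instead.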
Converts logits into a probability distribution over the vocabulary, allowing the model to predict the most likely next token.
Function: Converts raw scores to probabilities
Formula: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
Output: Probability distribution (all values sum to 1)
Usage: During training, the predicted distribution is compared with the true next token to compute the loss (typically cross-entropy)
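The softmax step and its use in a training loss can be sketched as below. Cross-entropy is assumed as the loss here, and the logits are made-up values.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the true next token."""
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.1]   # one score per vocabulary entry
probs = softmax(logits)
loss = cross_entropy(probs, target_index=0)
```

The loss shrinks toward zero as the probability assigned to the true token approaches 1.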
1. Parallelization: Unlike RNNs, transformers process all tokens simultaneously, enabling much faster training.
2. Long-range dependencies: Self-attention directly connects any two positions in a sequence, regardless of their distance.
3. Multi-head attention: Multiple attention mechanisms allow the model to focus on different aspects of the input simultaneously.
4. Positional encoding: Since the model has no inherent notion of sequence order, positional information is added explicitly.
Convert text to tokens, each representing a word, subword, or character.
Map tokens to vectors and add positional information.
Process input sequence through multiple layers of self-attention and feed-forward networks.
Generate output tokens one at a time, using masked self-attention over the tokens generated so far and cross-attention over the encoder's output.
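The final step, generating one token at a time, can be illustrated with a greedy decoding loop. Greedy selection is an assumption made for simplicity (sampling or beam search are also common), and `toy_decoder` is a hypothetical stand-in for a real decoder: it simply copies the encoder's token ids and then emits an end-of-sequence token.

```python
def greedy_decode(next_token_logits, encoder_output, bos_id, eos_id, max_len=10):
    """Repeatedly pick the highest-scoring next token until EOS or max_len."""
    tokens = [bos_id]
    for _ in range(max_len):
        # The decoder sees everything generated so far plus the encoder output.
        logits = next_token_logits(tokens, encoder_output)
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

def toy_decoder(tokens, encoder_output):
    """Hypothetical stand-in model: copies the input ids, then emits EOS (id 1)."""
    vocab_size = 5
    step = len(tokens) - 1
    target = encoder_output[step] if step < len(encoder_output) else 1
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

result = greedy_decode(toy_decoder, encoder_output=[3, 4, 2], bos_id=0, eos_id=1)
```

Here `result` starts with the begin-of-sequence id, repeats the three input ids, and ends with the end-of-sequence id.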