Interactive Transformer Architecture Visualization

The Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" (2017). It revolutionized NLP by enabling parallel processing of sequences and capturing long-range dependencies without recurrence. This visualization shows the complete architecture with both encoder and decoder components.

[Architecture diagram. Encoder side: Input Sequence → Input Embeddings + Positional Encoding → Encoder. Decoder side: Output Sequence → Output Embeddings + Positional Encoding → Decoder → Linear Layer → Softmax.]

Token & Positional Embeddings

Tokens are first converted to vector representations (embeddings). Then, positional encodings are added to incorporate information about the position of each token in the sequence.

Token Embedding: Maps each token to a vector (typically 512 to 1024 dimensions)

Positional Encoding: Uses sine and cosine functions of different frequencies

Combined: Token Embedding + Positional Encoding → Final Embedding
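
The embedding step above can be sketched in plain Python. This is a minimal illustration, assuming the sinusoidal scheme from the original paper (base 10000); the function names `positional_encoding` and `embed` and the tiny embedding table in the usage note are hypothetical, not part of the visualization.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model list of sinusoidal position vectors."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Each pair of dimensions (i, i+1) shares one frequency.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def embed(token_ids, embedding_table):
    """Look up each token's vector and add the encoding for its position."""
    d_model = len(embedding_table[0])
    pe = positional_encoding(len(token_ids), d_model)
    return [
        [e + p for e, p in zip(embedding_table[tok], pe[pos])]
        for pos, tok in enumerate(token_ids)
    ]
```

For example, with a toy two-token vocabulary, `embed([1, 0], [[0.0] * 4, [1.0] * 4])` returns one 4-dimensional vector per input token. Real models learn the embedding table during training.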

Multi-Head Self-Attention

Self-attention allows the model to weigh the importance of different tokens when encoding each token. Multi-head attention performs this process in parallel with different learned projections.

For each token: Create query (Q), key (K), and value (V) vectors

Attention scores: Calculate how much to attend to each token: softmax(Q·Kᵀ / √d_k), where d_k is the key dimension

Multiple heads: 8-16 parallel attention mechanisms focusing on different aspects

Output: Weighted sum of value vectors based on attention scores
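
The steps above can be sketched as scaled dot-product attention in plain Python. This is a simplified illustration, not a full implementation: the learned Q/K/V and output projections are omitted, and `multi_head_self_attention` simply slices the feature dimension into heads as a stand-in for those projections. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_self_attention(X, num_heads):
    """Assumption: heads are plain feature slices (no learned projections)."""
    d_model = len(X[0])
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        Xh = [row[lo:hi] for row in X]
        head_outputs.append(attention(Xh, Xh, Xh))
    # Concatenate the per-head outputs back to d_model features per token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
```

The output has the same shape as the input: one `d_model`-sized vector per token. In a trained model, each head would apply its own learned projection before attending.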

Cross-Attention

Cross-attention allows the decoder to focus on relevant parts of the input sequence. It uses queries from the decoder and keys/values from the encoder.

Queries: From the decoder (current token being processed)

Keys and Values: From the encoder output (input sequence representation)

Mechanism: Similar to self-attention, but connects encoder to decoder

Benefit: Allows decoder to focus on relevant parts of the input sequence
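
A minimal sketch of the cross-attention data flow, in plain Python. It assumes decoder states and encoder outputs share the same dimension and, for brevity, skips the learned projections a real model would apply. The name `cross_attention` is hypothetical.

```python
import math

def cross_attention(decoder_states, encoder_outputs):
    """Each decoder position attends over all encoder positions.

    Returns a list of (weights, context) pairs, one per decoder position.
    """
    d_k = len(encoder_outputs[0])
    results = []
    for q in decoder_states:  # queries come from the decoder
        # Keys come from the encoder output.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_outputs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Values also come from the encoder output.
        context = [sum(w * v[j] for w, v in zip(weights, encoder_outputs))
                   for j in range(d_k)]
        results.append((weights, context))
    return results
```

The number of results follows the decoder sequence length, while each weight vector has one entry per encoder position.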

Feed-Forward Network

Each position is processed independently by a two-layer feed-forward network, applying non-linear transformations to the attention outputs.

Structure: Two linear transformations with a ReLU activation in between

First projection: Input dimension → Larger dimension (typically 4x input size)

Second projection: Larger dimension → Original input dimension

Applied: Independently to each position in the sequence
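
The structure above can be sketched in plain Python as follows. The helper names (`linear`, `relu`, `feed_forward`, `init_ffn`) and the random initialization are illustrative assumptions; a real model learns these weights.

```python
import random

def linear(x, W, b):
    """y = xW + b for one vector x; W has shape [len(x)][len(b)]."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Expand, apply ReLU, then project back to the input dimension."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

def init_ffn(d_model, expansion=4, seed=0):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    rng = random.Random(seed)
    d_ff = expansion * d_model
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    return W1, [0.0] * d_ff, W2, [0.0] * d_model
```

Applying it to a sequence is a per-position map, for example `[feed_forward(tok, *params) for tok in sequence]`, so identical input vectors produce identical outputs regardless of where they sit in the sequence.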

Layer Normalization

Normalizes each token's activations across the feature dimension, helping stabilize and accelerate training (an effect often attributed to reducing internal covariate shift).

Function: Normalizes each token's embedding to have mean=0 and variance=1

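Formula: LayerNorm(x) = γ * (x - μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance of x over the feature dimension, and γ and β are learned scale and shift parameters

Placement: Applied after the residual addition in the original paper (Post-LN); many later models apply it before each sublayer instead (Pre-LN)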

Benefit: Stabilizes training and reduces dependence on careful initialization
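
A minimal sketch of layer normalization for a single token vector, in plain Python. It uses the common form that divides by the square root of the variance plus ε; the function name and the default `eps` value are assumptions for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's features to mean 0 and variance ~1,
    then apply the learned scale (gamma) and shift (beta)."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]
```

With `gamma` set to all ones and `beta` to all zeros, the output has a mean of approximately 0 and a variance just under 1 (the small gap comes from ε).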

Residual Connections

Allows information to flow directly from earlier layers to later layers, helping combat vanishing gradients and enabling training of deeper networks.

Formula: Output = Sublayer(x) + x

Benefit: Helps gradient flow during backpropagation

Implementation: Add the input to the output of each sublayer

Usage: Applied around every sublayer: self-attention, cross-attention (in the decoder), and feed-forward networks
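
The residual formula can be sketched as a small wrapper in plain Python. The name `residual` is hypothetical, and `sublayer` stands for any function that maps a vector to a vector of the same length.

```python
def residual(sublayer, x):
    """Return sublayer(x) + x, element-wise, for a single vector x."""
    return [s + xi for s, xi in zip(sublayer(x), x)]

# If the sublayer contributes nothing, the input passes through unchanged,
# which is the direct path that helps gradients flow in deep stacks.
passthrough = residual(lambda v: [0.0] * len(v), [1.0, 2.0])
```

Because the addition is element-wise, the sublayer output must have the same dimension as its input.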

Linear Layer

Projects the decoder output to the vocabulary space, preparing for the prediction of the next token.

Input: Decoder output vectors (dimension = model size)

Output: Logits for each token in vocabulary (dimension = vocabulary size)

Parameters: Weight matrix of size [model_size × vocabulary_size]

Note: Often shares weights with the embedding layer
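
A minimal sketch of the output projection with weight sharing, in plain Python. It assumes the projection matrix is the transpose of the embedding table, so each logit is the dot product of the decoder vector with one token's embedding; the bias term is omitted and the function name is hypothetical.

```python
def output_logits(hidden, embedding_table):
    """Project one decoder vector (length = model size) to vocabulary logits.

    embedding_table has one row per vocabulary token, so the result has
    one logit per token: logit[v] = hidden . embedding_table[v].
    """
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding_table]
```

For a vocabulary of three tokens and a model size of two, `output_logits([2.0, 3.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` yields three logits, one per token.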

Softmax Layer

Converts logits into a probability distribution over the vocabulary, allowing the model to predict the most likely next token.

Function: Converts raw scores to probabilities

Formula: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

Output: Probability distribution (all values sum to 1)

Usage: During training, the predicted distribution is compared with the true next token to compute the cross-entropy loss
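
A short sketch of softmax and the training loss for one position, in plain Python. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick and does not change the result. Cross-entropy is assumed as the loss, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true next token."""
    return -math.log(softmax(logits)[target_index])
```

The loss is small when the model puts high probability on the true token and grows as that probability shrinks.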

Key Innovations of the Transformer Architecture

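1. Parallelization: Unlike RNNs, transformers process all tokens of a sequence simultaneously during training, enabling much faster training. (At inference time, the decoder still generates output tokens one at a time.)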

2. Long-range dependencies: Self-attention directly connects any two positions in a sequence, regardless of their distance.

3. Multi-head attention: Multiple attention mechanisms allow the model to focus on different aspects of the input simultaneously.

4. Positional encoding: Since the model has no inherent notion of sequence order, positional information is added explicitly.

Step 1. Tokenization: Convert text to tokens, each representing a word, subword, or character.

Step 2. Embedding: Map tokens to vectors and add positional information.

Step 3. Encoder Processing: Process input sequence through multiple layers of self-attention and feed-forward networks.

Step 4. Decoder Processing: Generate output tokens one at a time, using self-attention and cross-attention to the encoder's output.
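
The one-token-at-a-time generation in the decoder step can be sketched as a greedy decoding loop in plain Python. Everything here is illustrative: `greedy_decode` is a hypothetical helper, and `toy_model` is a stand-in for a trained decoder that simply copies the source token IDs and then emits an end-of-sequence token.

```python
def greedy_decode(next_token_logits, encoder_output, bos_id, eos_id, max_len=20):
    """Repeatedly pick the highest-scoring next token until EOS or max_len."""
    output = [bos_id]
    for _ in range(max_len):
        # The model sees the encoder output and everything generated so far.
        logits = next_token_logits(encoder_output, output)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        output.append(next_id)
        if next_id == eos_id:
            break
    return output

def toy_model(source_ids, generated, vocab_size=5, eos_id=1):
    """Toy stand-in (not a real Transformer): copy the source, then emit EOS."""
    step = len(generated) - 1
    target = source_ids[step] if step < len(source_ids) else eos_id
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

result = greedy_decode(toy_model, [3, 4, 2], bos_id=0, eos_id=1)
```

Greedy selection is only one decoding strategy; sampling or beam search can replace the `max` step without changing the surrounding loop.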