Transformer Architecture Visualization

Token & Positional Embeddings

Tokens are first converted to vector representations (embeddings). Then, positional encodings are added to incorporate information about the position of each token in the sequence.

Token Embedding: Maps each token to a vector (512 to 1024 dimensions in the original Transformer; often larger in later models)

Positional Encoding: Uses sine and cosine functions of different frequencies

Combined: Token Embedding + Positional Encoding → Final Embedding
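
The combination step above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes the sinusoidal scheme with a base of 10000 (the value used in the original Transformer paper, not stated on this page), and the function names `positional_encoding` and `embed` as well as the toy embedding table are made up for the example. Plain Python lists are used so the snippet has no dependencies.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            # The base 10000 is an assumption taken from the original paper.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def embed(token_ids, embedding_table):
    """Look up each token's vector and add the encoding for its position."""
    d_model = len(embedding_table[0])
    pe = positional_encoding(len(token_ids), d_model)
    return [
        [e + p for e, p in zip(embedding_table[tok], pe[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Toy table: vocabulary of 3 tokens, model dimension 4 (values are made up).
table = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
final = embed([2, 0, 1], table)  # one 4-dimensional vector per input token
```

Because the encoding depends only on position, the same token receives a different final embedding at different places in the sequence.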

Multi-Head Self-Attention

Self-attention allows the model to weigh the importance of different tokens when encoding each token. Multi-head attention performs this process in parallel with different learned projections.

For each token: Create query (Q), key (K), and value (V) vectors

Attention scores: Scaled dot products between queries and keys (Q·K^T / √d_k), passed through a softmax to give attention weights

Multiple heads: 8-16 parallel attention mechanisms focusing on different aspects

Output: Weighted sum of value vectors based on attention scores
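
The steps above can be sketched for a single head as follows. This is an illustrative sketch, not a full multi-head layer: the helper names (`matmul`, `attention`, `self_attention_head`) are hypothetical, the projection matrices are set to the identity purely for readability, and the scores use the usual scaled form softmax(Q·K^T / √d_k).

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def self_attention_head(x, w_q, w_k, w_v):
    """One head: project the same sequence x into Q, K and V, then attend."""
    return attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

# Toy input: 3 tokens, model dimension 2; identity projections (an assumption
# made for readability, real projections are learned).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention_head(x, identity, identity, identity)
```

A multi-head layer would run several such heads with different learned projections, concatenate their outputs, and apply one more linear projection.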

Cross-Attention

Cross-attention allows the decoder to focus on relevant parts of the input sequence. It uses queries from the decoder and keys/values from the encoder.

Queries: From the decoder (current token being processed)

Keys and Values: From the encoder output (input sequence representation)

Mechanism: Similar to self-attention, but connects encoder to decoder

Benefit: Allows decoder to focus on relevant parts of the input sequence
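
To make the query/key/value split concrete, here is a small sketch in which the queries come from two decoder positions and the keys and values come from three encoder positions. The function name `cross_attention` and all numbers are illustrative assumptions, and the learned projections are omitted to keep the example short.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Each decoder position attends over all encoder positions."""
    d_k = len(encoder_keys[0])
    outputs = []
    for q in decoder_queries:
        weights = softmax([dot(q, k) / math.sqrt(d_k) for k in encoder_keys])
        # Weighted sum of encoder value vectors for this decoder position.
        mixed = [sum(w * v[j] for w, v in zip(weights, encoder_values))
                 for j in range(len(encoder_values[0]))]
        outputs.append(mixed)
    return outputs

# Toy shapes: 2 decoder positions, 3 encoder positions (made-up values).
decoder_q = [[10.0, 0.0], [0.0, 10.0]]
encoder_k = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
encoder_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context = cross_attention(decoder_q, encoder_k, encoder_v)
```

The output has one row per decoder position, not per encoder position: the first query aligns with the first key, so its context vector is dominated by the first encoder value.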

Feed-Forward Network

Each position is processed independently by a two-layer feed-forward network, applying non-linear transformations to the attention outputs.

Structure: Two linear transformations with a ReLU activation in between

First projection: Input dimension → Larger dimension (typically 4x input size)

Second projection: Larger dimension → Original input dimension

Applied: Independently to each position in the sequence
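
The structure above can be sketched as two linear maps around a ReLU. The sizes (model dimension 2, hidden dimension 8) and the weights are toy values chosen for the example, and the helper names are hypothetical.

```python
def linear(x, weights, bias):
    """Compute x @ weights + bias for one vector x; weights is (in x out)."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weights), bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden size, apply ReLU, project back to the model size."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

def position_wise(sequence, w1, b1, w2, b2):
    """Apply the same network to every position independently."""
    return [feed_forward(x, w1, b1, w2, b2) for x in sequence]

# Toy sizes: model dimension 2, hidden dimension 8 (4x), made-up weights.
d_model, d_ff = 2, 8
w1 = [[0.5 if (i + j) % 2 == 0 else -0.5 for j in range(d_ff)]
      for i in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[0.25] * d_model for _ in range(d_ff)]
b2 = [0.0] * d_model
out = position_wise([[1.0, 2.0], [3.0, -1.0]], w1, b1, w2, b2)
```

Since the same weights are applied to each position separately, the result for a given position does not change when other positions are added to or removed from the sequence.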

Layer Normalization

Normalizes each token's activations across the feature dimension, which helps stabilize and accelerate training.

Function: Normalizes each token's embedding to have mean=0 and variance=1

Formula: LayerNorm(x) = γ * (x - μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance of x over the feature dimension

Placement: Applied before each sublayer in pre-norm variants; the original Transformer applies it after the residual addition

Benefit: Stabilizes training and reduces dependence on careful initialization
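
A short sketch of the normalization for a single token vector follows. It assumes the common form in which ε is added to the variance inside the square root, and it uses a default of `eps=1e-5`, which is an illustrative choice rather than something specified on this page.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector, then scale by gamma and shift by beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)  # eps guards against division by zero
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# With gamma = 1 and beta = 0 the output has roughly zero mean and unit variance.
x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

The statistics are computed per token over its features, so the result does not depend on the other tokens in the sequence or on the batch.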

Residual Connections

Allow information to flow directly from earlier layers to later layers, helping combat vanishing gradients and enabling training of deeper networks.

Formula: Output = Sublayer(x) + x

Benefit: Helps gradient flow during backpropagation

Implementation: Add the input to the output of each sublayer

Usage: Applied around both self-attention and feed-forward networks
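
The formula above amounts to a thin wrapper around any sublayer. In this sketch, `residual` and the stand-in sublayer `halve` are hypothetical names; a real model would pass an attention or feed-forward function instead.

```python
def residual(sublayer, x):
    """Return Sublayer(x) + x, applied to a sequence of vectors."""
    y = sublayer(x)
    return [[a + b for a, b in zip(y_row, x_row)] for y_row, x_row in zip(y, x)]

def halve(sequence):
    """Stand-in sublayer (hypothetical): scales every value by 0.5."""
    return [[0.5 * v for v in row] for row in sequence]

x = [[2.0, 4.0], [6.0, 8.0]]
out = residual(halve, x)  # each value becomes 0.5 * v + v
```

If the sublayer outputs zeros, the wrapper returns its input unchanged, which is why the input always has a direct path to later layers. The sublayer must preserve the shape of its input for the addition to work.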

Linear Layer

Projects the decoder output to the vocabulary space, preparing for the prediction of the next token.

Input: Decoder output vectors (dimension = model size)

Output: Logits for each token in vocabulary (dimension = vocabulary size)

Parameters: Weight matrix of size [model_size × vocabulary_size]

Note: Often shares weights with the embedding layer
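
As a rough sketch of the weight-sharing case, the projection can reuse the embedding table directly: each logit is the dot product of the decoder output with one token's embedding. The function name `output_logits` and the toy table are assumptions for illustration, and any bias term is omitted.

```python
def output_logits(hidden, embedding_table):
    """Score one decoder output vector against every vocabulary embedding.

    With tied weights the projection matrix is the transpose of the embedding
    table, so each logit is a dot product with one token's embedding.
    """
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding_table]

# Toy table: vocabulary of 3 tokens, model size 2 (made-up values).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = output_logits([2.0, -1.0], embedding_table)  # one score per token
```

The input has model-size entries and the output has vocabulary-size entries, matching the [model_size × vocabulary_size] weight matrix described above.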

Softmax Layer

Converts logits into a probability distribution over the vocabulary, allowing the model to predict the most likely next token.

Function: Converts raw scores to probabilities

Formula: softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

Output: Probability distribution (all values sum to 1)

Usage: During training, the predicted distribution is compared with the true next token to compute the loss (typically cross-entropy)
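
The softmax formula and its use in training can be sketched as below. The loss shown is cross-entropy, which is the usual choice but is an assumption here since the page does not name a loss; the logits are made-up values and the function names are illustrative.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true next token."""
    return -math.log(softmax(logits)[target_index])

logits = [2.0, -1.0, 1.0]
probs = softmax(logits)
loss = cross_entropy(logits, target_index=0)
```

The loss is small when the model puts high probability on the true token and grows as that probability shrinks, so the same logits give a larger loss for a target the model considers unlikely.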
