Interactive Positional Encoding Visualization

Transformers rely on self-attention, which processes all tokens in parallel and has no built-in notion of token order. To capture order, transformers add positional encodings to the token embeddings. The scheme shown here encodes position using sine and cosine functions of different frequencies.

[Interactive plot: sine and cosine positional-encoding curves for a 6-token sequence]

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

- pos is the position of the token in the sequence

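- i is the index of the sine/cosine pair, running from 0 to d_model/2 - 1, so 2i and 2i+1 are the even and odd embedding dimensions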

- d_model is the embedding dimension size
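
The two formulas can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `positional_encoding`, the `base` parameter, and the choice of `d_model=8` are assumptions made for the example, and the sketch assumes an even `d_model`.

```python
import math

def positional_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Both members of pair i share the same angle.
            angle = pos / base ** (2 * i / d_model)
            row[2 * i] = math.sin(angle)      # PE(pos, 2i)
            row[2 * i + 1] = math.cos(angle)  # PE(pos, 2i+1)
        table.append(row)
    return table

# Hypothetical sizes: 6 tokens, 8 embedding dimensions.
pe = positional_encoding(num_positions=6, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each row of `pe` is the vector that would be added to the token embedding at that position. Position 0 always alternates 0 and 1, because sin(0) = 0 and cos(0) = 1.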

Why Sinusoidal Functions?

1. Unique patterns: Each position gets a unique encoding

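2. Fixed function: The encoding is computed rather than learned, so it can be generated for any sequence length without retraining

3. Smooth variation: Encodings change continuously with position and can be computed for positions not seen during training, though how well a model generalizes to them is not guaranteed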
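
The first two properties can be checked numerically. The sketch below is an illustration under assumed settings (`d_model=16`, 200 positions, and the helper names `encode` and `distance` are all hypothetical): it measures the smallest distance between any two position vectors, then encodes a position far beyond that range. It only shows that such an encoding can be computed, not that a trained model would handle it well.

```python
import math

def encode(pos, d_model=16, base=10000.0):
    """Sinusoidal encoding of a single position (assumes even d_model)."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def distance(a, b):
    """Euclidean distance between two encoding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Uniqueness: no two of the first 200 positions share an encoding.
n = 200
vectors = [encode(pos) for pos in range(n)]
min_gap = min(
    distance(vectors[p], vectors[q])
    for p in range(n)
    for q in range(p + 1, n)
)
print("smallest distance between two positions:", min_gap)

# Any length: a position far outside that range still gets a
# well-defined vector with every component in [-1, 1].
far = encode(10_000)
print("components in range:", all(-1.0 <= v <= 1.0 for v in far))
```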

Varying Frequencies

Different dimensions use different frequencies:

• Lower dimensions (e.g., 0, 2): Change rapidly across positions (short wavelength)

• Higher dimensions (e.g., 16, 32): Change slowly across positions (long wavelength)

Together, these frequencies create a unique "fingerprint" for each position.
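
To see how the rate of change depends on the dimension index, the wavelength of each dimension (the number of positions per full cycle) follows directly from the formulas: 2π · 10000^(2i/d_model). The sketch below prints it for a few dimensions. The helper name `wavelength` and the value `d_model = 64` are assumptions chosen for illustration.

```python
import math

def wavelength(dim, d_model, base=10000.0):
    """Number of positions per full sine/cosine cycle for dimension `dim`."""
    i = dim // 2  # dimensions 2i and 2i+1 share one frequency
    return 2 * math.pi * base ** (2 * i / d_model)

d_model = 64  # hypothetical embedding size
for dim in (0, 2, 16, 32, 62):
    print(f"dim {dim:2d}: wavelength ~ {wavelength(dim, d_model):.2f} positions")
```

The wavelength starts at 2π for dimension 0 and grows geometrically with the dimension index, approaching 10000 · 2π at the last pair. A short wavelength means the value swings quickly from one position to the next, while a long wavelength means it drifts slowly.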