Transformers rely on self-attention, which processes all tokens in parallel and, on its own, has no notion of token order. To capture the order of tokens, transformers add positional encodings to the token embeddings. The sinusoidal variant encodes each position using sine and cosine functions of different frequencies.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
- pos is the position of the token in the sequence
- i indexes the sine/cosine dimension pairs (0 ≤ i < d_model/2), so 2i and 2i+1 are the even and odd embedding dimensions
- d_model is the embedding dimension size
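The two formulas above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `sinusoidal_positional_encoding`, the `max_len` parameter, and the requirement that d_model be even are assumptions made for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) array of sinusoidal positional encodings."""
    if d_model % 2 != 0:
        # Assumption for this sketch: sine/cosine pairs need an even d_model.
        raise ValueError("d_model must be even in this sketch")
    positions = np.arange(max_len)[:, np.newaxis]         # pos, shape (max_len, 1)
    pair_index = np.arange(d_model // 2)[np.newaxis, :]   # i, shape (1, d_model/2)
    # pos / 10000^(2i/d_model), broadcast to shape (max_len, d_model/2)
    angles = positions / np.power(10000.0, 2 * pair_index / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sin(0) = 0 and cos(0) = 1, so [0. 1. 0. 1.]
```

The resulting array is added element-wise to the token embeddings, so it must have the same width d_model as the embeddings.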
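Sinusoidal encodings have several useful properties:
1. Unique patterns: Each position gets a distinct encoding vector
2. Fixed (not learned): The encoding is a deterministic function with no trainable parameters, so it can be computed for any sequence length without retraining
3. Smooth in position: The encoding varies continuously with position and is defined for positions not seen during training, although how well a model generalizes to such positions is not guaranteed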
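The first two properties can be checked numerically. The sketch below is illustrative only: the helper `encode`, the choice of d_model = 16, and the sequence lengths 100 and 1000 are assumptions for the example. It confirms that the encoding is well defined at longer lengths, but says nothing about how a trained model behaves at positions it has not seen.

```python
import numpy as np

def encode(positions, d_model=16):
    # Hypothetical helper: sinusoidal encoding for an array of position values.
    positions = np.asarray(positions, dtype=float)[:, None]
    freqs = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)
    angles = positions * freqs
    out = np.empty((positions.shape[0], d_model))
    out[:, 0::2] = np.sin(angles)
    out[:, 1::2] = np.cos(angles)
    return out

short = encode(np.arange(100))
long = encode(np.arange(1000))

# No learned parameters: the first 100 rows do not depend on sequence length.
prefix_matches = np.allclose(short, long[:100])

# Distinct encodings: count the unique rows among the 1000 positions.
num_unique = len(np.unique(np.round(long, 6), axis=0))

print(prefix_matches, num_unique)
```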
Different dimensions use different frequencies:
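• Lower dimensions (e.g., 0, 1): High frequency, so they change rapidly across positions (the wavelength starts at 2π)
• Higher dimensions (those approaching d_model): Low frequency, so they change slowly across positions (the wavelength approaches 10000 · 2π)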
This combination of fast-changing and slow-changing components creates a unique "fingerprint" for each position.
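To make the frequency spread concrete, the following sketch lists the wavelength (in positions) of each sine/cosine pair. From the formulas, pair i has angular frequency 1 / 10000^(2i/d_model), so its wavelength is 2π · 10000^(2i/d_model); the lowest dimensions therefore oscillate fastest and the highest slowest. The value d_model = 16 is an assumption chosen to keep the printout short.

```python
import numpy as np

d_model = 16  # assumed small embedding size, for readability only
pair_index = np.arange(d_model // 2)

# Wavelength of pair i: 2*pi * 10000^(2i/d_model), measured in positions.
wavelengths = 2 * np.pi * np.power(10000.0, 2 * pair_index / d_model)

for i, w in enumerate(wavelengths):
    print(f"dims {2 * i:2d},{2 * i + 1:2d}: wavelength ~ {w:10.1f} positions")
```

The first pair repeats roughly every 6.28 positions (2π), while the last pair takes thousands of positions to complete one cycle, which is why nearby positions differ mainly in the low dimensions and distant positions differ in the high ones.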