The attention mechanism is the core innovation of transformer models. It allows the model to focus on different parts of the input sequence when producing each output element. Instead of processing a sequence step by step (like RNNs), attention computes a weighted sum of all input elements based on their relevance.
Attention(Q, K, V) = softmax(QKT / √dk)V
Where:
- Q: Query vector (what we're looking for)
- K: Key vector (what might match our query)
- V: Value vector (information to retrieve)
- dk: Dimension of the key vector (scaling factor)
Each token embedding is linearly projected to create query, key, and value vectors using learned weights.
For each query token, compute its dot product with all key tokens to measure compatibility.
Scale the scores by 1/√dk and apply softmax to get a probability distribution over all tokens.
Use the attention weights to create a weighted sum of the value vectors, producing the output.