Attention Mechanism Visualization

The attention mechanism is the core innovation of transformer models. It lets the model weigh different parts of the input sequence when producing each output element. Rather than processing a sequence one step at a time, as RNNs do, attention computes each output as a weighted sum over all input elements, with the weights reflecting how relevant each element is.

Under the Hood: How Attention Works

Attention(Q, K, V) = softmax(QK^T / √d_k)V

Where:

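- Q: Query matrix, one query vector per token (what we're looking for)

- K: Key matrix, one key vector per token (what might match our query)

- V: Value matrix, one value vector per token (information to retrieve)

- d_k: Dimension of the key vectors; the scores are divided by √d_k (the scaling factor)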
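As a rough illustration of the formula above, the sketch below implements it with plain Python lists so that it has no dependencies. The function names softmax and attention are chosen here for illustration, and Q, K, and V are assumed to be lists of row vectors (one row per token). A real model would use an array library and batched tensors instead.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return outputs


# Illustrative input: one query, two identical keys, two different values.
# Identical keys give equal weights, so the output is the mean of the values.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
print(result)
```

Because both keys score the same against the query, the softmax assigns each a weight of 0.5 and the output is the average of the two value vectors.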

Self-Attention Algorithm Step by Step

Project to Q, K, V

Each token embedding is linearly projected to create query, key, and value vectors using learned weights.
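The projection step can be sketched as three matrix multiplications. The embeddings X and the weight matrices W_q, W_k, and W_v below are made-up values chosen so the arithmetic is easy to follow; in a trained model the weights are learned, and the helper name matmul is only illustrative.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Two token embeddings with an assumed model dimension of 3.
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Hypothetical projection weights mapping dimension 3 to dimension 2.
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

Q = matmul(X, W_q)  # one query vector per token
K = matmul(X, W_k)  # one key vector per token
V = matmul(X, W_v)  # one value vector per token

print(Q)  # [[3.0, 2.0], [1.0, 2.0]]
print(K)  # [[2.0, 1.0], [2.0, 0.0]]
print(V)  # [[1.0, 0.0], [0.0, 1.0]]
```

The same embeddings feed all three projections; only the weight matrices differ, which is what lets a token play a different role as a query, a key, and a value.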

Calculate Attention Scores

For each token's query vector, compute its dot product with every token's key vector to measure compatibility.
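A minimal sketch of this step, assuming queries and keys are stored as lists of row vectors. The function name attention_scores and the input values are illustrative only.

```python
def attention_scores(Q, K):
    """Return the matrix of raw scores: entry [i][j] is query i dotted with key j."""
    return [[sum(qi * ki for qi, ki in zip(q, k)) for k in K] for q in Q]


Q = [[1.0, 0.0],
     [0.0, 1.0]]
K = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 2.0]]

scores = attention_scores(Q, K)
print(scores)  # [[1.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
```

Each row holds one query's raw compatibility with every key, so two queries against three keys give a 2 by 3 score matrix.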

Scale and Softmax

Scale the scores by 1/√d_k, then apply softmax across each query's scores to get a probability distribution over all tokens.
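To show what the scaling does, the sketch below compares softmax with and without the 1/√d_k factor. The raw scores and d_k = 64 are assumed values picked for illustration, not taken from any particular model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64                       # assumed key dimension, so sqrt(d_k) = 8
raw_scores = [8.0, 0.0, -8.0]  # one query's raw dot products with three keys

scaled = [s / math.sqrt(d_k) for s in raw_scores]  # [1.0, 0.0, -1.0]

unscaled_weights = softmax(raw_scores)  # almost all weight on the first key
scaled_weights = softmax(scaled)        # weight spread across the keys

print(unscaled_weights)
print(scaled_weights)
```

Without scaling, the largest score takes nearly all of the probability mass; with scaling, the first key still dominates at roughly two thirds of the weight, but the other keys keep a meaningful share.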

Weighted Sum

Use the attention weights to create a weighted sum of the value vectors, producing the output.
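The final step can be sketched for a single query as follows. The weights and value vectors are illustrative numbers, and weighted_sum is a hypothetical helper name.

```python
def weighted_sum(weights, V):
    """Combine value vectors using one query's attention weights."""
    d_v = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]


weights = [0.5, 0.25, 0.25]  # one query's attention weights, summing to 1
V = [[4.0, 0.0],
     [0.0, 4.0],
     [4.0, 4.0]]

output = weighted_sum(weights, V)
print(output)  # [3.0, 2.0]
```

The output leans toward the first value vector because it carries the largest weight, while the other two still contribute in proportion to theirs.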