Interactive Attention Mechanism Visualization

The attention mechanism is the core innovation of transformer models. It allows the model to focus on different parts of the input sequence when producing each output element. Instead of processing a sequence step by step (as RNNs do), attention computes each output as a weighted sum of all input elements, with weights based on their relevance to that output.

Legend: low, medium, and high attention.

Under the Hood: How Attention Works

Query, Key, Value Vectors for Selected Token

Shown for the selected token: Query (Q), Key (K), Value (V), Attention, and Output.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

- Q: Query vector (what we're looking for)

- K: Key vector (what might match our query)

- V: Value vector (information to retrieve)

- d_k: Dimension of the key vector (1/√d_k is the scaling factor)
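
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: it assumes Q, K, and V are matrices stored as lists of rows (one row per token), and the function names and toy inputs are made up for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q is n x d_k, K is m x d_k, V is m x d_v (lists of rows).
    Returns (output, weights): output is n x d_v, weights is n x m.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    weights = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights

# Toy inputs (assumed for illustration): 2 queries, 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and each row of `output` is a blend of the rows of `V`. The first query matches the first and third keys equally well, so those two weights are equal and both exceed the weight on the second key.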

Self-Attention Algorithm Step by Step

1. Project to Q, K, V

Each token embedding is linearly projected into query, key, and value vectors using three separate learned weight matrices.
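
As a rough sketch of this projection step, the snippet below multiplies a small embedding matrix by three weight matrices. The names `X`, `W_q`, `W_k`, and `W_v` and all numbers are hypothetical; in a real model the weights come from training rather than being written by hand.

```python
def matmul(A, B):
    """Multiply matrix A (n x p) by matrix B (p x q), both lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
        for row in A
    ]

# Hypothetical toy values: 3 tokens with 4-dimensional embeddings.
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Stand-ins for learned weights; each maps 4 dimensions down to 2.
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_v = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, -1.0]]

Q = matmul(X, W_q)  # one query row per token
K = matmul(X, W_k)  # one key row per token
V = matmul(X, W_v)  # one value row per token
```

The same embedding matrix feeds all three projections; only the weight matrices differ, which is what lets a token play a different role as a query than it does as a key or a value.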

2. Calculate Attention Scores

For each token's query vector, compute its dot product with every token's key vector to measure compatibility.
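
The score computation can be sketched as a table of dot products, with one row per query and one column per key. The helper names and the toy `Q` and `K` values here are assumptions chosen for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_scores(Q, K):
    """Return the n x m matrix whose entry [i][j] is the dot product q_i . k_j."""
    return [[dot(q, k) for k in K] for q in Q]

# Toy query and key rows (assumed values), one row per token.
Q = [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
K = [[0.0, 2.0], [4.0, 0.0], [2.0, 2.0]]

scores = attention_scores(Q, K)
# scores[i][j] says how compatible token i's query is with token j's key.
```

A larger dot product means the query and key point in more similar directions, so that key's token will receive more attention from the querying token.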

3. Scale and Softmax

Scale the scores by 1/√d_k and apply softmax to each row to get a probability distribution over all tokens.
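
A minimal sketch of this step follows, assuming the raw scores are already available as a list of rows. The function names and the example scores are hypothetical. Subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scale_and_softmax(scores, d_k):
    """Scale every score by 1/sqrt(d_k), then apply softmax to each row."""
    scale = 1.0 / math.sqrt(d_k)
    return [softmax([scale * s for s in row]) for row in scores]

# Toy raw scores (assumed values): 3 queries by 3 keys, with d_k = 2.
scores = [[0.0, 8.0, 4.0], [8.0, 0.0, 8.0], [4.0, 8.0, 8.0]]
weights = scale_and_softmax(scores, d_k=2)
```

Every row of `weights` is non-negative and sums to 1, and the ordering of the scores within a row is preserved: the highest score gets the largest weight, and tied scores get equal weights.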

4. Weighted Sum

For each token, use its attention weights to form a weighted sum of the value vectors; this sum is the token's output.
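
The final step can be sketched as follows. The attention weights and value vectors are hand-picked toy numbers (an assumption for illustration), chosen so the arithmetic is easy to check by hand.

```python
def weighted_sum(weights, V):
    """Combine value rows: output[i] = sum over j of weights[i][j] * V[j]."""
    d_v = len(V[0])
    return [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]

# Hand-picked attention weights (each row sums to 1) and value vectors.
weights = [
    [1.0, 0.0, 0.0],    # attends only to the first token
    [0.5, 0.5, 0.0],    # splits attention between the first two tokens
    [0.25, 0.25, 0.5],  # attends mostly to the third token
]
V = [[1.0, 1.0], [2.0, -2.0], [2.0, 0.0]]

output = weighted_sum(weights, V)
# output == [[1.0, 1.0], [1.5, -0.5], [1.75, -0.25]]
```

The first output row copies the first value vector exactly, while the second is the midpoint of the first two value vectors, which shows how the weights control how much each token contributes.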