
Why Attention Is All You Need — A Dimensional and Mathematical Intuition Guide


1. Introduction

In 2017, Vaswani et al. dropped a paper titled “Attention Is All You Need,” and it quietly rewired the entire field of deep learning. Within a few years, its architecture — the Transformer — became the foundation for nearly every modern AI system: GPTs, BERT, diffusion models, even vision networks.

Before this paper, sequence modeling relied on recurrent networks (RNNs and LSTMs) that processed data step-by-step, passing information forward through time. That meant slow training, limited parallelism, and the infamous problem of forgetting information from distant tokens.

The Transformer proposed a radical shift:

Forget time; learn relationships.

Instead of iterating over tokens sequentially, each token could directly “attend” to every other token in the sequence, capturing context in a single forward pass. This attention-based mechanism not only removed recurrence but also made training fully parallelizable — perfect for GPUs.

In this post, we’ll rebuild the intuition and math behind the paper:

  • How RNNs evolved into attention mechanisms

  • What “self-attention” really computes

  • How dimensionality flows through the Q, K, V projections

  • Why multiple heads and feedforward layers matter

  • How the encoder–decoder structure ties it all together

By the end, you should be able to visualize every transformation in terms of both meaning and shape, and truly see why attention was, and still is, all we needed.

2. RNNs — What They Were and Why They Broke

Before the Transformer, nearly every sequential model used Recurrent Neural Networks (RNNs).
RNNs process sequences token by token while maintaining a hidden "memory" of what came before.


2.1 What are RNNs?

At each time step t, an RNN updates a hidden state hₜ using the current input xₜ and the previous hidden state hₜ₋₁:

hₜ = f(Wₓ · xₜ + Wₕ · hₜ₋₁)
yₜ = Wᵧ · hₜ

Here:

  • xₜ → input vector at step t

  • hₜ → hidden state (the model’s internal memory)

  • f → activation function (usually tanh or ReLU)

This creates a chain of dependencies — every output depends on all previous steps.
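The recurrence above can be sketched in a few lines of numpy. The dimensions and random weights here are illustrative assumptions, not values from any real model; the point is the forced sequential loop:

```python
import numpy as np

# Illustrative sizes: d_in = input size, d_h = hidden size, d_out = output size.
d_in, d_h, d_out = 4, 8, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_h, d_in))   # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))    # hidden-to-hidden weights
W_y = rng.normal(size=(d_out, d_h))  # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One recurrence step: h_t = tanh(W_x·x_t + W_h·h_{t-1}), y_t = W_y·h_t."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev)
    y_t = W_y @ h_t
    return h_t, y_t

# Process a toy sequence of 5 steps. Note the unavoidable sequential loop:
# step t cannot run until step t-1 has produced its hidden state.
h = np.zeros(d_h)
for t in range(5):
    x_t = rng.normal(size=d_in)
    h, y = rnn_step(x_t, h)

print(h.shape, y.shape)  # (8,) (3,)
```

The loop is the whole problem: no matter how many GPUs you have, `rnn_step` for step t+1 must wait for step t.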


2.2 The Core Problems

1. Sequential Dependency
Each step depends on the previous one. You can’t compute step t+1 until t is finished.

  • This makes training and inference very slow and non-parallelizable.

2. Vanishing and Exploding Gradients
During backpropagation, gradients pass through many time steps.

  • If weights are small, gradients vanish, and early tokens are forgotten.

  • If weights are large, gradients explode and training becomes unstable.

3. Information Decay
The hidden state is a single fixed-size vector that must store all past context.
Older information fades as new information arrives — much like trying to remember the start of a long sentence.

4. Long Inference Time
Inference must also be sequential. You can’t predict multiple tokens at once because each depends on the last output.

3. Transformer Intuition — From Memory Chains to Attention Maps

Recurrent models view sequences as chains: information flows step by step. The Transformer introduced a new way of thinking — instead of passing information through time, it lets every token directly connect to every other token.

This is the essence of attention.


3.1 The Core Idea

In an RNN, the token at position t can only use information passed from earlier positions.
In a Transformer, the token at position t can "look" at every other token in the sequence, including itself, and decide which ones are relevant.

This means:

  • No recurrence or time dependency.

  • All tokens are processed in parallel.

  • Context is learned by comparing tokens directly.


3.2 The Intuitive Analogy

Think of reading a sentence like “The animal didn’t cross the street because it was too tired.”

When you read the word “it”, you don’t have to replay the entire sentence sequentially. You instantly recall the relevant part — “the animal”.
That’s exactly what attention does: each token attends to the parts of the sequence that matter most for understanding its own meaning.


3.3 Computation as Relationships

The Transformer encodes these relationships through a set of learnable projections:

  • Each token’s embedding is projected into three spaces: Query (Q), Key (K), and Value (V).

  • The query of one token measures how much it relates to the keys of all other tokens.

  • The result is a weighted combination of their values, forming a new representation for that token.

Mathematically, for each token:

  • Attention weights = softmax(Q · Kᵀ / √d_k)

  • Output = Attention weights × V

This mechanism directly models pairwise relationships between tokens, rather than relying on sequential memory.


3.4 Why This Matters

The Transformer’s self-attention lets the model:

  • Capture global dependencies between tokens (not limited by distance).

  • Train in parallel, since all tokens attend simultaneously.

  • Retain long-term context efficiently.

In short, attention turns sequential data into a fully connected relationship graph between tokens, computed in a single step.


3.5 The Shift in Perspective

Before the Transformer, “sequence” implied “time”.
After it, “sequence” became a set of relationships.

The model doesn’t think in terms of steps; it thinks in terms of contextual relevance.
This shift is what enabled modern large language models — systems that learn meaning by understanding the relationships between words, not their positions in a timeline.

4. Input Representation

Before attention can operate, the raw tokens of a sequence must be converted into vectors that the model can process. This is done in two steps: token embeddings and positional encodings.


4.1 Token Embeddings

  • Each word or token is mapped to a dense vector of dimension d_model.

  • If the input sequence has n tokens, the embedding matrix X has shape:

X ∈ ℝ^(n × d_model)
  • These embeddings capture semantic meaning — similar words have similar vector representations.

  • At this stage, there is no positional information; the model doesn’t know which token comes first or last.


4.2 Positional Encodings

Since the Transformer does not process tokens sequentially, we need to inject information about token positions in the sequence.

The paper uses sinusoidal positional encodings:

  • For each position pos and dimension i:
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
  • This produces a vector PE of the same dimension as the token embeddings (d_model).

  • These encodings allow the model to distinguish order and learn relative positions without recurrence.
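A minimal numpy sketch of these sinusoidal encodings; the sequence length and d_model below are arbitrary illustrative choices:

```python
import numpy as np

def sinusoidal_pe(n, d_model):
    """Sinusoidal positional encodings; returns an array of shape (n, d_model)."""
    pos = np.arange(n)[:, None]                    # positions 0..n-1, column vector
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = pos / np.power(10000.0, dims / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims get cosine
    return pe

pe = sinusoidal_pe(n=10, d_model=16)
print(pe.shape)  # (10, 16)
print(pe[0])     # position 0: all sine terms are 0, all cosine terms are 1
```

Because the encodings are a fixed function of position, the same code produces valid vectors for positions never seen during training.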


4.3 Combining Embeddings and Positional Encodings

The final input to the Transformer is the sum of token embeddings and positional encodings:

E = X + PE
  • Shape of E: n × d_model

  • This combined representation contains both semantic meaning and positional information.


4.4 Intuition

  • Each token now has a vector that tells the model:

    • What the token is (embedding)

    • Where it is in the sequence (positional encoding)

  • The Transformer can now apply attention, knowing both content and position.

  • Sinusoids are used instead of learned embeddings because they allow the model to extrapolate to longer sequences than seen during training.

5. Self-Attention Mechanism (Single Head)

The key innovation of the Transformer is self-attention, a mechanism that allows each token in a sequence to consider all other tokens when forming its representation. Unlike RNNs, which rely on sequential steps to propagate information, self-attention provides each token with direct access to the entire sequence in a single step.


5.1 From Embeddings to Queries, Keys, and Values

Starting from the input embeddings E (shape n × d_model), the model generates three separate projections for each token: Query (Q), Key (K), and Value (V).

  • Query (Q) represents what the token is “looking for”

  • Key (K) represents the content of the token to be compared against queries

  • Value (V) carries the actual information of the token

These projections are obtained by multiplying E with learnable weight matrices:

Q = E · W_Q
K = E · W_K
V = E · W_V
  • W_Q, W_K ∈ ℝ^(d_model × d_k), W_V ∈ ℝ^(d_model × d_v)

  • Resulting shapes: Q, K ∈ ℝ^(n × d_k), V ∈ ℝ^(n × d_v) (in the paper, d_v = d_k)

Here, d_k is typically smaller than d_model for efficiency, but all tokens are now ready for interaction.


5.2 Computing Attention

Self-attention measures how much each token should attend to every other token. This is done in three steps:

  1. Compute similarity scores between queries and keys:
Scores = Q · Kᵀ      # shape: n × n
  2. Scale the scores by √d_k to prevent excessively large values that destabilize gradients:
Scores_scaled = Scores / √d_k

Note that this n × n score matrix encodes each token’s relationship with every other token in the sequence. We can also manipulate it manually; the decoder uses exactly this trick to mask future tokens.

  3. Apply softmax to convert scores into attention weights:
Weights = softmax(Scores_scaled)
  4. Multiply the weights by the values to get the output:
Output = Weights · V    # shape: n × d_v

  • Each row in the output corresponds to a contextualized vector for that token.

  • In essence, each token gathers information from the entire sequence, weighted by relevance.
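The four steps above can be sketched as a single numpy function. The shapes are toy values, and random matrices stand in for the Q, K, V projections that a real model would learn:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) similarity matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy shapes: n = 6 tokens, d_k = d_v = 8
rng = np.random.default_rng(1)
n, d_k = 6, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape, w.shape)  # (6, 8) (6, 6)
print(w.sum(axis=1))       # every row of the weight matrix sums to ~1.0
```

Row i of `w` is exactly the "how much does token i care about each other token" distribution described above.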


5.3 Intuition

Imagine the sentence: “The animal didn’t cross the street because it was tired.”

When processing the token “it,” self-attention allows it to look at every other word.
It assigns higher weights to “animal” (its antecedent) and lower weights to unrelated tokens like “street” or “cross.”

Unlike RNNs, this mechanism does not rely on sequential propagation, allowing the model to capture long-range dependencies efficiently.


5.4 Dimensional Flow

  • Input embeddings: E → n × d_model

  • Projections: Q, K → n × d_k; V → n × d_v

  • Attention scores: Q · Kᵀ → n × n

  • Weighted sum: Weights · V → n × d_v

Even a single attention head enables global context modeling in one step.
Every token’s new representation is a context-aware summary of the sequence.

6. Multi-Head Attention

While a single attention head allows each token to attend to the entire sequence, it has a limitation: it can only focus on one type of relationship at a time. Multi-head attention solves this by allowing the model to learn multiple types of relationships in parallel.


6.1 Why Multiple Heads?

Each attention head operates in its own subspace of the token embeddings. This allows the model to:

  • Capture different types of dependencies simultaneously (e.g., syntactic, semantic, positional)

  • Focus on multiple aspects of the sequence at the same time

  • Improve representation diversity and richness

For example, in the sentence “The animal didn’t cross the street because it was tired,” one head might focus on subject-verb relationships, while another focuses on pronoun references.


6.2 How It Works

  1. Start with the input embeddings E (shape n × d_model).

  2. For each of the h heads, project E into its own Q, K, V matrices:

Q_i = E · W_Qi
K_i = E · W_Ki
V_i = E · W_Vi
  • W_Qi, W_Ki, W_Vi ∈ ℝ^(d_model × d_k), where d_k = d_model / h

  • Each head computes attention independently:

head_i = Attention(Q_i, K_i, V_i)
  1. Concatenate the outputs of all heads:
Concat(head_1, ..., head_h)   # shape: n × d_model
  1. Project the concatenated output back to d_model with a matrix W_O:
MultiHeadOutput = Concat(heads) · W_O
  • W_O ∈ ℝ^(d_model × d_model)
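The whole multi-head computation can be sketched in numpy. Random matrices stand in for the learned projections W_Qi, W_Ki, W_Vi, and W_O; the shapes follow the text:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(E, h, rng):
    """Multi-head self-attention sketch; freshly sampled weights stand in
    for parameters that a real model would learn."""
    n, d_model = E.shape
    d_k = d_model // h                       # each head works in a subspace
    heads = []
    for _ in range(h):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(E @ W_Q, E @ W_K, E @ W_V))  # each: (n, d_k)
    concat = np.concatenate(heads, axis=-1)  # (n, d_model)
    W_O = rng.normal(size=(d_model, d_model))
    return concat @ W_O                      # project back to (n, d_model)

rng = np.random.default_rng(2)
E = rng.normal(size=(6, 64))        # n = 6 tokens, d_model = 64, h = 8 heads
out = multi_head_attention(E, h=8, rng=rng)
print(out.shape)                    # (6, 64), same shape as the input
```

Note how the output shape equals the input shape regardless of `h`, which is what makes residual connections and layer stacking possible.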


6.3 Dimensional Flow

  • Input embeddings: n × d_model

  • Each head Q, K, V: n × d_k (d_k = d_model / h)

  • Attention per head: n × d_k

  • Concatenated heads: n × d_model

  • Final projection: n × d_model

This ensures that, no matter how many heads are used, the output has the same shape as the input, allowing residual connections and smooth stacking of layers.


6.4 Intuition

  • Think of each head as a specialized lens focusing on a particular type of relationship in the sequence.

  • By combining multiple lenses, the model develops a multi-faceted understanding of the input.

  • Multi-head attention is therefore a powerful way to increase model expressiveness without increasing sequence length or token dimensions.

7. Layer Normalization

After multi-head attention, each token has a new contextual representation. Before passing it through the next sublayer (like the feedforward network), it is important to stabilize and normalize these representations. This is where Layer Normalization (LayerNorm) comes in.


7.1 Why Not Batch Normalization?

Batch Normalization works by normalizing across the batch dimension. While this is effective for images and other fixed-size inputs, it has two main issues for sequences:

  • Sequences can have different lengths. Padding tokens introduce noise if normalized across the batch.

  • Each token should maintain independence; batch statistics mix token information across samples, which is undesirable for attention-based models.

LayerNorm solves both problems by normalizing across features for each token individually, not across the batch.


7.2 How LayerNorm Works

For a token representation x ∈ ℝ^d_model:

  1. Compute the mean and variance across features:
μ = (1/d_model) * Σ x_i
σ² = (1/d_model) * Σ (x_i - μ)²
  2. Normalize and scale:
LN(x) = γ * (x - μ) / sqrt(σ² + ε) + β
  • γ and β are learnable parameters (scale and shift)

  • ε is a small constant for numerical stability

The output has the same shape as the input (d_model), but features are normalized, which stabilizes training and improves convergence.
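A direct numpy translation of the formula; here γ = 1 and β = 0, their typical initialization:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features independently, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)   # per-token mean over the feature axis
    var = x.var(axis=-1, keepdims=True)   # per-token variance over the feature axis
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 8
# 4 tokens with deliberately shifted, stretched features (mean 5, std 3)
x = np.random.default_rng(3).normal(loc=5.0, scale=3.0, size=(4, d_model))
out = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(out.mean(axis=-1))  # ~0 for every token
print(out.std(axis=-1))   # ~1 for every token
```

Because the statistics are computed along `axis=-1` (features), each token is normalized on its own: padding or other sequences in the batch never affect it.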


7.3 Intuition

  • LayerNorm ensures that each token’s vector has a consistent scale, preventing some features from dominating attention or the feedforward network.

  • Normalization is done per token, so padding or variable-length sequences do not affect other tokens.

  • Combined with residual connections, LayerNorm allows deeper networks to train effectively without vanishing or exploding gradients.



7.4 Position in the Transformer

  • LayerNorm is applied after the residual connection in each sublayer:
Output = LayerNorm(x + Sublayer(x))
  • This structure is repeated for both multi-head attention and feedforward sublayers, keeping token-wise representations stable throughout the stack.

8. Feedforward Fully Connected Network

After each token passes through multi-head attention, the Transformer applies a position-wise feedforward network (FFN). Unlike attention, which mixes information across tokens, the FFN operates independently on each token, enriching its representation with nonlinear transformations.


8.1 Structure of the Feedforward Network

For a token vector x ∈ ℝ^d_model, the FFN consists of two linear layers with a ReLU activation in between:

FFN(x) = max(0, x · W1 + b1) · W2 + b2
  • W1 ∈ ℝ^(d_model × 4*d_model)

  • W2 ∈ ℝ^(4*d_model × d_model)

  • b1 ∈ ℝ^(4*d_model), b2 ∈ ℝ^(d_model) (bias vectors)

Key points:

  • The hidden layer expands the dimension to 4×d_model, allowing the network to model more complex relationships.

  • The final layer projects back to d_model to match the residual connection.
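A minimal numpy sketch of the FFN; the random weights and toy shapes are illustrative assumptions:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4*d_model with ReLU, then project back."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # (n, 4*d_model), ReLU activation
    return hidden @ W2 + b2                # (n, d_model), matches the residual

rng = np.random.default_rng(4)
n, d_model = 6, 32
W1 = rng.normal(size=(d_model, 4 * d_model))
b1 = np.zeros(4 * d_model)
W2 = rng.normal(size=(4 * d_model, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(n, d_model))
out = ffn(x, W1, b1, W2, b2)
print(out.shape)  # (6, 32), same shape in and out
```

Because the same W1, b1, W2, b2 are applied to every row of `x`, the FFN is "position-wise": tokens never mix here.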


8.2 Role and Intuition

  • Per-token reasoning: Each token can combine features in nonlinear ways without affecting other tokens.

  • Higher-dimensional context: Expanding the dimension allows the model to create richer transformations and interactions within the token vector.

  • Complement to attention: While attention captures relationships between tokens, the FFN processes features within a token, adding expressivity.

Think of it as giving each token its own “neural mini-network” to refine its meaning after gathering context from attention.


8.3 Dimensional Flow

  1. Input to FFN: x → shape n × d_model

  2. First linear layer + ReLU: → n × 4*d_model

  3. Second linear layer: → n × d_model

  4. Residual connection ensures the output shape remains n × d_model, compatible with stacking multiple layers.

9. Encoder Architecture

The Transformer encoder is a stack of identical layers, each designed to process the entire input sequence in parallel while capturing both token relationships and per-token transformations.


9.1 The Encoder Block

Each encoder layer consists of the following components:

  1. Multi-Head Self-Attention (MHA)

    • Allows each token to attend to every other token in the sequence.

    • Captures global relationships, independent of token order (positional information comes from embeddings).

  2. Residual Connection + Layer Normalization

    • The input to the attention sublayer is added to its output:

        x1 = LayerNorm(x + MHA(x))
      
    • Stabilizes gradients and preserves the original token information.

  3. Feedforward Fully Connected Network (FFN)

    • Processes each token independently through two linear layers with ReLU, expanding and compressing dimensions:

        x2 = LayerNorm(x1 + FFN(x1))
      
  • Each encoder block maintains the input/output shape: n × d_model, allowing multiple layers to be stacked without changing dimensionality.
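Putting the pieces together, one encoder layer can be sketched end to end. This simplified version uses a single attention head, omits biases and the learnable γ/β of LayerNorm, and samples random weights in place of learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, W_Q, W_K, W_V):
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def encoder_layer(x, p):
    # Sublayer 1: multi-head attention (single head here), then Add & Norm
    x = layer_norm(x + self_attention(x, p["W_Q"], p["W_K"], p["W_V"]))
    # Sublayer 2: position-wise FFN, then Add & Norm
    ff = np.maximum(0.0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ff)

rng = np.random.default_rng(5)
n, d = 6, 32
p = {k: rng.normal(size=s) for k, s in
     {"W_Q": (d, d), "W_K": (d, d), "W_V": (d, d),
      "W1": (d, 4 * d), "W2": (4 * d, d)}.items()}

x = rng.normal(size=(n, d))
for _ in range(2):      # stack N = 2 identical layers
    x = encoder_layer(x, p)
print(x.shape)          # (6, 32), shape preserved through the whole stack
```

The shape-preserving property is what lets `encoder_layer` be applied in a plain loop: stacking N layers needs no reshaping at all.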

9.2 Stacking Layers

  • The Transformer encoder consists of N identical layers stacked on top of each other.

  • Each layer refines the token representations by alternating between:

    • Global attention (multi-head)

    • Local transformation (feedforward network)

  • This combination ensures that after several layers, each token has a rich, context-aware representation that incorporates both relationships to all other tokens and complex feature transformations.


9.3 Intuition

  • Think of the encoder as a deep contextualizer:

    • Multi-head attention gathers relevant information from the sequence.

    • FFN processes the token’s own features.

    • LayerNorm + residuals keep the flow stable.

  • Stacking N layers allows the model to refine both global and local representations repeatedly, increasing expressiveness without changing the sequence length or token dimension.

10. Decoder Architecture

The Transformer decoder is responsible for generating output sequences, such as translated text. It combines self-attention, cross-attention, and feedforward networks, while respecting the causal order of generation.


10.1 Masked Multi-Head Self-Attention

  • In the decoder, each token can only attend to previous tokens and itself.

  • This ensures autoregressive generation: future tokens are not seen during training or inference.

  • Implemented by masking the upper triangle of the attention score matrix:

Scores_masked = Q · Kᵀ / √d_k
Scores_masked[future_positions] = -∞
Weights = softmax(Scores_masked)
Output = Weights · V
  • The mask prevents information leakage from future tokens, enforcing causality.

Note: this masking is the manual manipulation of the attention-score matrix that was mentioned earlier, in the self-attention section.
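The masking recipe can be sketched in numpy; note how every attention weight above the diagonal ends up exactly zero:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V):
    """Decoder self-attention: set future positions to -inf before softmax."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)          # hide future tokens
    weights = softmax(scores)                         # exp(-inf) = 0, so no leakage
    return weights @ V, weights

# Toy shapes: n = 5 tokens, d_k = 8 (illustrative assumptions)
rng = np.random.default_rng(6)
n, d_k = 5, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, w = masked_attention(Q, K, V)
print(np.triu(w, k=1).max())  # 0.0, no weight ever lands on a future token
```

Token 0 can only attend to itself, token 1 to tokens 0 and 1, and so on; each row of `w` still sums to 1 over the allowed positions.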


10.2 Cross-Attention with Encoder Outputs

  • After masked self-attention, the decoder performs cross-attention:

    • Queries (Q) come from the decoder’s previous layer outputs

    • Keys (K) and Values (V) come from the encoder’s final outputs

CrossAttention(Q_dec, K_enc, V_enc)
  • This allows the decoder to condition its generation on the input sequence.

  • Intuitively, the decoder “looks at” the encoder’s representation to decide which information is relevant for generating the next token.


10.3 Feedforward Network and Residuals

  • Similar to the encoder, each decoder block contains a position-wise FFN with ReLU:
Output = LayerNorm(Input + FFN(Input))
  • Residual connections and layer normalization stabilize training and maintain the token dimension d_model.

10.4 Overall Decoder Block Flow

  1. Masked Multi-Head Self-Attention → Add & Norm

  2. Cross Multi-Head Attention (with encoder outputs) → Add & Norm

  3. Feedforward Network → Add & Norm

  • Each decoder layer maintains the input/output shape: n × d_model, allowing stacking of N layers.

  • The decoder can now generate sequences autoregressively, using attention to both past outputs and the encoder’s representation.


10.5 Intuition

  • Masked self-attention ensures future tokens do not influence current predictions

  • Cross-attention allows the model to condition on the input sequence

  • Feedforward networks provide local per-token reasoning, just like in the encoder

  • Together, these components allow the decoder to generate fluent, contextually correct sequences one token at a time

11. Training vs Inference

Transformers behave differently during training and inference, and understanding this distinction is key to grasping how they generate sequences efficiently.


11.1 Training

  • During training, the entire target sequence is available at once.

  • Masking ensures causal behavior: each token can only attend to previous tokens, preventing information leakage from the future.

  • The main advantages of training in parallel:

    • Fully parallelizable: all tokens in the sequence are processed simultaneously, leveraging GPU acceleration

    • Stable gradients: longer sequences no longer suffer from vanishing information as in RNNs

    • Faster convergence: context is learned for all tokens in one forward pass

  • Loss is computed for all tokens simultaneously, usually using cross-entropy between predicted and actual next-token distributions.


11.2 Inference

  • During inference, sequences are generated token by token (autoregressively).

  • For each new token:

    1. The decoder attends to all previously generated tokens using masked self-attention

    2. The decoder attends to encoder outputs via cross-attention

    3. The next token is predicted based on the output distribution

  • This process repeats until an end-of-sequence token is produced.

  • Key point: generation is sequential, but the underlying attention mechanism still allows each token to consider all past context efficiently.
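The inference loop can be sketched with a hypothetical stand-in for the decoder. `dummy_model` below is a placeholder that returns random next-token scores, not a real Transformer; the point is the shape of the loop itself:

```python
import numpy as np

VOCAB, EOS = 10, 0   # toy vocabulary size and end-of-sequence token id (assumptions)

def dummy_model(tokens, rng):
    """Placeholder for the decoder: in a real model this would run masked
    self-attention over `tokens` plus cross-attention over encoder outputs."""
    return rng.normal(size=VOCAB)   # fake next-token logits

def generate(rng, max_len=20):
    tokens = [1]                    # assumed start-of-sequence token
    while len(tokens) < max_len:
        logits = dummy_model(tokens, rng)  # conditions on all tokens so far
        next_tok = int(np.argmax(logits))  # greedy choice of the next token
        tokens.append(next_tok)
        if next_tok == EOS:                # stop once end-of-sequence appears
            break
    return tokens

seq = generate(np.random.default_rng(7))
print(seq[0] == 1, len(seq) <= 20)  # True True
```

Each pass through the loop reruns attention over the full prefix, which is why generation is sequential even though training is parallel.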


11.3 Intuition

  • Training: “See everything at once, learn relationships in parallel.”

  • Inference: “Predict one token at a time, using previous context.”

This separation explains why Transformers can train extremely fast compared to RNNs while still generating sequences autoregressively when needed.

12. Key Insights & Closing Thoughts

The Transformer architecture, introduced in “Attention Is All You Need”, represents a paradigm shift in sequence modeling. Here are the core takeaways:


12.1 Key Insights

  • No Recurrence, No Convolution: Unlike RNNs or CNNs, Transformers rely entirely on attention to model relationships between tokens.

  • Global Context via Self-Attention: Each token can attend to all others in the sequence, enabling long-range dependencies in a single step.

  • Parallel Training: Training is fully parallelizable, solving the sequential bottleneck of RNNs.

  • Separation of Concerns:

    • Attention handles global, cross-token context

    • Feedforward networks handle per-token transformations and feature reasoning

  • LayerNorm + Residuals stabilize deep architectures, allowing many stacked layers without vanishing gradients.

  • Masked Decoding: Ensures autoregressive generation during inference, while allowing the model to learn efficiently in parallel during training.


12.2 Closing Thoughts

  • Transformers have reshaped NLP and AI by providing a scalable, interpretable, and highly expressive architecture.

  • The same attention mechanisms extend beyond text: Vision Transformers, audio modeling, and even diffusion models use similar principles.

  • Intuitive takeaway: “Attention is the language of relationships.” Each token communicates with others, forming a rich, context-aware understanding of the sequence.


All sections have been put together using this video.

If you still have any queries, you can reach out to me on my LinkedIn / GitHub / Twitter.

Cheers!