Why Attention Is All You Need — A Dimensional and Mathematical Intuition Guide

1. Introduction
In 2017, Vaswani et al. dropped a paper titled “Attention Is All You Need,” and it quietly rewired the entire field of deep learning. Within a few years, its architecture — the Transformer — became the foundation for nearly every modern AI system: GPTs, BERT, diffusion models, even vision networks.
Before this paper, sequence modeling relied on recurrent networks (RNNs and LSTMs) that processed data step-by-step, passing information forward through time. That meant slow training, limited parallelism, and the infamous problem of forgetting information from distant tokens.
The Transformer proposed a radical shift:
Forget time; learn relationships.
Instead of iterating over tokens sequentially, each token could directly “attend” to every other token in the sequence, capturing context in a single forward pass. This attention-based mechanism not only removed recurrence but also made training fully parallelizable — perfect for GPUs.
In this post, we’ll rebuild the intuition and math behind the paper:
How RNNs evolved into attention mechanisms
What “self-attention” really computes
How dimensionality flows through the Q, K, V projections
Why multiple heads and feedforward layers matter
How the encoder–decoder structure ties it all together
By the end, you should be able to visualize every transformation in terms of both meaning and shape, and truly see why attention was, and still is, all we needed.
2. RNNs — What They Were and Why They Broke
Before the Transformer, nearly every sequential model used Recurrent Neural Networks (RNNs).
RNNs process sequences token by token while maintaining a hidden "memory" of what came before.
2.1 What are RNNs?
At each time step t, an RNN updates a hidden state hₜ using the current input xₜ and the previous hidden state hₜ₋₁:
hₜ = f(Wₓ · xₜ + Wₕ · hₜ₋₁)
yₜ = Wᵧ · hₜ
Here:
xₜ → input vector at step t
hₜ → hidden state (the model’s internal memory)
f → activation function (usually tanh or ReLU)
This creates a chain of dependencies — every output depends on all previous steps.
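The recurrence above can be sketched in a few lines of NumPy. The dimensions and random weights below are toy values chosen for illustration, not taken from any real model:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, W_y):
    """One RNN time step: mix the current input with the previous hidden state."""
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev)  # new hidden state
    y_t = W_y @ h_t                          # output at this step
    return h_t, y_t

# Toy dimensions: input size 3, hidden size 4, output size 2
rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))
W_h = rng.normal(size=(4, 4))
W_y = rng.normal(size=(2, 4))

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):  # a sequence of 5 input vectors
    h, y = rnn_step(x, h, W_x, W_h, W_y)  # each step must wait for the previous one
```

Notice that the loop body cannot start until the previous iteration finishes, which is exactly the sequential bottleneck discussed next.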
2.2 The Core Problems
1. Sequential Dependency
Each step depends on the previous one. You can’t compute step t+1 until t is finished.
- This makes training and inference very slow and non-parallelizable.
2. Vanishing and Exploding Gradients
During backpropagation, gradients pass through many time steps.
If weights are small, gradients vanish, and early tokens are forgotten.
If weights are large, gradients explode and training becomes unstable.
3. Information Decay
The hidden state is a single fixed-size vector that must store all past context.
Older information fades as new information arrives — much like trying to remember the start of a long sentence.
4. Long Inference Time
Inference must also be sequential. You can’t predict multiple tokens at once because each depends on the last output.
3. Transformer Intuition — From Memory Chains to Attention Maps
Recurrent models view sequences as chains: information flows step by step. The Transformer introduced a new way of thinking — instead of passing information through time, it lets every token directly connect to every other token.
This is the essence of attention.
3.1 The Core Idea
In an RNN, the token at position t can only use information passed from earlier positions.
In a Transformer, the token at position t can "look" at every other token in the sequence, including itself, and decide which ones are relevant.
This means:
No recurrence or time dependency.
All tokens are processed in parallel.
Context is learned by comparing tokens directly.
3.2 The Intuitive Analogy
Think of reading a sentence like “The animal didn’t cross the street because it was too tired.”
When you read the word “it”, you don’t have to replay the entire sentence sequentially. You instantly recall the relevant part — “the animal”.
That’s exactly what attention does: each token attends to the parts of the sequence that matter most for understanding its own meaning.
3.3 Computation as Relationships
The Transformer encodes these relationships through a set of learnable projections:
Each token’s embedding is projected into three spaces: Query (Q), Key (K), and Value (V).
The query of one token measures how much it relates to the keys of all other tokens.
The result is a weighted combination of their values, forming a new representation for that token.
Mathematically, for each token:
Attention weights = softmax(Q · Kᵀ)
Output = Attention weights × V
This mechanism directly models pairwise relationships between tokens, rather than relying on sequential memory.
3.4 Why This Matters
The Transformer’s self-attention lets the model:
Capture global dependencies between tokens (not limited by distance).
Train in parallel, since all tokens attend simultaneously.
Retain long-term context efficiently.
In short, attention turns sequential data into a fully connected relationship graph between tokens, computed in a single step.
3.5 The Shift in Perspective
Before the Transformer, “sequence” implied “time”.
After it, “sequence” became a set of relationships.
The model doesn’t think in terms of steps; it thinks in terms of contextual relevance.
This shift is what enabled modern large language models — systems that learn meaning by understanding the relationships between words, not their positions in a timeline.
4. Input Representation
Before attention can operate, the raw tokens of a sequence must be converted into vectors that the model can process. This is done in two steps: token embeddings and positional encodings.
4.1 Token Embeddings
Each word or token is mapped to a dense vector of dimension d_model. If the input sequence has n tokens, the embedding matrix X has shape:
X ∈ ℝ^(n × d_model)
These embeddings capture semantic meaning — similar words have similar vector representations.
At this stage, there is no positional information; the model doesn’t know which token comes first or last.
4.2 Positional Encodings
Since the Transformer does not process tokens sequentially, we need to inject information about token positions in the sequence.
The paper uses sinusoidal positional encodings:
- For each position pos and dimension i:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
This produces a vector PE of the same dimension as the token embeddings (d_model). These encodings allow the model to distinguish order and learn relative positions without recurrence.
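The two formulas above translate directly into code. Here is a minimal NumPy sketch that fills even dimensions with sines and odd dimensions with cosines:

```python
import numpy as np

def positional_encoding(n, d_model):
    """Sinusoidal positional encodings, shape (n, d_model)."""
    pos = np.arange(n)[:, None]            # positions 0..n-1, as a column
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 0, 2, 4, ...
    angles = pos / (10000 ** (i / d_model))  # (n, d_model/2) angle table
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)           # odd dimensions: cosine
    return pe

pe = positional_encoding(n=10, d_model=16)
```

Each position gets a unique pattern of wavelengths, so two tokens at different positions always receive different vectors.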
4.3 Combining Embeddings and Positional Encodings
The final input to the Transformer is the sum of token embeddings and positional encodings:
E = X + PE
Shape of E: n × d_model
This combined representation contains both semantic meaning and positional information.

4.4 Intuition
Each token now has a vector that tells the model:
What the token is (embedding)
Where it is in the sequence (positional encoding)
The Transformer can now apply attention, knowing both content and position.
Sinusoids are used instead of learned embeddings because they allow the model to extrapolate to longer sequences than seen during training.
5. Self-Attention Mechanism (Single Head)
The key innovation of the Transformer is self-attention, a mechanism that allows each token in a sequence to consider all other tokens when forming its representation. Unlike RNNs, which rely on sequential steps to propagate information, self-attention provides each token with direct access to the entire sequence in a single step.
5.1 From Embeddings to Queries, Keys, and Values
Starting from the input embeddings E (shape n × d_model), the model generates three separate projections for each token: Query (Q), Key (K), and Value (V).
Query (Q) represents what the token is “looking for”
Key (K) represents the content of the token to be compared against queries
Value (V) carries the actual information of the token
These projections are obtained by multiplying E with learnable weight matrices:
Q = E · W_Q
K = E · W_K
V = E · W_V
W_Q, W_K, W_V ∈ ℝ^(d_model × d_k)
Resulting shapes: Q, K, V ∈ ℝ^(n × d_k)
Here, d_k is typically smaller than d_model for efficiency, but all tokens are now ready for interaction.
5.2 Computing Attention
Self-attention measures how much each token should attend to every other token. This is done in three steps:
- Compute similarity scores between queries and keys:
Scores = Q · Kᵀ # shape: n × n
- Scale the scores by √d_k to prevent excessively large values that destabilize gradients:
Scores_scaled = Scores / √d_k

Note: the resulting n × n score matrix encodes the relationship of each token with every other token in the string. These scores can also be manipulated manually; this is exactly how future tokens are masked in the decoder block.
- Apply softmax to convert scores into attention weights:
Weights = softmax(Scores_scaled)
- Multiply the weights by the values to get the output:
Output = Weights · V # shape: n × d_v

Each row in the output corresponds to a contextualized vector for that token.
In essence, each token gathers information from the entire sequence, weighted by relevance.
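The full single-head computation (project, score, scale, softmax, weight) fits in a short NumPy sketch. The dimensions and random weights are illustrative only:

```python
import numpy as np

def self_attention(E, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V          # each (n, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) token-to-token scores
    # Row-wise softmax (subtract the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                  # (n, d_v) output, (n, n) weights

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4
E = rng.normal(size=(n, d_model))
out, weights = self_attention(E,
                              rng.normal(size=(d_model, d_k)),
                              rng.normal(size=(d_model, d_k)),
                              rng.normal(size=(d_model, d_k)))
```

Each row of `weights` sums to 1: it is the attention distribution of one token over the whole sequence.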
5.3 Intuition
Imagine the sentence: “The animal didn’t cross the street because it was tired.”
When processing the token “it,” self-attention allows it to look at every other word.
It assigns higher weights to “animal” (its antecedent) and lower weights to unrelated tokens like “street” or “cross.”
Unlike RNNs, this mechanism does not rely on sequential propagation, allowing the model to capture long-range dependencies efficiently.
5.4 Dimensional Flow
Input embeddings: E → n × d_model
Projections: Q, K, V → n × d_k
Attention scores: Q · Kᵀ → n × n
Weighted sum: Weights · V → n × d_v (here d_v = d_k)
Even a single attention head enables global context modeling in one step.
Every token’s new representation is a context-aware summary of the sequence.
6. Multi-Head Attention
While a single attention head allows each token to attend to the entire sequence, it has a limitation: it can only focus on one type of relationship at a time. Multi-head attention solves this by allowing the model to learn multiple types of relationships in parallel.
6.1 Why Multiple Heads?
Each attention head operates in its own subspace of the token embeddings. This allows the model to:
Capture different types of dependencies simultaneously (e.g., syntactic, semantic, positional)
Focus on multiple aspects of the sequence at the same time
Improve representation diversity and richness
For example, in the sentence “The animal didn’t cross the street because it was tired,” one head might focus on subject-verb relationships, while another focuses on pronoun references.
6.2 How It Works
Start with the input embeddings E (shape n × d_model). For each of the h heads, project E into its own Q, K, V matrices:
Q_i = E · W_Qi
K_i = E · W_Ki
V_i = E · W_Vi
W_Qi, W_Ki, W_Vi ∈ ℝ^(d_model × d_k), where d_k = d_model / h
Each head computes attention independently:
head_i = Attention(Q_i, K_i, V_i)
- Concatenate the outputs of all heads:
Concat(head_1, ..., head_h) # shape: n × d_model
- Project the concatenated output back to d_model with a matrix W_O:
MultiHeadOutput = Concat(heads) · W_O
W_O ∈ ℝ^(d_model × d_model)
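Putting the steps together, here is a minimal NumPy sketch of multi-head attention. For compactness it slices one shared projection matrix into per-head blocks, which is mathematically equivalent to h separate W_Qi matrices; sizes and weights are toy values:

```python
import numpy as np

def multi_head_attention(E, W_Q, W_K, W_V, W_O, h):
    """Multi-head attention: h heads, each working in a d_model/h subspace."""
    n, d_model = E.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)   # this head's slice of the projections
        Q, K, V = E @ W_Q[:, sl], E @ W_K[:, sl], E @ W_V[:, sl]
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)   # softmax over keys
        heads.append(w @ V)                  # (n, d_k) per head
    concat = np.concatenate(heads, axis=-1)  # (n, d_model)
    return concat @ W_O                      # project back to d_model

rng = np.random.default_rng(0)
n, d_model, h = 6, 8, 2
E = rng.normal(size=(n, d_model))
W = lambda: rng.normal(size=(d_model, d_model))
out = multi_head_attention(E, W(), W(), W(), W(), h)
```

The output shape matches the input shape, which is what makes residual connections and layer stacking possible.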

6.3 Dimensional Flow
Input embeddings: n × d_model
Each head Q, K, V: n × d_k (d_k = d_model / h)
Attention per head: n × d_k
Concatenated heads: n × d_model
Final projection: n × d_model
This ensures that, no matter how many heads are used, the output has the same shape as the input, allowing residual connections and smooth stacking of layers.
6.4 Intuition
Think of each head as a specialized lens focusing on a particular type of relationship in the sequence.
By combining multiple lenses, the model develops a multi-faceted understanding of the input.
Multi-head attention is therefore a powerful way to increase model expressiveness without increasing sequence length or token dimensions.
7. Layer Normalization
After multi-head attention, each token has a new contextual representation. Before passing it through the next sublayer (like the feedforward network), it is important to stabilize and normalize these representations. This is where Layer Normalization (LayerNorm) comes in.
7.1 Why Not Batch Normalization?
Batch Normalization works by normalizing across the batch dimension. While this is effective for images and other fixed-size inputs, it has two main issues for sequences:
Sequences can have different lengths. Padding tokens introduce noise if normalized across the batch.
Each token should maintain independence; batch statistics mix token information across samples, which is undesirable for attention-based models.
LayerNorm solves both problems by normalizing across features for each token individually, not across the batch.
7.2 How LayerNorm Works
For a token representation x ∈ ℝ^d_model:
- Compute the mean and variance across features:
μ = (1/d_model) * Σ x_i
σ² = (1/d_model) * Σ (x_i - μ)²
- Normalize and scale:
LN(x) = γ * (x - μ) / sqrt(σ² + ε) + β
γ and β are learnable parameters (scale and shift)
ε is a small constant for numerical stability
The output has the same shape as the input (d_model), but features are normalized, which stabilizes training and improves convergence.
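The formulas above amount to a few lines of NumPy; note that the statistics are computed along the feature axis, per token:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)   # per-token mean over features
    var = x.var(axis=-1, keepdims=True)   # per-token variance over features
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

d_model = 8
x = np.arange(2 * d_model, dtype=float).reshape(2, d_model)  # two toy tokens
y = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
```

With γ = 1 and β = 0, every row of the output has mean 0 regardless of the scale of the input row, which is the stabilizing property the text describes.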
7.3 Intuition
LayerNorm ensures that each token’s vector has a consistent scale, preventing some features from dominating attention or the feedforward network.
Normalization is done per token, so padding or variable-length sequences do not affect other tokens.
Combined with residual connections, LayerNorm allows deeper networks to train effectively without vanishing or exploding gradients.
7.4 Position in the Transformer
- LayerNorm is applied after the residual connection in each sublayer:
Output = LayerNorm(x + Sublayer(x))
- This structure is repeated for both multi-head attention and feedforward sublayers, keeping token-wise representations stable throughout the stack.
8. Feedforward Fully Connected Network
After each token passes through multi-head attention, the Transformer applies a position-wise feedforward network (FFN). Unlike attention, which mixes information across tokens, the FFN operates independently on each token, enriching its representation with nonlinear transformations.
8.1 Structure of the Feedforward Network
For a token vector x ∈ ℝ^d_model, the FFN consists of two linear layers with a ReLU activation in between:
FFN(x) = max(0, x · W1 + b1) · W2 + b2
W1 ∈ ℝ^(d_model × 4*d_model)
W2 ∈ ℝ^(4*d_model × d_model)
b1 ∈ ℝ^(4*d_model), b2 ∈ ℝ^(d_model) (bias vectors)
Key points:
The hidden layer expands the dimension to 4×d_model, allowing the network to model more complex relationships.
The final layer projects back to d_model to match the residual connection.
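The expand-ReLU-contract structure is a direct translation of the formula above. The dimensions and random weights here are toy values for illustration:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward: expand to 4*d_model, apply ReLU, project back."""
    hidden = np.maximum(0, x @ W1 + b1)  # (n, 4*d_model), ReLU = max(0, .)
    return hidden @ W2 + b2              # (n, d_model)

rng = np.random.default_rng(0)
n, d_model = 6, 8
x = rng.normal(size=(n, d_model))
out = ffn(x,
          rng.normal(size=(d_model, 4 * d_model)), np.zeros(4 * d_model),
          rng.normal(size=(4 * d_model, d_model)), np.zeros(d_model))
```

Because the same W1, W2 are applied to every row independently, no information moves between tokens here; that mixing happened in the attention sublayer.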
8.2 Role and Intuition
Per-token reasoning: Each token can combine features in nonlinear ways without affecting other tokens.
Higher-dimensional context: Expanding the dimension allows the model to create richer transformations and interactions within the token vector.
Complement to attention: While attention captures relationships between tokens, the FFN processes features within a token, adding expressivity.
Think of it as giving each token its own “neural mini-network” to refine its meaning after gathering context from attention.
8.3 Dimensional Flow
Input to FFN: x → shape n × d_model
First linear layer + ReLU: → n × 4*d_model
Second linear layer: → n × d_model
Residual connection ensures the output shape remains n × d_model, compatible with stacking multiple layers.
9. Encoder Architecture
The Transformer encoder is a stack of identical layers, each designed to process the entire input sequence in parallel while capturing both token relationships and per-token transformations.
9.1 The Encoder Block

Each encoder layer consists of the following components:
Multi-Head Self-Attention (MHA)
Allows each token to attend to every other token in the sequence.
Captures global relationships, independent of token order (positional information comes from embeddings).
Residual Connection + Layer Normalization
The input to the attention sublayer is added to its output:
x1 = LayerNorm(x + MHA(x))
This stabilizes gradients and preserves the original token information.
Feedforward Fully Connected Network (FFN)
Processes each token independently through two linear layers with ReLU, expanding and compressing dimensions:
x2 = LayerNorm(x1 + FFN(x1))
- Each encoder block maintains the input/output shape: n × d_model, allowing multiple layers to be stacked without changing dimensionality.
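The block structure above can be sketched with the sublayers passed in as functions. To keep the example self-contained, the attention and FFN sublayers below are identity stand-ins, and this LayerNorm omits the learnable γ, β; the point is only the Add & Norm wiring and shape preservation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token normalization over the feature axis (gamma/beta omitted)."""
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, attention, ffn):
    """One encoder layer: two sublayers, each wrapped in Add & Norm."""
    x1 = layer_norm(x + attention(x))  # sublayer 1: multi-head self-attention
    x2 = layer_norm(x1 + ffn(x1))      # sublayer 2: position-wise FFN
    return x2                          # same shape as x, so layers stack

# Identity stand-ins just to show that shapes survive N stacked layers
n, d_model, N = 6, 8, 3
x = np.random.default_rng(0).normal(size=(n, d_model))
for _ in range(N):
    x = encoder_block(x, attention=lambda t: t, ffn=lambda t: t)
```

Swapping the stand-ins for real multi-head attention and FFN functions gives the full encoder layer; the n × d_model shape is preserved either way.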
9.2 Stacking Layers
The Transformer encoder consists of N identical layers stacked on top of each other.
Each layer refines the token representations by alternating between:
Global attention (multi-head)
Local transformation (feedforward network)
This combination ensures that after several layers, each token has a rich, context-aware representation that incorporates both relationships to all other tokens and complex feature transformations.
9.3 Intuition
Think of the encoder as a deep contextualizer:
Multi-head attention gathers relevant information from the sequence.
FFN processes the token’s own features.
LayerNorm + residuals keep the flow stable.
Stacking N layers allows the model to refine both global and local representations repeatedly, increasing expressiveness without changing the sequence length or token dimension.
10. Decoder Architecture

The Transformer decoder is responsible for generating output sequences, such as translated text. It combines self-attention, cross-attention, and feedforward networks, while respecting the causal order of generation.
10.1 Masked Multi-Head Self-Attention
In the decoder, each token can only attend to previous tokens and itself.
This ensures autoregressive generation: future tokens are not seen during training or inference.
Implemented by masking the upper triangle of the attention score matrix:
Scores_masked = Q · Kᵀ / √d_k
Scores_masked[future_positions] = -∞
Weights = softmax(Scores_masked)
Output = Weights · V
- The mask prevents information leakage from future tokens, enforcing causality.
Note: this is the manual manipulation of the attention scores that was mentioned earlier in the self-attention section.
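The masking recipe above is a one-line change to ordinary attention: set the upper triangle of the score matrix to −∞ before the softmax, so those weights become exactly zero. A minimal NumPy sketch with toy values:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal self-attention: each position attends only to itself and earlier positions."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future
    scores = np.where(mask, -np.inf, scores)          # future positions get -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax; future weights are exactly 0
    return w @ V, w

rng = np.random.default_rng(0)
n, d_k = 5, 4
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, w = masked_self_attention(Q, K, V)
```

The weight matrix `w` comes out lower-triangular: row t distributes its attention only over positions 0..t, which is the causality guarantee.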
10.2 Cross-Attention with Encoder Outputs
After masked self-attention, the decoder performs cross-attention:
Queries (Q) come from the decoder’s previous layer outputs
Keys (K) and Values (V) come from the encoder’s final outputs
CrossAttention(Q_dec, K_enc, V_enc)
This allows the decoder to condition its generation on the input sequence.
Intuitively, the decoder “looks at” the encoder’s representation to decide which information is relevant for generating the next token.
10.3 Feedforward Network and Residuals
- Similar to the encoder, each decoder block contains a position-wise FFN with ReLU:
Output = LayerNorm(Input + FFN(Input))
- Residual connections and layer normalization stabilize training and maintain the token dimension d_model.
10.4 Overall Decoder Block Flow
Masked Multi-Head Self-Attention → Add & Norm
Cross Multi-Head Attention (with encoder outputs) → Add & Norm
Feedforward Network → Add & Norm
Each decoder layer maintains the input/output shape: n × d_model, allowing stacking of N layers.
The decoder can now generate sequences autoregressively, using attention to both past outputs and the encoder’s representation.
10.5 Intuition
Masked self-attention ensures future tokens do not influence current predictions
Cross-attention allows the model to condition on the input sequence
Feedforward networks provide local per-token reasoning, just like in the encoder
Together, these components allow the decoder to generate fluent, contextually correct sequences one token at a time
11. Training vs Inference
Transformers behave differently during training and inference, and understanding this distinction is key to grasping how they generate sequences efficiently.
11.1 Training
During training, the entire target sequence is available at once.
Masking ensures causal behavior: each token can only attend to previous tokens, preventing information leakage from the future.
The main advantages of training in parallel:
Fully parallelizable: all tokens in the sequence are processed simultaneously, leveraging GPU acceleration
Stable gradients: longer sequences no longer suffer from vanishing information as in RNNs
Faster convergence: context is learned for all tokens in one forward pass
Loss is computed for all tokens simultaneously, usually using cross-entropy between predicted and actual next-token distributions.
11.2 Inference
During inference, sequences are generated token by token (autoregressively).
For each new token:
The decoder attends to all previously generated tokens using masked self-attention
The decoder attends to encoder outputs via cross-attention
The next token is predicted based on the output distribution
This process repeats until an end-of-sequence token is produced.
Key point: generation is sequential, but the underlying attention mechanism still allows each token to consider all past context efficiently.
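The generation loop can be sketched abstractly. The `decoder_step` function here is a hypothetical stand-in for a full decoder forward pass (it would internally run masked self-attention and cross-attention); the stub below always predicts the EOS token, just to show the loop's shape:

```python
def greedy_decode(decoder_step, encoder_out, bos_id, eos_id, max_len=50):
    """Autoregressive inference: feed each predicted token back in until EOS."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(tokens, encoder_out)  # attends to past tokens + encoder
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Stub decoder that always scores token 2 (our EOS) highest
out = greedy_decode(lambda toks, enc: [0.1, 0.2, 0.9],
                    encoder_out=None, bos_id=0, eos_id=2)
# out == [0, 2]
```

In practice `decoder_step` would be a real model and the argmax might be replaced by sampling or beam search, but the one-token-at-a-time structure is the same.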
11.3 Intuition
Training: “See everything at once, learn relationships in parallel.”
Inference: “Predict one token at a time, using previous context.”
This separation explains why Transformers can train extremely fast compared to RNNs while still generating sequences autoregressively when needed.
12. Key Insights & Closing Thoughts
The Transformer architecture, introduced in “Attention Is All You Need”, represents a paradigm shift in sequence modeling. Here are the core takeaways:
12.1 Key Insights
No Recurrence, No Convolution: Unlike RNNs or CNNs, Transformers rely entirely on attention to model relationships between tokens.
Global Context via Self-Attention: Each token can attend to all others in the sequence, enabling long-range dependencies in a single step.
Parallel Training: Training is fully parallelizable, solving the sequential bottleneck of RNNs.
Separation of Concerns:
Attention handles global, cross-token context
Feedforward networks handle per-token transformations and feature reasoning
LayerNorm + Residuals stabilize deep architectures, allowing many stacked layers without vanishing gradients.
Masked Decoding: Ensures autoregressive generation during inference, while allowing the model to learn efficiently in parallel during training.
12.2 Closing Thoughts
Transformers have reshaped NLP and AI by providing a scalable, interpretable, and highly expressive architecture.
The same attention mechanisms extend beyond text: Vision Transformers, audio modeling, and even diffusion models use similar principles.
Intuitive takeaway: “Attention is the language of relationships.” Each token communicates with others, forming a rich, context-aware understanding of the sequence.
All sections have been put together using this video.
If you still have any queries, you can reach out to me on my LinkedIn / GitHub / Twitter.
Cheers!



