# Why Attention Is All You Need — A Dimensional and Mathematical Intuition Guide

## 1\. Introduction

In 2017, Vaswani et al. dropped a paper titled *“Attention Is All You Need,”* and it quietly rewired the entire field of deep learning. Within a few years, its architecture, the **Transformer**, became the foundation for much of modern AI: GPT-style models, BERT, Vision Transformers, and many diffusion models.

Before this paper, sequence modeling relied on **recurrent networks (RNNs and LSTMs)** that processed data *step-by-step*, passing information forward through time. That meant slow training, limited parallelism, and the infamous problem of forgetting information from distant tokens.

The Transformer proposed a radical shift:

> *Forget time; learn relationships.*

Instead of iterating over tokens sequentially, each token could directly **“attend” to every other token** in the sequence, capturing context in a *single forward pass*. This attention-based mechanism not only removed recurrence but also made training fully parallelizable — perfect for GPUs.

In this post, we’ll rebuild the intuition and math behind the paper:

* How RNNs evolved into attention mechanisms
    
* What “self-attention” really computes
    
* How dimensionality flows through the Q, K, V projections
    
* Why multiple heads and feedforward layers matter
    
* How the encoder–decoder structure ties it all together
    

By the end, you should be able to **visualize every transformation in terms of both meaning and shape**, and truly see why *attention was, and still is, all we needed.*

## 2\. RNNs — What They Were and Why They Broke

Before the Transformer, nearly every sequential model used **Recurrent Neural Networks (RNNs)**.  
RNNs process sequences token by token while maintaining a hidden "memory" of what came before.

---

### 2.1 What are RNNs?

At each time step *t*, an RNN updates a hidden state **hₜ** using the current input **xₜ** and the previous hidden state **hₜ₋₁**:

**hₜ = f(Wₓ · xₜ + Wₕ · hₜ₋₁)**  
**yₜ = Wᵧ · hₜ**

Here:

* *xₜ* → input vector at step *t*
    
* *hₜ* → hidden state (the model’s internal memory)
    
* *f* → activation function (usually tanh or ReLU)
    

This creates a chain of dependencies — every output depends on all previous steps.
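
To make the recurrence concrete, here is a minimal sketch of the update above in plain Python. The weight values and sizes are made up for illustration, and `tanh` is assumed as the activation `f`.

```python
import math

def rnn_forward(xs, W_x, W_h, h0):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in xs:  # strictly sequential: step t needs h from step t-1
        pre = [
            sum(W_x[i][j] * x[j] for j in range(len(x)))
            + sum(W_h[i][j] * h[j] for j in range(len(h)))
            for i in range(len(h))
        ]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

# Toy sizes (assumed): 2-dimensional inputs, 3-dimensional hidden state.
W_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = rnn_forward(xs, W_x, W_h, h0=[0.0, 0.0, 0.0])
print(len(states), len(states[0]))  # 3 3
```

The `for` loop is the bottleneck: no iteration can start before the previous one has produced `h`.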

---

### 2.2 The Core Problems

**1\. Sequential Dependency**  
Each step depends on the previous one. You can’t compute step *t+1* until *t* is finished.

* This makes training and inference very slow and non-parallelizable.
    

**2\. Vanishing and Exploding Gradients**  
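During backpropagation, the gradient is multiplied by the recurrent weights (and the activation's derivative) once for every time step it passes through.

* If those factors are consistently smaller than 1, gradients vanish and early tokens are effectively forgotten.
    
* If they are consistently larger than 1, gradients explode and training becomes unstable.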
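
A rough way to see both regimes is to collapse the recurrence to a single scalar factor `w` that the gradient is multiplied by at every step. This is a deliberate simplification (real RNNs multiply by a matrix and an activation derivative), but the compounding effect is the same.

```python
def gradient_scale(w, steps):
    """Factor applied to a gradient after `steps` multiplications by w."""
    return w ** steps

for w in (0.9, 1.0, 1.1):
    # 0.9 shrinks toward zero, 1.1 grows past 100, 1.0 stays put
    print(w, gradient_scale(w, 50))
```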
    

**3\. Information Decay**  
The hidden state is a single fixed-size vector that must store *all* past context.  
Older information fades as new information arrives — much like trying to remember the start of a long sentence.

**4\. Long Inference Time**  
Inference must also be sequential. You can’t predict multiple tokens at once because each depends on the last output.

## 3\. Transformer Intuition — From Memory Chains to Attention Maps

Recurrent models view sequences as chains: information flows step by step. The Transformer introduced a new way of thinking — instead of passing information through time, it lets every token directly connect to every other token.

This is the essence of **attention**.

---

### 3.1 The Core Idea

In an RNN, the token at position *t* can only use information passed from earlier positions.  
In a Transformer, the token at position *t* can "look" at every other token in the sequence, including itself, and decide **which ones are relevant**.

This means:

* No recurrence or time dependency.
    
* All tokens are processed **in parallel**.
    
* Context is learned by comparing tokens directly.
    

---

### 3.2 The Intuitive Analogy

Think of reading a sentence like “The animal didn’t cross the street because it was too tired.”

When you read the word “it”, you don’t have to replay the entire sentence sequentially. You instantly recall the relevant part — “the animal”.  
That’s exactly what attention does: each token **attends** to the parts of the sequence that matter most for understanding its own meaning.

---

### 3.3 Computation as Relationships

The Transformer encodes these relationships through a set of **learnable projections**:

* Each token’s embedding is projected into three spaces: **Query (Q)**, **Key (K)**, and **Value (V)**.
    
* The query of one token measures how much it relates to the keys of all other tokens.
    
* The result is a weighted combination of their values, forming a new representation for that token.
    

Mathematically, for each token:

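* Attention weights = softmax(Q · Kᵀ / √d\_k), where d\_k is the dimension of the query and key vectors (the scaling is explained in Section 5.2)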
    
* Output = Attention weights × V
    

This mechanism directly models pairwise relationships between tokens, rather than relying on sequential memory.

---

### 3.4 Why This Matters

The Transformer’s self-attention lets the model:

* Capture **global dependencies** between tokens (not limited by distance).
    
* Train **in parallel**, since all tokens attend simultaneously.
    
* Retain **long-term context** efficiently.
    

In short, attention turns sequential data into a **fully connected relationship graph** between tokens, computed in a single step.

---

### 3.5 The Shift in Perspective

Before the Transformer, “sequence” implied “time”.  
After it, “sequence” became a **set of relationships**.

The model doesn’t think in terms of steps; it thinks in terms of **contextual relevance**.  
This shift is what enabled modern large language models: systems that learn meaning from the *relationships between words* rather than by stepping through them one at a time.

## 4\. Input Representation

Before attention can operate, the raw tokens of a sequence must be converted into vectors that the model can process. This is done in two steps: **token embeddings** and **positional encodings**.

---

### 4.1 Token Embeddings

* Each word or token is mapped to a **dense vector** of dimension `d_model`.
    
* If the input sequence has `n` tokens, the embedding matrix `X` has shape:
    

```elixir
X ∈ ℝ^(n × d_model)
```

* These embeddings capture semantic meaning — similar words have similar vector representations.
    
* At this stage, there is **no positional information**; the model doesn’t know which token comes first or last.
    

---

### 4.2 Positional Encodings

Since the Transformer **does not process tokens sequentially**, we need to inject information about **token positions** in the sequence.

The paper uses **sinusoidal positional encodings**:

* For each position `pos` and dimension-pair index `i` (so `2i` and `2i+1` are the even and odd columns):
    

```elixir
PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
```

* This produces a vector `PE` of the same dimension as the token embeddings (`d_model`).
    
* These encodings allow the model to **distinguish order** and learn relative positions without recurrence.
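
The formula above can be sketched in a few lines of plain Python. The function name `positional_encoding` and the toy sizes are assumptions for illustration; the loop variable `k` plays the role of `2i` in the formula.

```python
import math

def positional_encoding(n, d_model):
    """Return an n x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(n)]
    for pos in range(n):
        for k in range(0, d_model, 2):          # k = 2i, the even column index
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe

pe = positional_encoding(n=4, d_model=6)
print(len(pe), len(pe[0]))  # 4 6
print(pe[0])                # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0s and 1s because sin(0) = 0 and cos(0) = 1; later positions rotate each sine/cosine pair at a different frequency.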
    

---

### 4.3 Combining Embeddings and Positional Encodings

The final input to the Transformer is the **sum** of token embeddings and positional encodings:

```elixir
E = X + PE
```

* Shape of `E`: `n × d_model`
    
* This combined representation contains both **semantic meaning** and **positional information**.
    

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1759737286219/c14a4dd7-cac4-4c76-bb91-d592cc82c617.png align="center")

---

### 4.4 Intuition

* Each token now has a vector that tells the model:
    
    * *What the token is* (embedding)
        
    * *Where it is in the sequence* (positional encoding)
        
* The Transformer can now apply **attention**, knowing both content and position.
    
* The paper chose sinusoids over learned positional embeddings on the hypothesis that they may let the model **extrapolate to longer sequences** than seen during training; in its experiments, the two variants performed nearly identically.
    

## 5\. Self-Attention Mechanism (Single Head)

The key innovation of the Transformer is **self-attention**, a mechanism that allows each token in a sequence to consider all other tokens when forming its representation. Unlike RNNs, which rely on sequential steps to propagate information, self-attention provides each token with **direct access to the entire sequence** in a single step.

---

### 5.1 From Embeddings to Queries, Keys, and Values

Starting from the input embeddings `E` (shape `n × d_model`), the model generates three separate projections for each token: **Query (Q)**, **Key (K)**, and **Value (V)**.

* **Query (Q)** represents what the token is “looking for”
    
* **Key (K)** represents the content of the token to be compared against queries
    
* **Value (V)** carries the actual information of the token
    

These projections are obtained by multiplying `E` with learnable weight matrices:

```elixir
Q = E · W_Q
K = E · W_K
V = E · W_V
```

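* W\_Q, W\_K ∈ ℝ^(d\_model × d\_k) and W\_V ∈ ℝ^(d\_model × d\_v)
    
* Resulting shapes: Q, K ∈ ℝ^(n × d\_k) and V ∈ ℝ^(n × d\_v)
    

Here, `d_k` and `d_v` are typically smaller than `d_model` (the paper sets d\_v = d\_k), which keeps the projections cheap to compute. With Q, K, and V in hand, the tokens are ready to interact.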
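
As a shape check, here is a small sketch of the three projections using nested lists. The random weights stand in for learned parameters, and the sizes (`n = 6`, `d_model = 8`, `d_k = 4`) are arbitrary toy values.

```python
import random

def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d_model, d_k = 6, 8, 4            # toy sizes, chosen only for illustration

E = random_matrix(n, d_model, rng)   # stand-in for embeddings + positional encodings
W_Q = random_matrix(d_model, d_k, rng)
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

Q, K, V = matmul(E, W_Q), matmul(E, W_K), matmul(E, W_V)
for name, M in (("Q", Q), ("K", K), ("V", V)):
    print(name, len(M), "x", len(M[0]))   # each is 6 x 4
```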

---

### 5.2 Computing Attention

Self-attention measures **how much each token should attend to every other token**. This is done in four steps:

1. Compute similarity scores between queries and keys:
    

```elixir
Scores = Q · Kᵀ      # shape: n × n
```

2. Divide the scores by √d\_k. Without this, dot products grow with d\_k and push the softmax into regions where its gradients become extremely small:
    

```elixir
Scores_scaled = Scores / √d_k
```

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1759737382028/fa2a01e0-735a-44d8-8500-ee4e2f100b6b.png align="center")

**Note:** the 6 × 6 matrix in the figure above holds the relationship score of each token with every other token in the sequence. These scores can also be edited by hand before the softmax, which is exactly how future tokens are masked in the decoder block.

3. Apply softmax to each row to convert scores into attention weights (each row sums to 1):
    

```elixir
Weights = softmax(Scores_scaled)
```

4. Multiply the weights by the values to get the output:
    

```elixir
Output = Weights · V    # shape: n × d_v
```

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1759737416795/747e8ab4-119d-4441-98e9-548fb45f7859.png align="center")

* Each row in the output corresponds to a **contextualized vector** for that token.
    
* In essence, each token gathers information from the entire sequence, weighted by relevance.
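
The four steps fit in one short function. This is a minimal sketch in plain Python rather than an optimized implementation; the toy `Q`, `K`, `V` values are invented, with `d_k = 2` and `d_v = 4` so the two dimensions are easy to tell apart.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: n x d_k; V: n x d_v."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]                    # n x n
    weights = [softmax(row) for row in scores]                    # rows sum to 1
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
               for j in range(len(V[0]))] for w_row in weights]   # n x d_v
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
output, weights = attention(Q, K, V)
print(len(weights), len(weights[0]))  # 3 3
print(len(output), len(output[0]))    # 3 4
```

Because each row of `V` here is a one-hot vector, the first three entries of every output row are simply that token's attention weights, which makes the weighted sum easy to inspect.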
    

---

### 5.3 Intuition

Imagine the sentence: “The animal didn’t cross the street because it was tired.”

When processing the token “it,” self-attention allows it to look at every other word.  
A trained model would typically assign higher weights to “animal” (its antecedent) and lower weights to unrelated tokens like “street” or “cross.”

Unlike RNNs, this mechanism **does not rely on sequential propagation**, allowing the model to capture long-range dependencies efficiently.

---

### 5.4 Dimensional Flow

* **Input embeddings:** `E` → n × d\_model
    
* **Projections:** Q, K → n × d\_k; V → n × d\_v
    
* **Attention scores:** Q · Kᵀ → n × n
    
* **Weighted sum:** Weights · V → n × d\_v
    

Even a single attention head enables **global context modeling** in one step.  
Every token’s new representation is a **context-aware summary** of the sequence.

## 6\. Multi-Head Attention

While a single attention head allows each token to attend to the entire sequence, it has a limitation: it produces only one set of attention weights per token, so different kinds of relationships get averaged together. **Multi-head attention** solves this by allowing the model to learn multiple types of relationships in parallel.

---

### 6.1 Why Multiple Heads?

Each attention head operates in its own subspace of the token embeddings. This allows the model to:

* Capture different types of dependencies simultaneously (e.g., syntactic, semantic, positional)
    
* Focus on multiple aspects of the sequence at the same time
    
* Improve representation diversity and richness
    

For example, in the sentence “The animal didn’t cross the street because it was tired,” one head might focus on **subject-verb relationships**, while another focuses on **pronoun references**.

---

### 6.2 How It Works

1. Start with the input embeddings `E` (shape `n × d_model`).
    
2. For each of the `h` heads, project `E` into its own **Q, K, V** matrices:
    

```elixir
Q_i = E · W_Qi
K_i = E · W_Ki
V_i = E · W_Vi
```

* W\_Qi, W\_Ki, W\_Vi ∈ ℝ^(d\_model × d\_k), where d\_k = d\_v = d\_model / h (the paper uses h = 8 and d\_model = 512, so d\_k = 64)
    
* Each head computes attention independently:
    

```elixir
head_i = Attention(Q_i, K_i, V_i)
```

3. Concatenate the outputs of all heads:
    

```elixir
Concat(head_1, ..., head_h)   # shape: n × d_model
```

4. Project the concatenated output back to `d_model` with a matrix W\_O:
    

```elixir
MultiHeadOutput = Concat(heads) · W_O
```

* W\_O ∈ ℝ^(d\_model × d\_model)
    
    ![](https://cdn.hashnode.com/res/hashnode/image/upload/v1759737465037/484314c8-4732-41e5-be85-443ba92ea7c0.png align="center")
    

---

### 6.3 Dimensional Flow

* **Input embeddings:** n × d\_model
    
* **Each head Q, K, V:** n × d\_k (d\_k = d\_model / h)
    
* **Attention per head:** n × d\_k
    
* **Concatenated heads:** n × d\_model
    
* **Final projection:** n × d\_model
    

This ensures that, no matter how many heads are used, the output has the same shape as the input, allowing **residual connections** and smooth stacking of layers.
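
The whole flow (project per head, attend, concatenate, project back) can be sketched as follows. Weights are random placeholders for learned parameters, and the helper names are assumptions made for this example.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def multi_head_attention(E, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [attention(matmul(E, W_Q), matmul(E, W_K), matmul(E, W_V))
               for W_Q, W_K, W_V in heads]                 # h matrices of n x d_k
    concat = [sum((out[t] for out in outputs), [])         # n x (h * d_k)
              for t in range(len(E))]
    return matmul(concat, W_O)                             # n x d_model

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

n, d_model, h = 5, 8, 2
d_k = d_model // h
E = rand(n, d_model)
heads = [(rand(d_model, d_k), rand(d_model, d_k), rand(d_model, d_k)) for _ in range(h)]
W_O = rand(d_model, d_model)

out = multi_head_attention(E, heads, W_O)
print(len(out), len(out[0]))  # 5 8
```

Changing `h` (to any divisor of `d_model`) changes `d_k` but leaves the printed output shape untouched.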

---

### 6.4 Intuition

* Think of each head as a **specialized lens** focusing on a particular type of relationship in the sequence.
    
* By combining multiple lenses, the model develops a **multi-faceted understanding** of the input.
    
* Multi-head attention is therefore a powerful way to **increase model expressiveness without increasing sequence length or token dimensions**.
    

## 7\. Layer Normalization

After multi-head attention, each token has a new contextual representation. Before passing it through the next sublayer (like the feedforward network), it is important to **stabilize and normalize** these representations. This is where **Layer Normalization (LayerNorm)** comes in.

---

### 7.1 Why Not Batch Normalization?

Batch Normalization works by normalizing across the **batch dimension**. While this is effective for images and other fixed-size inputs, it has two main issues for sequences:

* Sequences can have **different lengths**. Padding tokens introduce noise if normalized across the batch.
    
* Each token should maintain **independence**; batch statistics mix token information across samples, which is undesirable for attention-based models.
    

LayerNorm solves both problems by normalizing **across features for each token individually**, not across the batch.

---

### 7.2 How LayerNorm Works

For a token representation `x ∈ ℝ^d_model`:

1. Compute the mean and variance across features:
    

```elixir
μ = (1/d_model) * Σ x_i
σ² = (1/d_model) * Σ (x_i - μ)²
```

2. Normalize and scale:
    

```elixir
LN(x) = γ * (x - μ) / sqrt(σ² + ε) + β
```

* γ and β are learnable parameters (scale and shift)
    
* ε is a small constant for numerical stability
    

The output has the **same shape as the input** (`d_model`), but features are normalized, which stabilizes training and improves convergence.
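
A minimal per-token implementation of the two formulas above might look like this. The default `eps=1e-5` is an assumed value, and `gamma`/`beta` are set to 1 and 0 to show the pure normalization.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector across its features, then scale and shift."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(y)  # roughly zero mean and unit variance
```

Only the features of this one token enter the statistics, so other tokens and other samples in the batch have no effect on the result.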

---

### 7.3 Intuition

* LayerNorm ensures that **each token’s vector has a consistent scale**, preventing some features from dominating attention or the feedforward network.
    
* Normalization is done **per token**, so padding or variable-length sequences do not affect other tokens.
    
* Combined with **residual connections**, LayerNorm helps deeper networks train effectively by keeping activations and gradients in a stable range.
    

Check out this video for a better understanding of why LayerNorm is used rather than BatchNorm in sequential contexts. (The video is in Hindi, but should be easy to follow.)

%[https://www.youtube.com/watch?v=qti0QPdaelg] 

---

### 7.4 Position in the Transformer

* LayerNorm is applied **after the residual connection** in each sublayer:
    

```elixir
Output = LayerNorm(x + Sublayer(x))
```

* This structure is repeated for both **multi-head attention** and **feedforward sublayers**, keeping token-wise representations stable throughout the stack.
    

## 8\. Feedforward Fully Connected Network

After each token passes through multi-head attention, the Transformer applies a **position-wise feedforward network (FFN)**. Unlike attention, which mixes information across tokens, the FFN operates **independently on each token**, enriching its representation with nonlinear transformations.

---

### 8.1 Structure of the Feedforward Network

For a token vector `x ∈ ℝ^d_model`, the FFN consists of **two linear layers with a ReLU activation** in between:

```elixir
FFN(x) = max(0, x · W1 + b1) · W2 + b2
```

* W1 ∈ ℝ^(d\_model × 4\*d\_model)
    
* W2 ∈ ℝ^(4\*d\_model × d\_model)
    
* b1 ∈ ℝ^(4\*d\_model), b2 ∈ ℝ^(d\_model) (bias vectors)
    

Key points:

* The hidden layer expands the dimension to **4×d\_model**, allowing the network to model more complex relationships.
    
* The final layer projects back to **d\_model** to match the residual connection.
    

---

### 8.2 Role and Intuition

* **Per-token reasoning**: Each token can combine features in nonlinear ways without affecting other tokens.
    
* **Higher-dimensional context**: Expanding the dimension allows the model to create richer transformations and interactions within the token vector.
    
* **Complement to attention**: While attention captures **relationships between tokens**, the FFN processes **features within a token**, adding expressivity.
    

Think of it as giving each token its own “neural mini-network” to refine its meaning after gathering context from attention.
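
Here is a small sketch of that "mini-network" in plain Python. The weights are random placeholders, the biases are zero, and `d_model = 4` is a toy size; the point is that the same function is applied to each token separately.

```python
import random

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward network applied to a single token vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]           # d_model -> 4 * d_model, then ReLU
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]             # 4 * d_model -> d_model

rng = random.Random(0)
d_model = 4
d_ff = 4 * d_model
W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

tokens = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
outputs = [ffn(t, W1, b1, W2, b2) for t in tokens]   # shared weights, one token at a time
print(len(outputs), len(outputs[0]))  # 3 4
```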

---

### 8.3 Dimensional Flow

1. Input to FFN: `x` → shape n × d\_model
    
2. First linear layer + ReLU: → n × 4\*d\_model
    
3. Second linear layer: → n × d\_model
    
4. Because the output shape is again **n × d\_model**, it can be added to the FFN input through the residual connection, and multiple layers can be stacked.
    

## 9\. Encoder Architecture

The Transformer encoder is a **stack of identical layers**, each designed to process the entire input sequence in parallel while capturing both **token relationships** and **per-token transformations**.

---

### 9.1 The Encoder Block

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1759737654555/f92048e1-3dda-4979-a3dc-856ee091ba70.png align="center")

Each encoder layer consists of the following components:

1. **Multi-Head Self-Attention (MHA)**
    
    * Allows each token to attend to every other token in the sequence.
        
    * Captures global relationships. Attention itself is independent of token order; positional information comes from the positional encodings added to the embeddings.
        
2. **Residual Connection + Layer Normalization**
    
    * The input to the attention sublayer is added to its output:
        
        ```elixir
        x1 = LayerNorm(x + MHA(x))
        ```
        
    * Stabilizes gradients and preserves the original token information.
        
3. **Feedforward Fully Connected Network (FFN)**
    
    * Processes each token independently through two linear layers with ReLU, expanding and compressing dimensions:
        
        ```elixir
        x2 = LayerNorm(x1 + FFN(x1))
        ```
        

* Each encoder block maintains the input/output shape: **n × d\_model**, allowing multiple layers to be stacked without changing dimensionality.
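
The wiring of one block can be sketched independently of the sublayers themselves. In the example below, `toy_mha` and `toy_ffn` are hypothetical stand-ins (uniform averaging and a bare ReLU) that only preserve shapes, and LayerNorm uses γ = 1, β = 0 for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm over one token vector (gamma = 1, beta = 0 for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by LayerNorm, applied token by token."""
    return [layer_norm([a + b for a, b in zip(x_t, s_t)])
            for x_t, s_t in zip(x, sublayer_out)]

def encoder_block(x, mha, ffn):
    """x: n x d_model. mha maps the whole sequence; ffn maps a single token."""
    x1 = add_and_norm(x, mha(x))
    x2 = add_and_norm(x1, [ffn(t) for t in x1])
    return x2

# Placeholder sublayers (hypothetical): they only keep the n x d_model shape.
def toy_mha(x):
    n = len(x)
    mean_token = [sum(col) / n for col in zip(*x)]   # uniform attention over all tokens
    return [mean_token[:] for _ in range(n)]

def toy_ffn(t):
    return [max(0.0, v) for v in t]

x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, -2.0, 1.0]]
for _ in range(2):          # stack two blocks; the shape is unchanged each time
    x = encoder_block(x, toy_mha, toy_ffn)
print(len(x), len(x[0]))    # 3 4
```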
    

---

### 9.2 Stacking Layers

* The Transformer encoder consists of **N identical layers** stacked on top of each other.
    
* Each layer refines the token representations by alternating between:
    
    * **Global attention** (multi-head)
        
    * **Local transformation** (feedforward network)
        
* This combination ensures that after several layers, each token has a **rich, context-aware representation** that incorporates both **relationships to all other tokens** and **complex feature transformations**.
    

---

### 9.3 Intuition

* Think of the encoder as a **deep contextualizer**:
    
    * Multi-head attention gathers relevant information from the sequence.
        
    * FFN processes the token’s own features.
        
    * LayerNorm + residuals keep the flow stable.
        
* Stacking N layers allows the model to **refine both global and local representations** repeatedly, increasing expressiveness without changing the sequence length or token dimension.
    

## 10\. Decoder Architecture

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1759737689342/f0ec3aee-9853-4c4b-912f-f2fb2dc3786d.png align="center")

The Transformer decoder is responsible for **generating output sequences**, such as translated text. It combines **self-attention**, **cross-attention**, and **feedforward networks**, while respecting the **causal order** of generation.

---

### 10.1 Masked Multi-Head Self-Attention

* In the decoder, each token can **only attend to previous tokens** and itself.
    
* This ensures **autoregressive generation**: future tokens are not seen during training or inference.
    
* Implemented by **masking the upper triangle** of the attention score matrix (every entry where the key position is later than the query position) before the softmax:
    

```elixir
Scores_masked = Q · Kᵀ / √d_k
Scores_masked[future_positions] = -∞
Weights = softmax(Scores_masked)
Output = Weights · V
```

* The mask prevents information leakage from future tokens, enforcing causality.
    

**Note:** this is the manual manipulation of the attention scores mentioned earlier in the self-attention section.
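
A small runnable sketch shows the effect of the mask. All scores are set to zero here (an artificial choice) so that the only thing shaping the weights is the mask itself.

```python
import math

def causal_mask(scores):
    """Set every score above the diagonal (key position > query position) to -inf."""
    return [[s if j <= i else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # equal scores, to make the mask visible
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print(row)
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
```

The first token can only see itself, the second splits its attention over two positions, and only the last token sees the full sequence.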

---

### 10.2 Cross-Attention with Encoder Outputs

* After masked self-attention, the decoder performs **cross-attention**:
    
    * Queries (Q) come from the decoder’s previous layer outputs
        
    * Keys (K) and Values (V) come from the encoder’s final outputs
        

```elixir
CrossAttention(Q_dec, K_enc, V_enc)
```

* This allows the decoder to **condition its generation** on the input sequence.
    
* Intuitively, the decoder “looks at” the encoder’s representation to decide which information is relevant for generating the next token.
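
The main difference from self-attention is in the shapes: the weight matrix is no longer square. The sketch below assumes 2 decoder positions and 4 encoder positions, with made-up values and no learned projections.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q_dec, K_enc, V_enc):
    """Queries from the decoder (n_tgt x d_k); keys and values from the encoder (n_src x d_k)."""
    d_k = len(K_enc[0])
    weights = [softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
                        for k_row in K_enc]) for q_row in Q_dec]      # n_tgt x n_src
    output = [[sum(w * v[j] for w, v in zip(w_row, V_enc))
               for j in range(len(V_enc[0]))] for w_row in weights]   # n_tgt x d_k
    return output, weights

# Toy example: 2 decoder positions attending over 4 encoder positions.
Q_dec = [[1.0, 0.0], [0.0, 1.0]]
K_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
V_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
output, weights = cross_attention(Q_dec, K_enc, V_enc)
print(len(weights), len(weights[0]))  # 2 4
print(len(output), len(output[0]))    # 2 2
```

No causal mask is applied here, since every decoder position is allowed to see the entire input sequence.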
    

---

### 10.3 Feedforward Network and Residuals

* Similar to the encoder, each decoder block contains a **position-wise FFN** with ReLU:
    

```elixir
Output = LayerNorm(Input + FFN(Input))
```

* Residual connections and layer normalization stabilize training and maintain the token dimension `d_model`.
    

---

### 10.4 Overall Decoder Block Flow

1. **Masked Multi-Head Self-Attention** → Add & Norm
    
2. **Cross Multi-Head Attention** (with encoder outputs) → Add & Norm
    
3. **Feedforward Network** → Add & Norm
    

* Each decoder layer maintains the input/output shape: **n × d\_model** (with n the length of the target sequence), allowing stacking of N layers.
    
* The decoder can now generate sequences **autoregressively**, using attention to both past outputs and the encoder’s representation.
    

---

### 10.5 Intuition

* Masked self-attention ensures **future tokens do not influence current predictions**
    
* Cross-attention allows the model to **condition on the input sequence**
    
* Feedforward networks provide **local per-token reasoning**, just like in the encoder
    
* Together, these components allow the decoder to generate fluent, contextually correct sequences **one token at a time**
    

## 11\. Training vs Inference

Transformers behave differently during **training** and **inference**, and understanding this distinction is key to grasping how they generate sequences efficiently.

---

### 11.1 Training

* During training, the **entire target sequence is available** at once.
    
* Masking ensures **causal behavior**: each token can only attend to previous tokens, preventing information leakage from the future.
    
* The main advantages of training in parallel:
    
    * **Fully parallelizable**: all tokens in the sequence are processed simultaneously, leveraging GPU acceleration
        
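    * **Short gradient paths**: any two positions are connected through a single attention step, so information from distant tokens does not have to survive a long chain of recurrent updates as in RNNs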
        
    * **Faster convergence**: context is learned for all tokens in one forward pass
        
* Loss is computed for all tokens simultaneously, usually using **cross-entropy** between predicted and actual next-token distributions.
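
As a rough illustration of that loss, the sketch below averages the next-token cross-entropy over all positions at once. The `logits` values and the 4-token vocabulary are invented; in a real model the logits would come from the decoder's final linear layer.

```python
import math

def cross_entropy(logits, targets):
    """Mean next-token cross-entropy. logits: n x vocab_size; targets: n token ids."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]      # equals -log softmax(row)[target]
    return total / len(targets)

# Toy example: 3 positions, vocabulary of 4 tokens (values are made up).
logits = [[2.0, 0.5, 0.1, 0.0],
          [0.0, 3.0, 0.2, 0.1],
          [0.0, 0.0, 0.0, 0.0]]
targets = [0, 1, 3]
print(cross_entropy(logits, targets))
```

A position with completely uniform logits, like the third row, contributes log(vocab_size) to the sum, which is the loss of a model that has learned nothing.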
    

---

### 11.2 Inference

* During inference, sequences are generated **token by token** (autoregressively).
    
* For each new token:
    
    1. The decoder attends to **all previously generated tokens** using masked self-attention
        
    2. The decoder attends to **encoder outputs** via cross-attention
        
    3. The next token is predicted based on the output distribution
        
* This process repeats until an **end-of-sequence token** is produced.
    
* Key point: **generation is sequential**, but the underlying attention mechanism still allows each token to consider **all past context efficiently**.
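
The loop itself is short. In this sketch, `next_token_logits` is a hypothetical callable standing in for the full decoder (in a real model it would also use the encoder outputs), and greedy selection is assumed for simplicity.

```python
def greedy_decode(next_token_logits, bos_id, eos_id, max_len=20):
    """Generate token ids one at a time, always picking the highest-scoring token."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        logits = next_token_logits(tokens)     # the model sees every token generated so far
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical stand-in for a trained decoder: it prefers "last token + 1".
def toy_model(tokens, vocab_size=5):
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

print(greedy_decode(toy_model, bos_id=0, eos_id=4))  # [0, 1, 2, 3, 4]
```

The `max_len` cap is a safety net for the case where the model never emits the end-of-sequence token.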
    

---

### 11.3 Intuition

* **Training**: “See everything at once, learn relationships in parallel.”
    
* **Inference**: “Predict one token at a time, using previous context.”
    

This separation explains why Transformers can **train extremely fast** compared to RNNs while still generating sequences **autoregressively** when needed.

## 12\. Key Insights & Closing Thoughts

The Transformer architecture, introduced in *“Attention Is All You Need”*, represents a paradigm shift in sequence modeling. Here are the core takeaways:

---

### 12.1 Key Insights

* **No Recurrence, No Convolution**: Unlike RNNs or CNNs, Transformers rely entirely on attention to model relationships between tokens.
    
* **Global Context via Self-Attention**: Each token can attend to all others in the sequence, enabling long-range dependencies in a single step.
    
* **Parallel Training**: Training is fully parallelizable, solving the sequential bottleneck of RNNs.
    
* **Separation of Concerns**:
    
    * **Attention** handles global, cross-token context
        
    * **Feedforward networks** handle per-token transformations and feature reasoning
        
* **LayerNorm + Residuals** stabilize deep architectures, which helps many stacked layers train without gradients vanishing.
    
* **Masked Decoding**: The causal mask hides future tokens, so the model can be trained on all positions in parallel and still generate autoregressively at inference.
    

---

### 12.2 Closing Thoughts

* Transformers have reshaped NLP and AI by providing a **scalable, interpretable, and highly expressive architecture**.
    
* The same attention mechanisms extend beyond text: **Vision Transformers, audio modeling, and even diffusion models** use similar principles.
    
* Intuitive takeaway: **“Attention is the language of relationships.”** Each token communicates with others, forming a rich, context-aware understanding of the sequence.
    

---

This post was put together with the help of this video:

%[https://www.youtube.com/watch?v=bCz4OMemCcA&t=2500s] 

If you still have any queries, you can reach out to me on my [**LinkedIn**](https://www.linkedin.com/in/sbk2k1/) / [**GitHub**](https://github.com/sbk2k1) / [**Twitter**](https://twitter.com/sbk_2k1).

Cheers!