Why Transformers Can't Tell Position Apart — and How RoPE Fixes It
Self-attention is blind to order: shuffle the words in a sentence and the score between any two tokens stays exactly the same. Positional embeddings solve this, but the way they do it determines whether your model can handle long contexts at inference time.
Johannes Hayer
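Before getting into positional embeddings, it helps to see the claim above concretely. The following is a minimal NumPy sketch, not code from any real model: random token embeddings, toy query/key projections, and no positional information. Shuffling the sequence merely permutes the rows and columns of the score matrix, so every token pair keeps exactly the same score.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16

# Token embeddings with no positional information added
x = rng.normal(size=(seq_len, d_model))

# Toy query/key projection weights (stand-ins for learned parameters)
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))

def attention_scores(tokens):
    q, k = tokens @ w_q, tokens @ w_k
    return (q @ k.T) / np.sqrt(d_model)  # scaled dot-product scores

perm = rng.permutation(seq_len)          # shuffle the "sentence"
scores = attention_scores(x)
shuffled_scores = attention_scores(x[perm])

# The shuffled score matrix is just the original with rows and columns
# permuted: no pairwise score changed, so the model cannot tell the
# orderings apart.
assert np.allclose(shuffled_scores, scores[np.ix_(perm, perm)])
print("same pairwise attention scores after shuffling")
```

The assertion passes because, without positional encodings, each score depends only on the two tokens' content vectors, never on where they sit in the sequence.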