Why Transformers Can't Tell Position Apart — and How RoPE Fixes It
Self-attention is blind to order: shuffle the words in a sentence and the score between any two tokens stays exactly the same. Positional embeddings solve this, but the way they do it determines whether your model can handle long contexts at inference time.
Johannes Hayer
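Before getting into positional embeddings, it helps to see the claim above concretely. The following is a minimal NumPy sketch, not code from any real model: random token embeddings, toy query/key projections, and no positional information. Shuffling the sequence merely permutes the rows and columns of the score matrix, so every token pair keeps exactly the same score.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16

# Token embeddings with no positional information added
x = rng.normal(size=(seq_len, d_model))

# Toy query/key projection weights (stand-ins for learned parameters)
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))

def attention_scores(tokens):
    q, k = tokens @ w_q, tokens @ w_k
    return (q @ k.T) / np.sqrt(d_model)  # scaled dot-product scores

perm = rng.permutation(seq_len)          # shuffle the "sentence"
scores = attention_scores(x)
shuffled_scores = attention_scores(x[perm])

# The shuffled score matrix is just the original with rows and columns
# permuted: no pairwise score changed, so the model cannot tell the
# orderings apart.
assert np.allclose(shuffled_scores, scores[np.ix_(perm, perm)])
print("same pairwise attention scores after shuffling")
```

The assertion passes because, without positional encodings, each score depends only on the two tokens' content vectors, never on where they sit in the sequence.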