Published in 2017 by researchers at Google, this paper introduced the transformer architecture, which replaces recurrence with self-attention. Every modern LLM — GPT, Claude, Gemini, Llama — builds on this foundation. Reading it, even at a high level, gives you fundamental insight into why prompts work the way they do: attention patterns, positional encoding, and how models process context.
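To make the core idea concrete, here is a minimal sketch of the scaled dot-product attention the paper defines: softmax(QK^T / sqrt(d_k)) V. The function name and the random toy matrices are illustrative, not from the paper; real transformers also add learned projections, multiple heads, and positional encodings on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled to keep softmax stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax: each query gets a probability distribution over keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Output is a weighted average of the value vectors.
    return weights @ V

# Toy example: 4 token positions, model dimension 8 (arbitrary sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one output vector per query position
```

Each output row mixes information from every position at once, which is why transformers can process a whole context in parallel instead of token by token as RNNs do.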