This post is a walkthrough of how LLMs work. Modern LLMs are mostly built by stacking transformer blocks over and over, so understanding the transformer machinery gets you most of the way there. I’ll cover the core mechanisms inside modern transformer-based LLMs, without all that sticky math stuff. Don’t get me wrong, you should learn the math, but this can serve as an introduction.

Most modern LLMs share the same transformer-family skeleton. The differences come from what each one was trained on, the scale and configuration choices, and the post-training done on top. By the end, you should be able to read many modern LLM papers or model cards and know which piece of the architecture each section is talking about.
Here’s the path:
- Tokens, how a string of text becomes a sequence of integers
- Embeddings, how those integers get meaning
- Positional encoding, how the model knows what order the tokens came in
- Attention, how tokens share information with each other
- Multi-head attention, how the model tracks many kinds of relationships at once
- The feed-forward network, where a large share of the model’s stored structure lives
- The residual stream and layer normalization, what makes deep stacks trainable
- Predicting the next token, what the model actually outputs and how the generation loop works
- Architecture vs trained weights, what’s broadly shared across modern LLMs and what’s different