Synapse

AI in a Shell

    Card 1 / 11

    The Fundamental Problem: Computers Need Numbers

    Nina notices something odd. She asks her model to summarize two articles — one about a "river bank" and one about a "bank loan." The summaries are weirdly similar. The model seems to treat "bank" the same way in both.

    Why? Because computers don't understand words. They understand numbers. The word "bank" means something completely different next to "river" than it does next to "loan" — but to a machine, it's just a string of characters until someone figures out how to turn meaning into numbers.

    For decades, this was the central challenge of language AI. Every technique — from the earliest word counters to modern Transformers — is a progressively better answer to this one question.

    This isn't historical trivia. It's the story of why LLMs behave the way they do when you prompt them.

    Card 2 / 11

    Bag-of-Words: The Naive First Attempt

    The earliest approach was brutally simple: count the words.

    In the Bag-of-Words model, a document becomes a vector of word counts. "Bank" and "loan" together score high for finance. "Bank" and "river" score high for geography. The technique dominated early NLP — spam filters, search engines, and document classifiers all relied on it through the 2000s.

    But it has a fatal flaw: it destroys all context. Word order, grammar, and sentence structure vanish completely. "The dog bit the man" and "The man bit the dog" produce identical vectors — same words, same counts. The model has no idea that the meaning changed entirely.

    For classification tasks where topic matters more than nuance, this was good enough. But for anything requiring real understanding — summarization, translation, conversation — counting words is a dead end. Something needed to capture not just which words appear, but what they mean.
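    The "same words, same counts" failure is easy to see in a few lines. A minimal sketch of a Bag-of-Words vectorizer (a toy with a hand-picked vocabulary, not a production implementation):

```python
from collections import Counter

def bow_vector(text, vocab):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

vocab = ["the", "dog", "bit", "man"]
a = bow_vector("The dog bit the man", vocab)
b = bow_vector("The man bit the dog", vocab)

print(a)       # [2, 1, 1, 1]
print(b)       # [2, 1, 1, 1]
print(a == b)  # True: word order is gone, the two vectors are identical
```

    Both sentences collapse to the same point in vector space, which is exactly why a Bag-of-Words classifier cannot tell who bit whom.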

    Discuss 3 / 11
    Checkpoint 4 / 11

    A spam filter uses Bag-of-Words to classify emails. Which email would it struggle to classify correctly?

    Card 5 / 11

    From Counting to Meaning: Embeddings & Attention

    In 2013, Word2Vec introduced a breakthrough: position words in a vector space based on how they're used, so that words appearing in similar contexts land close together. The famous result: king − man + woman ≈ queen. Meaning encoded as math.
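    The analogy arithmetic can be sketched with two hand-picked dimensions (a toy illustration where axis 0 stands for "royalty" and axis 1 for "masculinity", not real Word2Vec output):

```python
# Hand-crafted 2-D "embeddings" chosen so the analogy works exactly.
emb = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
}

def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c."""
    target = [emb[a][i] - emb[b][i] + emb[c][i] for i in range(2)]
    def dist(w):
        return sum((emb[w][i] - target[i]) ** 2 for i in range(2))
    return min(emb, key=dist)

print(analogy("king", "man", "woman"))  # queen
```

    Real Word2Vec learns hundreds of such dimensions from co-occurrence statistics rather than having them hand-assigned, but the arithmetic is the same.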

    But Word2Vec gave every word one fixed vector. "Bank" got the same vector whether next to "river" or "loan." The model couldn't tell which meaning was active.

    The fix: attention — let words look at each other. When processing "bank" near "money," attention shifts the meaning toward finance. Near "river," toward geography. Same word, different vector depending on context.

    This was the key step between counting words and understanding language. You'll explore embeddings and attention in depth in the Tokens & Embeddings and Transformer Internals courses — for now, the important thing is that this solved the context problem and opened the door to the Transformer.

    Card 6 / 11

    From Sequential to Parallel: The Transformer

    Attention solved the context problem — but the models using it had a fatal bottleneck.

    Early attention was paired with Recurrent Neural Networks (RNNs) — models that process text one token at a time, left to right. To process token 50, you must first process tokens 1 through 49, in order. No amount of GPUs could speed up that sequential chain.
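    That sequential chain is visible in a minimal RNN loop (a sketch with a single hidden unit and made-up weights, not a trained model):

```python
import math

def rnn_final_state(tokens, w_h=0.5, w_x=1.0):
    """Run a one-unit RNN over a token sequence.

    Each hidden state depends on the previous one, so the loop is
    inherently sequential: step t cannot start until step t-1 has
    finished, no matter how many GPUs you have.
    """
    h = 0.0
    for x in tokens:                      # one token at a time, left to right
        h = math.tanh(w_h * h + w_x * x)  # needs the previous h to proceed
    return h

print(rnn_final_state([1.0, -1.0, 0.5]))
```

    The data dependency on `h` is the bottleneck: you can parallelize across sequences in a batch, but never across positions within one sequence.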

    In 2017, a Google team published "Attention Is All You Need". The claim: you don't need recurrence at all. Attention alone is sufficient.

    The Transformer processes all tokens simultaneously. Every token attends to every other token in the same forward pass. On GPU hardware, this parallelism translates directly into speed — making it feasible to train on the entire internet.
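    A minimal sketch of that parallel step, assuming for simplicity that queries, keys, and values are all the raw token vectors (a real Transformer learns separate projection matrices for each):

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d) array of token vectors. Every token attends to
    every other token in a single pair of matrix multiplies; there is
    no left-to-right loop anywhere.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # all-pairs similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax per row
    return weights @ X                              # weighted mix of all tokens

# Four toy tokens with 3-dimensional vectors.
X = np.random.default_rng(0).normal(size=(4, 3))
out = self_attention(X)
print(out.shape)  # (4, 3): all four tokens updated in one forward pass
```

    On a GPU those matrix multiplies run in parallel across every position, which is exactly the property the RNN loop lacked.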

    This single architectural change unlocked the scaling era. The Transformer is the foundation of every major LLM today: GPT, Claude, Gemini, LLaMA.

    Checkpoint 7 / 11

    Why did the Transformer architecture unlock a new era of language model scale, while RNNs + attention plateaued?

    Card 8 / 11

    Two Shapes: BERT vs GPT

    Not all Transformers are the same. The architecture has two main variants.

    Encoder-only models (BERT-style): Process text in both directions. Every token attends to tokens before AND after it — excellent at understanding text (classification, sentiment, search). But they can't generate new text.

    Decoder-only models (GPT-style): Process text left to right only. Each token only sees previous tokens — called causal attention. This restriction enables text generation: the model predicts each next token without "cheating" by looking ahead.
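    The "no looking ahead" restriction is just a lower-triangular mask over the attention scores. A sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: token i may attend only to tokens 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
# Row 2 (the third token) sees tokens 0-2 and never token 3.
# In practice, masked-out attention scores are set to -inf before
# the softmax, so future tokens receive exactly zero weight.
```

    Encoder-only models simply skip this mask, which is why every token can attend in both directions but the model can no longer generate text one token at a time.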

    Modern LLMs — GPT-4, Claude, LLaMA — are all decoder-only. When you call the Claude or GPT API and watch the response stream in, that's a decoder doing causal attention, generating one token at a time.

    Encoder models still matter: when a search engine understands your query semantically, an encoder is doing the matching. Many production systems combine both.

    Checkpoint 9 / 11

    Nina is building two features: a semantic search that finds relevant docs, and a chatbot that generates answers. Which Transformer variant should she use for each?

    Card 10 / 11

    The Arc Completes: Why LLMs Are What They Are

    Now the story connects.

    Decoder-only Transformers scaled remarkably well. More data, more parameters, and more compute brought emergent capabilities: arithmetic, code generation, reasoning. This is the pattern behind scaling laws: performance improves predictably as models grow, and at certain scales qualitatively new abilities appear.

    The race to scale these models produced what we now call LLMs.

    When an LLM hallucinates a policy that doesn't exist, that's lossy compression — 10TB squeezed into 140GB. When it streams responses word by word, that's causal attention predicting one token at a time. When it seems to understand your question, that's contextual embeddings attending to relevant parts of your prompt.

    Every behavior traces back to this arc: Bag-of-Words → Word2Vec → Attention → Transformer → Scale.

    Nina now had the history. But what IS an LLM, physically — and what are these "tokens" the whole system is built on?

    Discuss 11 / 11

    Think about an AI tool you use daily — can you identify a moment where it genuinely understands context (attention at work) versus when it just pattern-matches on surface keywords (echoes of Bag-of-Words)?