Why AI Can't Count the R's in Strawberry

The viral meme that exposed how LLMs actually process text—and what it reveals about every AI failure mode.

Ask ChatGPT how many R's are in "strawberry." It says two. There are three. The model isn't wrong because it's dumb. It's wrong because it has never seen the letter R. It's seen tokens that happen to contain R. That distinction explains half the failures you'll hit in production.

TL;DR

LLMs can't count letters because they've never seen them. Tokenization compresses text into chunks, hiding individual characters. The strawberry problem predicts every failure mode requiring sub-token access.

January 2026: Even with GPT-5 class models, the underlying tokenizer architecture remains unchanged. The physics haven't changed—only the masks.

The strawberry question went viral in August 2024 because it exposed something uncomfortable. These models that write poetry, debug code, and pass bar exams can't count letters in a ten-letter word. Not because they're dumb. Because they've never seen letters at all.

I've watched three production deployments fail on string manipulation tasks that passed the demo. The strawberry problem isn't theoretical. It's the same architecture bug that cost one team six weeks of debugging before they understood what they were actually asking the model to do.

The expectation that LLMs should count letters is reasonable. These models pass the bar exam, write functional code, and explain quantum physics. If they can do that, surely they can count to three? The logic is sound. The assumption is wrong. Letter-counting isn't a simpler version of bar-exam reasoning. It's a fundamentally different operation that the architecture cannot perform, no matter how capable it becomes at everything else.

What LLMs Actually See

When you type "strawberry," you see ten letters. The model sees something else entirely: tokens. Here's what GPT-4's tokenizer actually does:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")  # cl100k_base, the tokenizer behind GPT-4

# What you type
word = "strawberry"

# What the model sees
tokens = enc.encode(word)
print(f"Tokens: {tokens}")  # [496, 675, 15717]

# Decode each token to see the splits
for t in tokens:
    print(f"  {t} -> '{enc.decode([t])}'")
# Output:
#   496 -> 'str'
#   675 -> 'aw'
#   15717 -> 'berry'

What You See vs. What the Model Sees

  • Human view (characters): s t r a w b e r r y (10 letters, 3 R's visible)
  • Model view (tokens): [496] [675] [15717] (3 tokens, 0 R's visible)
  • Token decodes to: 'str' 'aw' 'berry' (R's hidden inside opaque IDs)

Three tokens. None of them contain an isolated "r." The model sees [496, 675, 15717]. It knows these tokens often appear together and represent a fruit. But it has no mechanism to decompose them back into individual characters. The R's are hidden inside opaque integer IDs.

The model isn't refusing to count. It can't count. Asking it to count R's is like asking someone to count the number of 2's in 1,247 if they can only see the whole number as a single symbol. The data isn't there.

Tokens are the atoms of LLM perception. The model was trained to predict the next token, not the next letter. It has no concept of individual characters living inside those tokens. This isn't a flaw. It's a tradeoff. And like all tradeoffs, someone pays.

Character-level tokenization would create impossibly long sequences. A 1,000-word document would become 5,000+ tokens, overwhelming the context window and training compute. Byte Pair Encoding (BPE) compresses text into manageable chunks. The cost? The model loses access to the characters themselves. Tokenization is a payday loan: you get speed and efficiency now, you pay in debugging costs when sub-token access matters.
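
You can see the tradeoff for yourself. The sketch below, assuming tiktoken is installed and using the same GPT-4 tokenizer as above, compares raw character counts against BPE token counts for a sample passage; the exact ratio depends on the text, but English prose often compresses around 4x.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# any English prose will do; repetition just keeps the example short
passage = "The quick brown fox jumps over the lazy dog. " * 20

char_count = len(passage)             # cost of hypothetical character-level tokens
bpe_count = len(enc.encode(passage))  # what BPE actually charges

print(f"Characters: {char_count}")
print(f"BPE tokens: {bpe_count}")
print(f"Compression: {char_count / bpe_count:.1f}x fewer tokens, zero letters visible")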

Why BPE Creates This Blindness

Byte Pair Encoding, the algorithm behind most modern tokenizers, works by iteratively merging the most common character pairs. Start with raw characters. Find the most frequent pair. Merge them into a new token. Repeat until you hit your vocabulary size.

The result is a vocabulary of tens of thousands of tokens (roughly 50,000 for GPT-2, about 100,000 for GPT-4) that efficiently compresses common words and subwords. "The" is one token. "Quantum" might be two. "Strawberry" becomes three tokens because that split appeared often in training data.

Once merged, the original characters are gone. The model sees integer IDs. It knows these IDs often appear together. It might know they represent a fruit. But it has no mechanism to decompose them back into S-T-R-A-W-B-E-R-R-Y and count the R's.
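
Here is a deliberately simplified sketch of that merge loop. Real tokenizers pre-split on whitespace, learn from enormous corpora, and run tens of thousands of merges; this toy learns a handful of merges from one tiny string, but the principle is the same: fuse the most frequent adjacent pair and repeat.

from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy BPE: start from characters, repeatedly fuse the most frequent adjacent pair."""
    symbols = list(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        merges.append((a, b, count))
        # rewrite the corpus with the fused symbol
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return merges, symbols

merges, segmented = learn_bpe("strawberry straw berry strawberries ", 8)
for a, b, count in merges:
    print(f"merge {a!r} + {b!r} (seen {count} times)")
print(segmented)  # chunks, not letters: this is all a model ever sees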

It's pattern-matching on what "counting R's in strawberry" answers usually look like in training data. And apparently, a lot of that training data said two. Every confident wrong answer is the model agreeing with itself about what sounds right.

The Strawberry Test Reveals More Than Spelling

What does this blindness predict about other failures?

  • Arithmetic on long numbers. 847,293 × 156 gets tokenized arbitrarily. The model can't access individual digits reliably. It's pattern-matching on what multiplication answers look like (see the sketch after this list).
  • Anagrams and word games. Unscrambling letters requires character-level access the model doesn't have.
  • Precise string manipulation. "Reverse this string" fails on anything non-trivial because reversal requires character access.
  • Non-English text. Languages with different character densities get tokenized inefficiently. A Chinese sentence might become 3x more tokens than equivalent English, burning context window and degrading performance.
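
The arithmetic case is easy to verify. This sketch, again assuming tiktoken and the GPT-4 tokenizer, shows a long number being chopped into frequency-driven chunks rather than digits; the exact splits depend on the tokenizer, which is precisely the problem.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for text in ["847293", "847,293", "847293 * 156"]:
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(f"{text!r} -> {pieces}")
# Digit groupings follow token frequency, not place value,
# so the model never reliably sees individual digits to carry or multiply.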

The strawberry problem isn't isolated. It's a symptom. Every task requiring sub-token access will exhibit similar failure modes. This is why AI coding assistants struggle with certain refactoring tasks that seem trivial to humans.

Why Chain-of-Thought Helps (Sometimes)

If you ask the model to spell out "strawberry" letter by letter first, then count, it often succeeds. Why?

By writing out S-T-R-A-W-B-E-R-R-Y, the model creates new tokens in its context that represent individual letters it can now "see." The counting task becomes possible because the letters exist as separate tokens in the prompt.

This is chain-of-thought reasoning working around architectural limitations. The model doesn't suddenly gain character access. It generates character-level output that it can then process. It's teaching itself to see what it couldn't see before.

But this workaround is fragile. It requires the user to know the trick. It consumes extra tokens. And it fails when the model makes spelling errors during the decomposition step, which happens more often with unusual words.
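
You can watch the workaround operate at the token level. Hyphen-separating the letters forces the tokenizer to emit many small pieces instead of three opaque chunks, which is what gives the model something letter-like to count. A quick check, assuming tiktoken:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for text in ("strawberry", "s-t-r-a-w-b-e-r-r-y"):
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(f"{text!r} -> {len(pieces)} tokens: {pieces}")
# The spelled-out form splits into far more tokens, most of them
# single characters the model can actually attend to.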

The Irony of "Project Strawberry"

OpenAI internally codenamed their advanced reasoning model "Strawberry," reportedly as an inside joke about finally solving this problem. When o1-preview was released in September 2024, it did answer the R-counting question correctly. The marketing practically wrote itself.

But the fix wasn't better tokenization. It was better training on chain-of-thought reasoning, plus (likely) specific tuning on this exact failure mode that had become embarrassingly public. The underlying architecture still can't see letters natively. It's just been trained to work around its own blindness more reliably.

The strawberry problem wasn't solved. It was patched. OpenAI named their reasoning model after a bug they couldn't fix. That's not confidence. That's marketing turning a fundamental limitation into a punchline. The next viral failure mode will require another patch.

The Tokenization Test

Before trusting any LLM for text processing, try this:

  1. Generate a random 20-character alphanumeric string
  2. Ask the model to reverse it
  3. Verify character-by-character

If it fails, and most will, you've found the boundary. Everything on the other side of that boundary is pattern matching dressed as precision. This is physics. The model literally cannot see the data you're asking about. You can't negotiate with architecture.
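
Here's a minimal harness for the test, with the model call left as a manual step so no particular API is assumed. Paste the model's reply in as model_answer and let Python be the judge:

import random
import string

def make_challenge(length=20):
    """Random alphanumeric string the model almost certainly never saw in training."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

def check_reversal(original, model_answer):
    """Character-by-character comparison against the true reversal."""
    expected = original[::-1]
    diffs = [(i, e, g) for i, (e, g) in enumerate(zip(expected, model_answer)) if e != g]
    ok = (len(model_answer) == len(expected)) and not diffs
    return ok, diffs

challenge = make_challenge()
print(f"Prompt: Reverse this string exactly, no commentary: {challenge}")
# model_answer = "..."  # paste the model's output here
# ok, diffs = check_reversal(challenge, model_answer)
# print(ok, diffs)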

The Business Cost of Token Blindness

This isn't academic. Here's where tokenization bites in production:

  • Account number confusion. If your invoicing AI processes "Account 10" and "Account 100," the tokenizer might represent both similarly. One digit difference, potentially $90,000 in misdirected funds.
  • Serial number validation. "SN-A1B2C3" and "SN-A1B2C4" look nearly identical to a tokenizer. Your inventory system just shipped the wrong part.
  • Medical dosage parsing. "10mg" vs "100mg" is a tokenization boundary. In healthcare, that's a liability lawsuit.
  • Code variable names. userCount and userCounts tokenize almost identically, differing only in the last token or two. Your AI just introduced a bug it can't see. And since AI tools have no institutional memory, it'll introduce the same bug next sprint.

Every one of these failures passes the demo. The model sounds confident. The output looks plausible. The bug only surfaces in production, when real money or real consequences are on the line.
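
A quick way to see how thin the margin is: encode each pair and count how few token positions actually differ. A sketch assuming tiktoken; the exact IDs vary by tokenizer, but the pattern is typical: a long shared prefix with the difference squeezed into the final token or two.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

pairs = [("Account 10", "Account 100"),
         ("SN-A1B2C3", "SN-A1B2C4"),
         ("userCount", "userCounts")]

for a, b in pairs:
    ta, tb = enc.encode(a), enc.encode(b)
    same = sum(1 for x, y in zip(ta, tb) if x == y)
    print(f"{a!r} -> {ta}")
    print(f"{b!r} -> {tb}")
    print(f"  matching token positions: {same} of {max(len(ta), len(tb))}\n")
# The character the business cares about is buried in one token ID
# among several that the model has every statistical reason to blur together.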

What This Means for Production Systems

If you're building on LLMs, the strawberry problem should inform your architecture. This is the same architectural blind spot that makes LLMs dangerous in unexpected ways—they pattern-match on training data, not on ground truth. As Simon Willison notes, many LLM limitations trace back to tokenization decisions:

  • Don't trust character-level operations. Spelling checks, exact string matching, character counting: validate these externally (see the sketch after this list).
  • Don't trust arithmetic on large numbers. Use code execution or calculators. The model is guessing, not computing.
  • Test edge cases that require sub-token reasoning. If your use case involves any form of precise text manipulation, the model will fail in ways you won't predict.
  • Remember: fluent doesn't mean correct. The model that can't count R's will explain its wrong answer with perfect confidence. Vendor accuracy claims don't account for these failure modes.
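
For the character-level checks above, external validation can be trivially cheap. A sketch of the idea: compute the ground truth in ordinary code and treat the model's claim as untrusted input.

def count_letter(text, letter):
    """Ground truth the model never sees: count characters, not tokens."""
    return text.lower().count(letter.lower())

def validate_claim(text, letter, claimed_count):
    actual = count_letter(text, letter)
    return claimed_count == actual, actual

ok, actual = validate_claim("strawberry", "r", 2)  # the viral wrong answer
print(ok, actual)  # False 3 -- trust the string library, not the token stream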

The same pattern-matching that produces remarkably useful outputs also produces remarkably confident wrong answers. The strawberry test just happens to be obvious enough that humans notice immediately.

When This Won't Matter

For most LLM use cases, tokenization blindness is irrelevant:

  • Summarization doesn't require character-level access.
  • Translation works at the semantic level.
  • Code generation operates on syntax patterns, not character manipulation.
  • Creative writing benefits from token-level fluency.

The strawberry problem matters when you cross from semantic tasks to syntactic precision. Most users never cross that line. The ones who do often don't realize they've crossed it until something breaks.

The Bottom Line

LLMs have never seen the letter R. They've seen tokens that happen to contain R, but the R itself is invisible. Asking them to count letters is asking them to reason about data they can't access.

The strawberry problem isn't about spelling. It's about the gap between what these systems appear to understand and what they actually process. Pattern matching at scale produces outputs that look like reasoning. But the moment you need actual reasoning, real access to the underlying data, the illusion breaks.

Every confident explanation of a wrong answer is this same gap. The model matches patterns for "how an explanation should sound." Soundness isn't the optimization target. Plausibility is. Sometimes those align. With strawberry, they didn't. Understanding this predicts where the demo-to-production gap will bite you hardest.

"LLMs have never seen the letter R. They've seen tokens that happen to contain R, but the R itself is invisible."

AI Architecture Review

Before your LLM goes to production, let me find the strawberry problems hiding in your pipeline.

Beat the Benchmark Gap?

If your production accuracy actually matches what the vendor promised, tell me your secret.

Send a Reply →