
Tokenization


How AI models break text into smaller pieces (tokens) for processing.

Simple Definition

Tokenization is how AI models break text into smaller pieces called "tokens" before processing. A token is typically 3-4 characters—roughly ¾ of a word. Understanding tokens helps you estimate costs and manage context limits.

Technical Definition

AI models don't read text character-by-character or word-by-word. They use subword tokenization, splitting text into frequent chunks that may be whole words, word fragments, or punctuation:

| Text | Token count | Notes |
| --- | --- | --- |
| "Hello" | 1 token | Common word |
| "tokenization" | 3 tokens | "token" + "iz" + "ation" |
| "GPT" | 1 token | Common abbreviation |
| "supercalifragilistic" | 7 tokens | Rare word, many subwords |

Rules of thumb:

  • 1 token ≈ 4 characters ≈ ¾ word
  • 100 tokens ≈ 75 words
  • 1,000 tokens ≈ 750 words ≈ 1-2 pages
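
These ratios are only approximations; exact counts depend on the model's tokenizer. Here is a minimal sketch of counting tokens directly, assuming the tiktoken library and its cl100k_base encoding (used by several OpenAI models); other tokenizers will produce slightly different splits than the table above.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one common encoding; other models use different ones,
# so counts may differ slightly from the table above.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello", "tokenization", "GPT", "supercalifragilistic"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```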

Why Tokens Matter

Cost: AI pricing is per-token. More tokens = higher cost.

Context limits: Models have token limits (e.g., 128K tokens). Your prompt + response must fit within this limit.

Code is token-expensive: Whitespace, brackets, and boilerplate add up. A 100-line file might be 500-1000 tokens.
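
When an exact count isn't needed, the 1-token-per-4-characters rule is enough for budgeting. A rough sketch, assuming a 128K-token context window and a hypothetical price of $3 per million input tokens (substitute your provider's real numbers):

```python
# Back-of-the-envelope estimate using the ~4 characters per token rule.
CONTEXT_LIMIT = 128_000      # assumed context window, in tokens
PRICE_PER_MTOK = 3.00        # hypothetical $ per 1M input tokens

def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 1 token per 4 characters."""
    return max(1, len(text) // 4)

# Estimate this script itself; swap in any prompt or source file.
with open(__file__, encoding="utf-8") as f:
    source = f.read()

tokens = estimate_tokens(source)
print(f"~{tokens} tokens")
print(f"Estimated input cost: ${tokens / 1_000_000 * PRICE_PER_MTOK:.4f}")
print(f"Fits in a {CONTEXT_LIMIT:,}-token context: {tokens <= CONTEXT_LIMIT}")
```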

Key Takeaways

  • Tokens are the units AI models use to process text
  • 1 token ≈ 4 characters ≈ ¾ word
  • Tokens affect cost and context limits
  • Code is more token-expensive than prose

