
Token Analysis (AI/LLM)

Benchmark Results

Tested with the GPT-4o tokenizer (via tiktoken), comparing the same Fibonacci program written in each language:

| Language   | Tokens |
|------------|--------|
| Python     | 54     |
| JavaScript | 69     |
| Han        | 88     |

Why Korean Uses More Tokens

LLM tokenizers use BPE (Byte Pair Encoding):

  1. Start with raw bytes
  2. Find the most frequent byte pairs in training data
  3. Merge them into single tokens
  4. Repeat
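The loop above can be sketched in a few lines of plain Python. This is an illustrative toy trainer, not a production tokenizer (real BPE implementations handle vocabularies, pre-tokenization, and tie-breaking differently):

```python
from collections import Counter

def bpe_train(data: bytes, num_merges: int):
    """Toy byte-level BPE: repeatedly merge the most frequent adjacent pair."""
    ids = list(data)          # step 1: start with raw bytes
    merges = {}               # (left, right) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))   # step 2: count adjacent pairs
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)     # most frequent pair wins
        merges[pair] = next_id               # step 3: merge it into one token
        new_ids, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                new_ids.append(next_id)
                i += 2
            else:
                new_ids.append(ids[i])
                i += 1
        ids, next_id = new_ids, next_id + 1  # step 4: repeat
    return ids, merges

# 26 raw bytes collapse to 3 tokens after 8 merges, because the
# frequent byte pairs inside "function" get merged first.
tokens, merges = bpe_train(b"function function function", 8)
print(len(tokens))  # 3
```

Even this toy version shows the key dynamic: whatever strings dominate the training data get merged earliest and end up as the shortest token sequences.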

Since training data is predominantly English:

  • function → appears billions of times → merged into 1 token
  • 함수 → rarely appears → stays as 2-3 byte-level tokens
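Part of why the fallback is so costly: in UTF-8, every Hangul syllable is 3 bytes, so an unmerged keyword starts life as several raw bytes. A quick check (plain Python, no tokenizer required):

```python
# "function" is 8 ASCII bytes, yet BPE has merged it into 1 learned token.
# "함수" is 6 UTF-8 bytes (3 per Hangul syllable); without a learned merge
# it falls back to 2-3 byte-level tokens.
print(len("function".encode("utf-8")))  # 8
print(len("함수".encode("utf-8")))       # 6
```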

Per-Keyword Comparison

| Han    | Tokens | English  | Tokens |
|--------|--------|----------|--------|
| 함수   | 2      | function | 1      |
| 반환   | 2      | return   | 1      |
| 변수   | 2      | let      | 1      |
| 아니면 | 3      | else     | 1      |
| 멈춰   | 3      | break    | 1      |
| 동안   | 1      | while    | 1      |
| 참     | 1      | true     | 1      |

This Is a Tokenizer Problem, Not a Korean Problem

If BPE were trained on a Korean-heavy corpus, 함수 could be a single token. The inefficiency comes from training data distribution, not from the script itself.

Relevant work:

  • The Ukrainian LLM "Lapa" replaced 80K tokens in its vocabulary and achieved 1.5x tokenization efficiency on Ukrainian text
  • Custom BPE training on Korean programming text could close the gap
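A quick way to see why retraining would help: on a hypothetical Korean-heavy corpus (the mix below is invented purely for illustration), the most frequent byte pair is one of 함수's internal UTF-8 pairs, so BPE would merge it in its very first step:

```python
from collections import Counter

# Hypothetical training mix, heavily weighted toward Korean source text.
corpus = ("함수 " * 1000 + "function " * 10).encode("utf-8")

pair_counts = Counter(zip(corpus, corpus[1:]))
top_pair = max(pair_counts, key=pair_counts.get)

# The winning pair comes from inside 함수's UTF-8 bytes, so the first
# BPE merge already starts building 함수 toward a single token.
hamsu = "함수".encode("utf-8")
print(top_pair in set(zip(hamsu, hamsu[1:])))  # True
```

With the corpus distribution flipped, the same greedy merge rule that favors `function` today would favor 함수 instead, which is exactly the "training data distribution" point above.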