The Over-Tokenized Transformers result [2] points to a simple claim: input representation density matters. Huang et al. report a near log-linear relationship between input vocabulary size and training loss. By scaling the input side to multi-gram tokens (e.g., treating “New York City” as one token), a 400M-parameter model can reach the perplexity of a 1B-parameter baseline.
This result supports a more fundamental premise: local token composition carries a high-value signal. Language comes in clumps; neighboring tokens often form a single semantic unit with lower entropy than its parts. Static n-gram embeddings exploit this by storing a dedicated vector for each frequent clump. However, the approach hits practical limits.
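To make the idea concrete, here is a minimal sketch of a static n-gram embedding lookup. The toy vocabularies, table sizes, and the greedy longest-match rule are illustrative assumptions, not the specific scheme from [2]:

```python
import numpy as np

# Hypothetical vocabularies: frequent multi-token "clumps" get their own rows
# alongside the ordinary unigram vocabulary.
NGRAM_VOCAB = {
    ("New", "York", "City"): 0,
    ("machine", "learning"): 1,
}
UNIGRAM_VOCAB = {"New": 0, "York": 1, "City": 2, "machine": 3, "learning": 4, "is": 5, "fun": 6}

EMB_DIM = 8  # assumed embedding width for illustration
rng = np.random.default_rng(0)
ngram_table = rng.normal(size=(len(NGRAM_VOCAB), EMB_DIM))
unigram_table = rng.normal(size=(len(UNIGRAM_VOCAB), EMB_DIM))

def embed(tokens, max_n=3):
    """Greedy longest-match lookup: prefer a stored n-gram row, fall back to unigrams."""
    vectors, i = [], 0
    while i < len(tokens):
        vec, step = None, 1
        for n in range(min(max_n, len(tokens) - i), 1, -1):  # try the longest clump first
            key = tuple(tokens[i:i + n])
            if key in NGRAM_VOCAB:
                vec, step = ngram_table[NGRAM_VOCAB[key]], n
                break
        if vec is None:
            vec = unigram_table[UNIGRAM_VOCAB[tokens[i]]]
        vectors.append(vec)
        i += step
    return np.stack(vectors)

# "New York City" collapses into a single input vector.
print(embed(["New", "York", "City", "is", "fun"]).shape)  # (3, 8)
```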
Limit 1: Sparsity & Memory
An input table with 12 million entries can consume gigabytes of VRAM. Because language follows a Zipfian distribution, the vast majority of these entries sit idle in memory, rarely accessed but constantly costing resources.
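A back-of-the-envelope calculation shows why the memory cost bites; the embedding width and dtype here are assumed values, not figures from [2]:

```python
# Rough VRAM footprint of a static n-gram input table.
entries = 12_000_000      # n-gram rows in the input table
dim = 1024                # assumed embedding width
bytes_per_param = 2       # assumed fp16 storage

table_gb = entries * dim * bytes_per_param / 1e9
print(f"{table_gb:.1f} GB")  # ~24.6 GB for the table alone, before gradients and optimizer state
```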
Limit 2: Context Rigidity
A static embedding for “apple” cannot distinguish “Apple” the company from “apple” the fruit unless the table stores a discrete entry for every variant. Covering contextual nuance this way requires a combinatorial number of entries, and the table still fails on novel compositions it has never stored.
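A tiny sketch makes the rigidity concrete; the vocabulary, dimensions, and sentences below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy static vocabulary: one fixed vector per surface form.
vocab = {"I": 0, "ate": 1, "an": 2, "apple": 3, "pie": 4, "stock": 5, "rallied": 6}
table = rng.normal(size=(len(vocab), 8))

def lookup(tokens):
    return np.stack([table[vocab[t]] for t in tokens])

fruit_ctx = lookup(["I", "ate", "an", "apple", "pie"])
stock_ctx = lookup(["apple", "stock", "rallied"])

# The vector for "apple" is identical in both sentences: the lookup never sees
# the neighbors, so any sense distinction must be pre-baked as a separate row.
print(np.allclose(fruit_ctx[3], stock_ctx[0]))  # True
```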