ShortSWA Is the Next-Generation N-gram Embedding

Short Sliding Window Attention (ShortSWA) is a dynamic, parameter-efficient realization of the "Over-Encoding" framework.

Yifan Zhang
January 12, 2026
N-gram Scaling · Over-Encoding · Dynamic Embeddings · Efficiency

The Convergence

Work on fast sequence models has historically split into two distinct tracks. On one hand, we have the hardware-centric push towards optimizing local mixing primitives, as argued in our previous analysis, Rethinking SWA [1]. On the other, recent empirical studies, such as the Over-Encoding framework by Huang et al. (ICML 2025) [2], have demonstrated that massive n-gram vocabularies yield significant performance gains.

The Thesis: We argue that these two threads are converging. Short Sliding Window Attention (ShortSWA) is effectively a dynamic, parameter-efficient realization of the "Over-Encoding" framework. By shifting from static vocabulary lookups to dynamic, window-bounded attention, ShortSWA captures the benefits of n-gram scaling laws without the prohibitive memory footprint of expanding embedding tables.

The Signal in N-grams

The Over-Tokenized Transformers result [2] points to a simple claim: input representation density matters. Huang et al. report a near log-linear relationship between input vocabulary size and training loss. By scaling the input side to multi-gram tokens (e.g., treating “New York City” as one token), a 400M-parameter model can reach the perplexity of a 1B-parameter baseline.

This result supports a fundamental premise: local token composition carries a high-value signal. Language comes in clumps; neighboring tokens often form a single semantic unit with lower entropy than its parts. Static n-gram embeddings exploit this by storing vectors for frequent clumps. However, this approach hits practical limits.
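Before turning to those limits, here is a minimal sketch of what such a static lookup can look like; the table size, embedding dimension, and hashing scheme are illustrative assumptions, not the construction used in [2]:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HashedNgramEmbedding(nn.Module):
    """Static n-gram embedding: the trailing n-gram at each position is hashed
    into one slot of a large fixed table. Frequent clumps get useful vectors;
    rare n-grams mostly collide or sit idle."""
    def __init__(self, table_size=12_000_000, dim=1024, n=3):
        super().__init__()
        self.table = nn.Embedding(table_size, dim)  # the memory-hungry part
        self.table_size, self.n = table_size, n

    def forward(self, token_ids):                   # token_ids: (batch, seq) int64
        # Trailing n-gram ending at each position t: (x_{t-n+1}, ..., x_t).
        padded = F.pad(token_ids, (self.n - 1, 0))
        grams = padded.unfold(-1, self.n, 1)        # (batch, seq, n)
        # Toy rolling hash into the table; real systems pick frequent n-grams instead.
        primes = torch.tensor([1000003, 999983, 31], device=token_ids.device)
        idx = (grams * primes[: self.n]).sum(-1) % self.table_size
        return self.table(idx)                      # (batch, seq, dim)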

Limit 1: Sparsity & Memory

An input table with 12 million entries can consume gigabytes of VRAM. Because language follows a Zipfian distribution, the vast majority of these entries sit idle in memory, rarely accessed but constantly costing resources.
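To make that concrete, a rough back-of-envelope estimate, assuming (purely for illustration) 1024-dimensional embeddings stored in fp16:

$$ 12 \times 10^{6} \ \text{entries} \times 1024 \ \text{dims} \times 2 \ \text{bytes} \approx 24.6 \ \text{GB} $$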

Limit 2: Context Rigidity

A fixed token for “apple” cannot distinguish “Apple” the company from “apple” the fruit without discrete entries for every permutation. The table would need exponentially many entries to cover such nuance, and it still fails on novel compositions.

ShortSWA as an Adaptive N-gram Builder

ShortSWA [1] offers a different path. Although originally derived from hardware chunking (replacing fixed short convolutions with attention over a short window, e.g., $w=128$), under the Over-Encoding lens this move also takes on a semantic role.

Figure 1: Static Lookup vs. Dynamic Construction. ShortSWA builds the n-gram representation on the fly. (Over-Encoding maps the sequence “The quick brown fox” to a single fixed vector at index #4921; ShortSWA computes $h_t'$ by dynamic attention composition, letting “fox” pull signal from “The”, “quick”, “brown” based on context.)

A ShortSWA layer that attends within 128 tokens builds n-gram-like features dynamically. It does not look up a stored vector for “the quick brown fox”. It lets “fox” pull signal from “the”, “quick”, and “brown” using attention weights. The weights change with context, not with token frequency counts. We can write a rough equivalence:

$$ h_t' = \text{Attention}(h_t, h_{t-w:t}) \approx \text{Embedding}_{\text{ngram}}(x_{t-w:t}) $$

The left side builds a soft n-gram representation: it can represent many multi-grams up to length $w$, and it shifts with the surrounding sentence. The right side is a fixed table entry tied to one discrete n-gram.
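As a minimal sketch of the left-hand side, here is windowed attention with a banded causal mask, so each position attends only to the previous $w$ tokens. The learned projections $W_Q, W_K, W_V$ and the kernel-level chunking from [1] are deliberately left out so the window logic stays visible; the sizes are illustrative.

import torch
import torch.nn.functional as F

def short_swa_mix(h, w=128, num_heads=8):
    """Soft n-gram construction: position t attends to itself and the previous
    w - 1 positions, so the output at t is a context-dependent mix of h_{t-w+1..t}.
    Projections are omitted; this only illustrates the windowed mixing."""
    B, T, D = h.shape
    q = k = v = h.view(B, T, num_heads, D // num_heads).transpose(1, 2)  # (B, H, T, d)
    i = torch.arange(T, device=h.device)
    # True = "may attend": causal (j <= t) and inside the window (t - j < w).
    mask = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < w)    # (T, T)
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    return out.transpose(1, 2).reshape(B, T, D)

# h = torch.randn(2, 512, 256)
# h_prime = short_swa_mix(h)  # h_prime[:, t] is a soft n-gram over a 128-token window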

The Trade-off

This architectural shift has two concrete effects on model design and efficiency.

Parameter Cost

Over-Encoding: Adds an input embedding table that grows linearly with $|\mathcal{V}|$. Parameter count rises sharply with vocabulary size, often consuming VRAM that could otherwise be spent on depth.

ShortSWA: Mainly adds the projection matrices $W_Q, W_K, W_V$. The cost is fixed regardless of the effective "vocabulary" complexity.
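A back-of-envelope sketch of the two budgets; the table size, model dimension, and the inclusion of an output projection on the ShortSWA side are illustrative assumptions, not measured configurations:

def ngram_table_params(vocab_size, dim):
    # Static n-gram table: parameters grow linearly with the n-gram vocabulary.
    return vocab_size * dim

def shortswa_layer_params(dim):
    # One ShortSWA mixing layer: W_Q, W_K, W_V plus an (assumed) output projection,
    # independent of how large the effective "vocabulary" of windows is.
    return 4 * dim * dim

dim = 1024
print(f"{ngram_table_params(12_000_000, dim):,}")  # 12,288,000,000
print(f"{shortswa_layer_params(dim):,}")           # 4,194,304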

Hardware Fit

Over-Encoding: Large table lookups produce sparse, scattered memory access patterns.

ShortSWA: As noted in [1], attention over a chunk (e.g., 128 tokens) matches how the data already moves through the kernel: the chunk sits in fast on-chip memory (SRAM), so local attention becomes a dense “pre-encode” of the chunk before the global block.
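To illustrate why this access pattern is dense, a sketch under the assumption that the window is aligned with the chunk boundary; cross-chunk context and the actual kernel layout from [1] are ignored here:

import torch
import torch.nn.functional as F

def chunked_local_attention(h, w=128):
    """Dense 'pre-encode' of each contiguous chunk of w tokens. Every chunk is a
    contiguous (w, D) tile, so the work is a batched dense matmul over data that
    can stay resident in fast on-chip memory. Positions near a chunk start do not
    see the previous chunk in this simplified version."""
    B, T, D = h.shape
    assert T % w == 0, "sketch assumes the sequence length is a multiple of w"
    chunks = h.reshape(B * T // w, w, D)                 # contiguous (w, D) tiles
    causal = torch.tril(torch.ones(w, w, device=h.device)).bool()  # causal inside the chunk
    out = F.scaled_dot_product_attention(chunks, chunks, chunks, attn_mask=causal)
    return out.reshape(B, T, D)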

Conclusion

The claim “Vocabulary is worth scaling” [2] rests on a plain idea: local context matters, and dense local signals help later global mixing. Scaling a static vocabulary is a blunt tool for capturing that signal.

ShortSWA gives a cleaner mechanism. It forms soft n-grams up to window length $w$, and it adapts them to the actual context, so it captures the same signal without a huge table. While the window size $w$ remains a hyperparameter to tune, the structural advantage of dynamic composition over static lookup is clear.

References & Citation

  1. Rethinking SWA, Yifan Zhang, December 16, 2025.
  2. Over-Encoding, Hongzhi Huang et al., ICML 2025.
@article{zhang2026shortswa,
  title = {ShortSWA Is the Next-Generation N-gram Embedding},
  author = {Zhang, Yifan},
  journal = {yifanzhang-pro.github.io},
  year = {2026},
  month = {January},
  url = {https://github.com/yifanzhang-pro/ShortSWA-Ngram-Embedding}
}