Rethinking SWA

Why Short Sliding Window Attention Will Replace ShortConv in Modern Architectures

Yifan Zhang
December 16, 2025
Tags: Sequence Modeling · Hardware Efficiency · Mamba / SSM · Linear Attention · FlashAttention

The Core Claim

Across the rapid evolution of efficient sequence modeling, spanning modern RNNs, state space models (SSMs), linear attention, and even standard full attention, optimization efforts have largely focused on the global mixing mechanism. Yet an inconspicuous legacy artifact remains: ShortConv (short convolution).

The Argument: In an era where hardware-efficient algorithms (e.g., FlashAttention, SSD) mandate chunk-based computation, using convolutions with $k=2$ or $4$ is an architectural mismatch. Since we already split sequences into chunks (e.g., 64 or 128 tokens), we should replace ShortConv with Short Sliding Window Attention (ShortSWA).

This shift mirrors the transition in vision backbones where ViT and Swin styles replaced small, fixed-kernel convolutions. ShortSWA upgrades the local mixer from fixed, tiny receptive fields to data-dependent, chunk-aligned receptive fields, substantially improving local modeling capability without materially increasing cost.

Current Architecture: The Misalignment

Most modern efficient architectures (GLA, Mamba-2, RetNet variants) adopt a parallel structure. The ShortConv is applied strictly to the content/state branches ($Q, K, V$ or SSM inputs) to inject local bias, while the Gate branch typically bypasses this step to act as a multiplicative control.

Figure 1: The standard parallel pipeline. The input $u$ is split by a parallel linear projection into $Q/x$, $K/B$, and $V/C$ branches, each passed through a fixed $k=4$ ShortConv (the bottleneck) before the chunked global mixer (linear attention / SSM), while the Gate ($z$) takes only a SiLU activation and completely bypasses local mixing; the mixer output is multiplied by the gate to produce $y$.

The Standard Pipeline

  1. Parallel Linear Projection: Input $u$ splits into mixing branches ($Q, K, V$ or $x, B, C$) and a multiplicative Gate branch ($z$).
  2. Local Mixing (The Bottleneck): A fixed Conv1d(k=4) is applied to the mixing branches to enforce local smoothness. Note that the Gate branch bypasses this step.
  3. Global Mixing: The convolved branches enter the global operator (SSM or Linear Attn), computed in chunks.
  4. Gating: The output of the mixer is multiplied by the Gate branch.
The Misalignment: We load a 128-token chunk into SRAM for the Global Mixer. However, the branches feeding into it ($Q, K, V$) are restricted by the ShortConv (k=4). Even though the memory is already hot, we fail to perform attention across the full chunk before the global mixing step.
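To make the mismatch concrete, the causal receptive fields of the two local mixers can be compared directly. This is a toy sketch with illustrative helper names, assuming a causal mixer whose output at position $t$ sees positions $t-k+1$ through $t$:

```python
def conv_receptive_field(t, k=4):
    # Causal depthwise conv: output t sees inputs t-k+1 .. t
    return list(range(max(0, t - k + 1), t + 1))

def swa_receptive_field(t, window=128):
    # Chunk-aligned sliding window: output t sees inputs t-window+1 .. t
    return list(range(max(0, t - window + 1), t + 1))

t = 100  # a token in the middle of a hot 128-token chunk
print(len(conv_receptive_field(t)))  # 4 of the resident tokens
print(len(swa_receptive_field(t)))   # 101: the entire causal prefix of the chunk
```

Both mixers read tokens that are already resident in fast memory; the conv simply ignores almost all of them.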

Why ShortSWA is Inevitable

The case for replacing ShortConv with ShortSWA rests on three pillars: hardware efficiency, modeling capacity, and empirical intuition.

1. Hardware Alignment

  - ShortConv ($k=4$) is memory-bound: it loads data but performs very little arithmetic.
  - ShortSWA ($w=128$) runs dense local attention inside the chunk. Since the data is already in SRAM/registers for the global mixer, the incremental cost of intra-chunk attention is small relative to the expressive gains.
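A back-of-envelope arithmetic-intensity estimate (FLOPs per byte of activation traffic) illustrates the gap. The counts below are illustrative assumptions (fp16 activations, depthwise conv, head dimension 64), not measured numbers:

```python
def arithmetic_intensity(flops_per_token, bytes_per_token):
    """FLOPs per byte moved; higher means less memory-bound."""
    return flops_per_token / bytes_per_token

# Depthwise ShortConv, k=4: ~4 multiply-adds per token per channel,
# reading ~4 fp16 activations (2 bytes each) plus one output write.
conv_ai = arithmetic_intensity(flops_per_token=2 * 4,
                               bytes_per_token=2 * (4 + 1))

# ShortSWA, w=128, head_dim=64: QK^T and PV give ~2 * 2 * w * d FLOPs
# per token, over roughly q/k/v reads plus one output write once the
# chunk is resident in SRAM.
d, w = 64, 128
swa_ai = arithmetic_intensity(flops_per_token=2 * 2 * w * d,
                              bytes_per_token=2 * (3 * d + d))

print(conv_ai, swa_ai)  # SWA does far more useful work per byte moved
```

Under these assumptions the conv sits well below one FLOP per byte, while windowed attention lands in the tens, which is why the extra attention work is nearly free once the chunk is loaded.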

2. Filling the Gap

  - ShortConv only captures a handful of tokens ($t-3$ through $t$). Global mixers (SSMs) excel at long-range recurrence but may attenuate at medium distances.
  - ShortSWA fills this mid-range gap: a 128-token window captures clause- and sentence-level context precisely, reducing the burden on the global mixer.

3. Empirical Intuition

  - ShortConv is fixed and stencil-like: good for smoothing, bad for selective routing.
  - ShortSWA is content-adaptive: it supports copying, alignment, and delimiter-aware routing. As seen in vision (ConvNeXt vs. Swin), dynamic mixers win once they can be scaled efficiently.
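A toy scalar example of this difference: a fixed-weight conv mixes the same way no matter what the tokens contain, while softmax attention routes weight toward matching content. Both functions below are illustrative sketches, not part of any real architecture:

```python
import math

def fixed_conv_mix(xs, w=(0.25, 0.25, 0.25, 0.25)):
    # A k=4 causal conv applies the same weights to any input window.
    return sum(wi * xi for wi, xi in zip(w, xs[-4:]))

def attn_mix(query, keys, values):
    # Softmax attention reweights by content: scores depend on the data.
    scores = [math.exp(query * k) for k in keys]
    z = sum(scores)
    return sum(s / z * v for s, v in zip(scores, values))

xs = [0.0, 0.0, 5.0, 0.0]
print(fixed_conv_mix(xs))                 # 1.25, wherever the 5.0 sits
print(attn_mix(5.0, keys=xs, values=xs))  # ~5.0: the matching token dominates
```

The conv smears the salient token into an average; attention can retrieve it, which is what copying and delimiter-aware routing require.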

Implementation & Generality

The replacement applies to both SSMs (creating a Local-Global hybrid) and standard Transformers (replacing early positional mixing).

Figure 2: Visualizing the misalignment. With a 128-token chunk resident in fast memory (SRAM), ShortConv ($k=4$, fixed weights) touches only a sliver of the loaded context and wastes the rest, while ShortSWA attends across the full window, achieving 100% context utilization.

Conceptual Code (PyTorch-like)

class ModernBlockWithSWA(nn.Module):
    def forward(self, u):
        # 1. Parallel projection (standard in Mamba-2, GLA, Transformers)
        # Generate all branches at once: Gates, Values, Keys, etc.
        z, x, B, C = self.in_proj(u).split(self.split_sizes, dim=-1)  # extra branches / split sizes elided

        # 2. Local mixing: UPGRADED
        # Old: x = conv1d(x, k=4)
        # New: chunk-aligned ShortSWA
        # Window size = 128 (matching the chunk length of global mixer)
        x = self.short_swa(x, window_size=128)

        # Note: ShortSWA can be implemented efficiently using FlashAttention's
        # sliding-window or block-masking capabilities.

        # 3. Global mixing (the core heavy lifting)
        # Could be SSM (SSD), linear attention, or full attention
        y = self.global_mixer(x, B, C, ...)

        # 4. Gating and output
        y = y * F.silu(z)
        return self.out_proj(y)
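One way to realize `self.short_swa` is with PyTorch's `scaled_dot_product_attention` and an explicit causal sliding-window mask. The sketch below is a minimal single-head, projection-free illustration of the masking logic, not the author's implementation; in practice you would use FlashAttention's native sliding-window support rather than materializing a dense mask:

```python
import torch
import torch.nn.functional as F

def short_swa(x, window_size=128):
    # x: (batch, seq_len, dim); single head, q = k = v for brevity
    # (a real block would apply separate Q/K/V projections and heads)
    B, T, D = x.shape
    i = torch.arange(T).unsqueeze(1)
    j = torch.arange(T).unsqueeze(0)
    # Causal sliding window: token t attends to [t - window_size + 1, t]
    allowed = (j <= i) & (j > i - window_size)
    attn_mask = torch.zeros(T, T).masked_fill(~allowed, float("-inf"))
    return F.scaled_dot_product_attention(x, x, x, attn_mask=attn_mask)
```

The dense $(T, T)$ mask costs $O(T^2)$ memory and is only for clarity; a fused kernel applies the same constraint block by block inside the chunk.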

Conclusion

ShortConv was a necessary bridge, but it has become a bottleneck. Whether you are building the next-generation Mamba, designing a linear Transformer, or optimizing a full-attention model: if your hardware loads a 128-token chunk, your local mixer should exploit all 128 tokens.

"ShortConv is to ShortSWA what modern ConvNeXt is to ViT. A strong baseline that becomes structurally disadvantaged once dynamic, windowed attention is implemented efficiently at scale."

Citation

@article{zhang2025rethinking,
  title = {Rethinking SWA},
  author = {Zhang, Yifan},
  journal = {yifanzhang-pro.github.io},
  year = {2025},
  month = {December},
  url = {https://github.com/yifanzhang-pro/Rethinking-SWA}
}