Most modern efficient architectures (GLA, Mamba-2, RetNet variants) adopt a parallel structure: the ShortConv is applied strictly to the content/state branches ($Q, K, V$ or SSM inputs) to inject a local inductive bias, while the Gate branch typically bypasses this step and acts purely as a multiplicative control.
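To make the local-mixing step concrete, here is a minimal sketch of a causal depthwise ShortConv with $k=4$ in NumPy. The function name and the explicit loop are illustrative, not any library's API; the point is that position $t$ only mixes inputs from $t-3$ through $t$, so no future information leaks.

```python
import numpy as np

def short_conv(x, w):
    """Causal depthwise ShortConv (illustrative sketch).

    x: (T, d) input sequence; w: (k, d) per-channel filter, here k=4.
    Output at position t mixes only inputs t-k+1 .. t (causal).
    """
    k, d = w.shape
    # left-pad with zeros so early positions see a shorter history
    pad = np.concatenate([np.zeros((k - 1, d)), x], axis=0)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # per-channel dot product over the length-k causal window
        out[t] = np.sum(pad[t:t + k] * w, axis=0)
    return out

x = np.arange(12, dtype=float).reshape(6, 2)
w = np.zeros((4, 2))
w[-1] = 1.0                  # filter that just copies the current timestep
y = short_conv(x, w)         # equals x: only weight on the current position
```

In a real model this is `Conv1d` with `groups = d` (depthwise) and causal left-padding; each of the $Q$, $K$, $V$ branches gets its own filter bank.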
The Standard Pipeline
- Parallel Linear Projection: Input $u$ splits into mixing branches ($Q, K, V$ or $x, B, C$) and a multiplicative Gate branch ($z$).
- Local Mixing (The Bottleneck): A fixed `Conv1d(k=4)` is applied to the mixing branches to enforce local smoothness. Note that the Gate branch bypasses this step.
- Global Mixing: The convolved branches enter the global operator (SSM or Linear Attn), computed in chunks.
- Gating: The output of the mixer is multiplied by the Gate branch.
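The four steps above can be sketched end-to-end. This is an illustrative NumPy block, not any specific model's implementation: the projection matrices, filter shapes, and the sigmoid gate are assumptions, and the global operator is a plain (ungated) causal linear attention with a running state $S_t = \sum_{s \le t} k_s^\top v_s$, computed sequentially rather than in chunks for clarity.

```python
import numpy as np

def causal_conv(x, w):
    """Causal depthwise conv: position t sees only inputs t-k+1 .. t."""
    k, d = w.shape
    pad = np.concatenate([np.zeros((k - 1, d)), x], axis=0)
    return np.stack([np.sum(pad[t:t + k] * w, axis=0) for t in range(len(x))])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_linear_attn_block(u, params):
    """One block: project -> ShortConv (Q, K, V only) -> linear attn -> gate."""
    Wq, Wk, Wv, Wz, conv_q, conv_k, conv_v = params
    # 1. parallel linear projections, including the gate branch z
    q, k, v, z = u @ Wq, u @ Wk, u @ Wv, u @ Wz
    # 2. local mixing on the content branches; z deliberately skips the conv
    q = causal_conv(q, conv_q)
    k = causal_conv(k, conv_k)
    v = causal_conv(v, conv_v)
    # 3. global mixing: causal linear attention via running state S_t
    T, d = q.shape
    S = np.zeros((d, d))
    y = np.zeros_like(v)
    for t in range(T):
        S = S + np.outer(k[t], v[t])
        y[t] = q[t] @ S
    # 4. multiplicative gating with the un-convolved branch
    return y * sigmoid(z)

rng = np.random.default_rng(0)
d, T = 8, 16
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(4)] + \
         [rng.normal(size=(4, d)) * 0.5 for _ in range(3)]
u = rng.normal(size=(T, d))
out = gated_linear_attn_block(u, params)   # shape (T, d)
```

Because the conv is causal and the state accumulates left to right, the whole block is causal: zeroing the input from position $t$ onward leaves outputs before $t$ unchanged, which is what lets production kernels process the global step in chunks.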