Matrix Exponential Attention

Yifan Zhang
December 15, 2025
Tags: Linear Attention · State Space Models · Taylor Expansion · Algorithm Design

Overview

Standard Transformers scale quadratically with sequence length due to the softmax attention mechanism. Matrix Exponential Attention (MEA) offers a solution by approximating the matrix exponential of the attention scores via a truncated Taylor series.

$$ \mathrm{MExp}(\mathbf{Q} \mathbf{K}^{\top}) \mathbf{V} \approx \sum_{k=0}^{H} \frac{1}{k!} \mathrm{HLA}_k(\mathbf{Q}, \mathbf{K}, \mathbf{V}) $$

By leveraging the state-space realization of Higher-order Linear Attention (HLA), MEA computes high-order interaction terms (powers of the attention matrix) in linear time without ever materializing the massive $n \times n$ attention matrices. This ensures the model maintains the efficiency of an RNN during inference while capturing complex token interactions.

Theoretical foundation: Higher-order Linear Attention (HLA).

Mathematical Formulation

Softmax vs. Matrix Exponential

Standard Scaled Dot-Product Attention utilizes the softmax nonlinearity, which creates a mixing bottleneck that prevents easy linearization:

$$ \mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} $$

MEA replaces the softmax with the Matrix Exponential ($\mathrm{MExp}$). For an unnormalized attention matrix $\mathbf{A} = \mathbf{Q}\mathbf{K}^\top$:

$$ \mathrm{MExp}(\mathbf{A})\mathbf{V} = e^{\mathbf{A}}\mathbf{V} = \left( \sum_{k=0}^{\infty} \frac{1}{k!} \mathbf{A}^k \right) \mathbf{V} $$

While the infinite series is exact, we approximate this by truncating the series at order $H$ (typically $H=2$). This truncation is the key to computational tractability.
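The truncation error is easy to observe numerically. The sketch below (plain NumPy; a high-order partial sum stands in for the exact exponential) builds $\sum_{k=0}^{H} \frac{1}{k!}\mathbf{A}^k \mathbf{V}$ by iterated matrix products, so no power $\mathbf{A}^k$ is ever formed explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_v = 8, 4
A = 0.1 * rng.standard_normal((n, n))    # modest scores => fast convergence
V = rng.standard_normal((n, d_v))

def truncated_mexp_v(A, V, H):
    """Compute sum_{k=0}^{H} A^k V / k! without forming any matrix power."""
    term, out = V.copy(), V.copy()
    for k in range(1, H + 1):
        term = (A @ term) / k            # incrementally: A^k V / k!
        out = out + term
    return out

exact = truncated_mexp_v(A, V, 30)       # near-exact reference for small ||A||
errs = [np.linalg.norm(exact - truncated_mexp_v(A, V, H)) for H in (1, 2, 4)]
```

For scores of modest magnitude, each added order shrinks the error by roughly a factor of $\|\mathbf{A}\|/(H+1)$, which is why a low truncation such as $H = 2$ is already serviceable.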

Recursive Decomposition via HLA

Explicitly computing the matrix power $\mathbf{A}^k$ would require $\mathcal{O}(n^3)$ complexity. Even an optimized iterative product scales as $\mathcal{O}(n^2)$. MEA exploits the associativity of matrix multiplication to factorize these terms into streaming updates with $\mathcal{O}(n)$ complexity.
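The regrouping is nothing more than associativity, and is easy to verify numerically. In this minimal sketch (plain NumPy; causal masking is deferred to the next subsection), the quadratic route materializes $\mathbf{A} = \mathbf{Q}\mathbf{K}^\top$ and squares it, while the linear route only ever forms $d \times d$ and $d \times d_v$ intermediates:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, d_v = 256, 16, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

# Quadratic route: materialize the n x n score matrix, then square it.
A = Q @ K.T
quadratic = A @ (A @ V)                  # O(n^2) time and memory

# Linear route: regroup as Q (K^T Q) (K^T V); no n x n matrix appears.
linear = Q @ ((K.T @ Q) @ (K.T @ V))     # O(n d^2 + n d d_v)
```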

Figure 1: Decomposition of Matrix Exponential Attention into cumulative interaction orders ($H = 2$): the identity term $\mathbf{V}$ (order 0), linear attention $(\mathbf{Q}\mathbf{K}^\top)\mathbf{V}$ (order 1), and the asymmetric HLA term $\tfrac{1}{2}(\mathbf{Q}\mathbf{K}^\top)^2\mathbf{V}$ (order 2), evaluated along the streaming path $\mathbf{Q} \to \mathbf{K}^\top \to \mathbf{Q} \to \mathbf{K}^\top \to \mathbf{V}$ in $\mathcal{O}(n)$.

Streaming Updates

Key Insight: Order 2 represents the path $\mathbf{Q} (\mathbf{K}^\top \mathbf{Q}) (\mathbf{K}^\top \mathbf{V})$. This allows us to maintain compact streaming states.

The output for the second-order term at time $t$ is computed via:

$$ \mathbf{o}_t^{(2)} = \mathbf{q}_t^\top \mathbf{E}_t $$

Where the sufficient statistics (states) update recursively:

$$ \begin{aligned} \mathbf{P}_t^{KV} &= \sum_{j \le t} \mathbf{k}_j \mathbf{v}_j^\top &\in \mathbb{R}^{d \times d_v} \\ \mathbf{E}_t &= \sum_{i \le t} \mathbf{k}_i (\mathbf{q}_i^\top \mathbf{P}_i^{KV}) &\in \mathbb{R}^{d \times d_v} \end{aligned} $$
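These recurrences can be checked against the causally masked quadratic-time reference. In the sketch below (plain NumPy), the reference squares the lower-triangular score matrix, while the streaming loop carries only the two $d \times d_v$ states $\mathbf{P}$ and $\mathbf{E}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_v = 32, 4, 3
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

# Reference: square the causally masked score matrix (entries j <= i kept).
A = np.tril(Q @ K.T)
ref = (A @ A) @ V                        # row t is o_t^(2)

# Streaming: constant-size state per step, never materializing A.
P = np.zeros((d, d_v))                   # P_t^{KV} = sum_{j<=t} k_j v_j^T
E = np.zeros((d, d_v))                   # E_t = sum_{i<=t} k_i (q_i^T P_i)
out = np.zeros((n, d_v))
for t in range(n):
    P += np.outer(K[t], V[t])
    E += np.outer(K[t], Q[t] @ P)        # uses the freshly updated P_t
    out[t] = Q[t] @ E                    # o_t^(2) = q_t^T E_t
```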

Exact Causal Masking

To enforce strict autoregressive causality (masking the upper-triangular part of $\mathbf{Q}\mathbf{K}^\top$), MEA relies on the Extended Summaries theorem. For the symmetric interpretation of second-order interactions, we maintain the key second moment $\mathbf{S}_t^{K} = \sum_{i \le t} \mathbf{k}_i \mathbf{k}_i^\top$, the cross moment $\mathbf{C}_t^{QV} = \sum_{i \le t} \mathbf{q}_i \mathbf{v}_i^\top$, and a cross-moment summary $\mathbf{G}_t$ that subtracts acausal contributions dynamically:

$$ \mathbf{o}_t^{\text{sym}} = \mathbf{q}_t^\top \left( \mathbf{S}_t^K \mathbf{C}_t^{QV} - \mathbf{G}_t \right) $$

Where the correction term $\mathbf{G}_t$ accumulates the interaction history:

$$ \mathbf{G}_t = \sum_{i \le t} (\mathbf{k}_i \mathbf{k}_i^\top) \mathbf{C}_{i-1}^{QV} $$

This makes the streaming recurrence mathematically equivalent to the causally masked, quadratic-time computation of the truncated exponential, while retaining the inference efficiency of a recurrent neural network.
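Under the reading that $\mathbf{S}_t^{K} = \sum_{i \le t} \mathbf{k}_i \mathbf{k}_i^\top$ and $\mathbf{C}_t^{QV} = \sum_{i \le t} \mathbf{q}_i \mathbf{v}_i^\top$ (inferred here from the dimensions of $\mathbf{G}_t$; the HLA paper gives the precise definitions), the correction can be checked against a brute-force triple loop:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_v = 16, 4, 3
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

# Brute-force reference: o_t^sym = sum_{i<=j<=t} (q_t.k_i)(k_i.q_j) v_j.
ref = np.zeros((n, d_v))
for t in range(n):
    for i in range(t + 1):
        for j in range(i, t + 1):
            ref[t] += (Q[t] @ K[i]) * (K[i] @ Q[j]) * V[j]

# Streaming: G accumulates k_i k_i^T C_{i-1} to cancel acausal terms.
S = np.zeros((d, d))                     # S_t^K  (assumed: sum k_i k_i^T)
C = np.zeros((d, d_v))                   # C_t^QV (assumed: sum q_i v_i^T)
G = np.zeros((d, d_v))
out = np.zeros((n, d_v))
for t in range(n):
    kk = np.outer(K[t], K[t])
    G += kk @ C                          # C still holds C_{t-1} here
    S += kk
    C += np.outer(Q[t], V[t])
    out[t] = Q[t] @ (S @ C - G)          # o_t^sym = q_t^T (S_t C_t - G_t)
```

The update order matters: $\mathbf{G}$ must be advanced before $\mathbf{C}$ so that it accumulates $\mathbf{k}_t \mathbf{k}_t^\top \mathbf{C}_{t-1}^{QV}$, exactly as in the sum above.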

Citation

@article{zhang2025matrix,
  title   = {Matrix Exponential Attention},
  author  = {Zhang, Yifan},
  journal = {yifanzhang-pro.github.io},
  year    = {2025},
  month   = {December},
  url     = {https://github.com/yifanzhang-pro/MEA}
}

@article{zhang2025hla,
  title   = {Higher-order Linear Attention},
  author  = {Zhang, Yifan and Qin, Zhen and Gu, Quanquan},
  journal = {arXiv preprint arXiv:2510.27258},
  year    = {2025}
}