Matrix Exponential Attention

Yifan Zhang
December 15, 2025
Tags: Linear Attention · State Space Models · Taylor Expansion · Algorithm Design

Overview

Standard Transformers scale quadratically with sequence length due to the softmax attention mechanism. Matrix Exponential Attention (MEA) offers a solution by approximating the matrix exponential of the attention scores via a truncated Taylor series.

$$ \mathrm{MExp}(\mathbf{Q} \mathbf{K}^{\top}) \mathbf{V} \approx \sum_{k=0}^{H} \frac{1}{k!} \mathrm{HLA}_k(\mathbf{Q}, \mathbf{K}, \mathbf{V}) $$

By leveraging the state-space realization of Higher-order Linear Attention (HLA), MEA computes high-order interaction terms (powers of the attention matrix) in linear time without ever materializing the massive $n \times n$ attention matrices. This ensures the model maintains the efficiency of an RNN during inference while capturing complex token interactions.

Theoretical foundation: Higher-order Linear Attention (HLA).

Mathematical Formulation

Softmax vs. Matrix Exponential

Standard Scaled Dot-Product Attention utilizes the softmax nonlinearity, which creates a mixing bottleneck that prevents easy linearization:

$$ \mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V} $$

MEA replaces the softmax with the Matrix Exponential ($\mathrm{MExp}$). For an unnormalized attention matrix $\mathbf{A} = \mathbf{Q}\mathbf{K}^\top$:

$$ \mathrm{MExp}(\mathbf{A})\mathbf{V} = e^{\mathbf{A}}\mathbf{V} = \left( \sum_{k=0}^{\infty} \frac{1}{k!} \mathbf{A}^k \right) \mathbf{V} $$

While the infinite series is exact, we approximate this by truncating the series at order $H$ (typically $H=2$). This truncation is the key to computational tractability.
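The truncation error is easy to observe numerically. The sketch below (plain NumPy; a high-order partial sum stands in for the exact exponential) builds $\sum_{k=0}^{H} \frac{1}{k!}\mathbf{A}^k \mathbf{V}$ by iterated matrix products, so no power $\mathbf{A}^k$ is ever formed explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_v = 8, 4
A = 0.1 * rng.standard_normal((n, n))    # modest scores => fast convergence
V = rng.standard_normal((n, d_v))

def truncated_mexp_v(A, V, H):
    """Compute sum_{k=0}^{H} A^k V / k! without forming any matrix power."""
    term, out = V.copy(), V.copy()
    for k in range(1, H + 1):
        term = (A @ term) / k            # incrementally: A^k V / k!
        out = out + term
    return out

exact = truncated_mexp_v(A, V, 30)       # near-exact reference for small ||A||
errs = [np.linalg.norm(exact - truncated_mexp_v(A, V, H)) for H in (1, 2, 4)]
```

For scores of modest magnitude, each added order shrinks the error by roughly a factor of $\|\mathbf{A}\|/(H+1)$, which is why a low truncation such as $H = 2$ is already serviceable.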

Recursive Decomposition via HLA

Explicitly computing the matrix power $\mathbf{A}^k$ would require $\mathcal{O}(n^3)$ complexity. Even an optimized iterative product scales as $\mathcal{O}(n^2)$. MEA exploits the associativity of matrix multiplication to factorize these terms into streaming updates with $\mathcal{O}(n)$ complexity.
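The regrouping is nothing more than associativity, and is easy to verify numerically. In this minimal sketch (plain NumPy; causal masking is deferred to the next subsection), the quadratic route materializes $\mathbf{A} = \mathbf{Q}\mathbf{K}^\top$ and squares it, while the linear route only ever forms $d \times d$ and $d \times d_v$ intermediates:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, d_v = 256, 16, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

# Quadratic route: materialize the n x n score matrix, then square it.
A = Q @ K.T
quadratic = A @ (A @ V)                  # O(n^2) time and memory

# Linear route: regroup as Q (K^T Q) (K^T V); no n x n matrix appears.
linear = Q @ ((K.T @ Q) @ (K.T @ V))     # O(n d^2 + n d d_v)
```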

Figure 1: Decomposition of Matrix Exponential Attention into cumulative interaction orders ($H = 2$): the identity term $\mathbf{V}$ (order 0), linear attention $(\mathbf{Q}\mathbf{K}^\top)\mathbf{V}$ (order 1), and the asymmetric HLA term $\tfrac{1}{2}(\mathbf{Q}\mathbf{K}^\top)^2\mathbf{V}$ (order 2), evaluated along the streaming path $\mathbf{Q} \to \mathbf{K}^\top \to \mathbf{Q} \to \mathbf{K}^\top \to \mathbf{V}$ in $\mathcal{O}(n)$.

Streaming Updates

Key Insight: Order 2 represents the path $\mathbf{Q} (\mathbf{K}^\top \mathbf{Q}) (\mathbf{K}^\top \mathbf{V})$. This allows us to maintain compact streaming states.

The output for the second-order term at time $t$ is computed via:

$$ \mathbf{o}_t^{(2)} = \mathbf{q}_t^\top \mathbf{E}_t $$

Where the sufficient statistics (states) update recursively:

$$ \begin{aligned} \mathbf{P}_t^{KV} &= \sum_{j \le t} \mathbf{k}_j \mathbf{v}_j^\top &\in \mathbb{R}^{d \times d_v} \\ \mathbf{E}_t &= \sum_{i \le t} \mathbf{k}_i (\mathbf{q}_i^\top \mathbf{P}_i^{KV}) &\in \mathbb{R}^{d \times d_v} \end{aligned} $$
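These recurrences can be checked against the causally masked quadratic-time reference. In the sketch below (plain NumPy), the reference squares the lower-triangular score matrix, while the streaming loop carries only the two $d \times d_v$ states $\mathbf{P}$ and $\mathbf{E}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_v = 32, 4, 3
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

# Reference: square the causally masked score matrix (entries j <= i kept).
A = np.tril(Q @ K.T)
ref = (A @ A) @ V                        # row t is o_t^(2)

# Streaming: constant-size state per step, never materializing A.
P = np.zeros((d, d_v))                   # P_t^{KV} = sum_{j<=t} k_j v_j^T
E = np.zeros((d, d_v))                   # E_t = sum_{i<=t} k_i (q_i^T P_i)
out = np.zeros((n, d_v))
for t in range(n):
    P += np.outer(K[t], V[t])
    E += np.outer(K[t], Q[t] @ P)        # uses the freshly updated P_t
    out[t] = Q[t] @ E                    # o_t^(2) = q_t^T E_t
```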

Exact Causal Masking

To enforce strict autoregressive causality (masking the upper-triangular part of $\mathbf{Q}\mathbf{K}^\top$), MEA relies on the Extended Summaries theorem. For the symmetric interpretation of second-order interactions, we maintain the key second moment $\mathbf{S}_t^{K} = \sum_{i \le t} \mathbf{k}_i \mathbf{k}_i^\top$, the cross moment $\mathbf{C}_t^{QV} = \sum_{i \le t} \mathbf{q}_i \mathbf{v}_i^\top$, and a cross-moment summary $\mathbf{G}_t$ that subtracts acausal contributions dynamically:

$$ \mathbf{o}_t^{\text{sym}} = \mathbf{q}_t^\top \left( \mathbf{S}_t^K \mathbf{C}_t^{QV} - \mathbf{G}_t \right) $$

Where the correction term $\mathbf{G}_t$ accumulates the interaction history:

$$ \mathbf{G}_t = \sum_{i \le t} (\mathbf{k}_i \mathbf{k}_i^\top) \mathbf{C}_{i-1}^{QV} $$

This makes the streaming recurrence mathematically equivalent to the causally masked, quadratic-time computation of the truncated exponential, while retaining the inference efficiency of a recurrent neural network.
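Under the reading that $\mathbf{S}_t^{K} = \sum_{i \le t} \mathbf{k}_i \mathbf{k}_i^\top$ and $\mathbf{C}_t^{QV} = \sum_{i \le t} \mathbf{q}_i \mathbf{v}_i^\top$ (inferred here from the dimensions of $\mathbf{G}_t$; the HLA paper gives the precise definitions), the correction can be checked against a brute-force triple loop:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, d_v = 16, 4, 3
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d_v))

# Brute-force reference: o_t^sym = sum_{i<=j<=t} (q_t.k_i)(k_i.q_j) v_j.
ref = np.zeros((n, d_v))
for t in range(n):
    for i in range(t + 1):
        for j in range(i, t + 1):
            ref[t] += (Q[t] @ K[i]) * (K[i] @ Q[j]) * V[j]

# Streaming: G accumulates k_i k_i^T C_{i-1} to cancel acausal terms.
S = np.zeros((d, d))                     # S_t^K  (assumed: sum k_i k_i^T)
C = np.zeros((d, d_v))                   # C_t^QV (assumed: sum q_i v_i^T)
G = np.zeros((d, d_v))
out = np.zeros((n, d_v))
for t in range(n):
    kk = np.outer(K[t], K[t])
    G += kk @ C                          # C still holds C_{t-1} here
    S += kk
    C += np.outer(Q[t], V[t])
    out[t] = Q[t] @ (S @ C - G)          # o_t^sym = q_t^T (S_t C_t - G_t)
```

The update order matters: $\mathbf{G}$ must be advanced before $\mathbf{C}$ so that it accumulates $\mathbf{k}_t \mathbf{k}_t^\top \mathbf{C}_{t-1}^{QV}$, exactly as in the sum above.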

Citation

@article{zhang2025matrix,
  title   = {Matrix Exponential Attention},
  author  = {Zhang, Yifan},
  journal = {yifanzhang-pro.github.io},
  year    = {2025},
  month   = {December},
  url     = {https://github.com/yifanzhang-pro/MEA}
}

@article{zhang2025hla,
  title   = {Higher-order Linear Attention},
  author  = {Zhang, Yifan and Qin, Zhen and Gu, Quanquan},
  journal = {arXiv preprint arXiv:2510.27258},
  year    = {2025}
}