Softmax vs. Matrix Exponential
Standard scaled dot-product attention applies a row-wise softmax nonlinearity, which creates a mixing bottleneck that prevents easy linearization:

$$\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V}.$$
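As a point of comparison, the standard formulation can be sketched as follows (a minimal NumPy reference implementation; the function name and shapes are illustrative, not from the original):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention (sketch).

    The row-wise softmax couples every entry of a row through a shared
    normalizing sum -- the mixing bottleneck that blocks linearization.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Subtract the row max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```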
MEA replaces the softmax with the matrix exponential ($\mathrm{MExp}$). For an unnormalized attention matrix $\mathbf{A} = \mathbf{Q}\mathbf{K}^\top$:

$$\mathrm{MExp}(\mathbf{A}) = \sum_{h=0}^{\infty} \frac{\mathbf{A}^h}{h!} = \mathbf{I} + \mathbf{A} + \frac{\mathbf{A}^2}{2!} + \cdots.$$
While the infinite series is exact, we approximate it by truncating the series at order $H$ (typically $H = 2$, which keeps $\mathbf{I} + \mathbf{A} + \tfrac{1}{2}\mathbf{A}^2$). This truncation is the key to computational tractability.
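The truncated series can be sketched numerically as follows (a minimal NumPy illustration, assuming square $\mathbf{A} = \mathbf{Q}\mathbf{K}^\top$ and a hypothetical function name; the original's normalization and scaling details are omitted):

```python
import numpy as np

def truncated_mexp_attention(Q, K, V, H=2):
    """Order-H truncation of matrix-exponential attention (sketch).

    Approximates MExp(A) = sum_{h=0}^inf A^h / h! by keeping terms up
    to order H, then mixes the values: (I + A + ... + A^H / H!) @ V.
    """
    A = Q @ K.T                    # unnormalized attention matrix
    term = np.eye(A.shape[0])      # h = 0 term: the identity
    S = term.copy()
    for h in range(1, H + 1):
        term = term @ A / h        # builds A^h / h! incrementally
        S = S + term
    return S @ V
```

Building each term from the previous one avoids forming $\mathbf{A}^h$ from scratch, so the cost is $H$ matrix multiplications.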