Deep Delta Learning

A generalization of residual networks where a single scalar gate dynamically interpolates between identity, projection, and reflection, enabling the modeling of complex, non-monotonic dynamics.

Yifan Zhang1  ·  Yifeng Liu2  ·  Mengdi Wang1  ·  Quanquan Gu2

1Princeton University  •  2UCLA  •  January 1, 2026

Residual Learning · Geometric Algebra · Householder Reflection · Neural ODEs

Abstract

The efficacy of deep residual networks is fundamentally predicated on the identity shortcut connection. While this mechanism effectively mitigates the vanishing gradient problem, it imposes a strictly additive inductive bias on feature transformations, thereby limiting the network's capacity to model complex state transitions.

In this paper, we introduce Deep Delta Learning (DDL), a novel architecture that generalizes the standard residual connection by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This transformation, termed the Delta Operator, constitutes a rank-1 perturbation of the identity matrix, parameterized by a reflection direction vector $\kb(\Xb)$ and a gating scalar $\beta(\Xb)$. We provide a spectral analysis of this operator, demonstrating that the gate $\beta(\Xb)$ enables dynamic interpolation between identity mapping, orthogonal projection, and geometric reflection. Furthermore, we restructure the residual update as a synchronous rank-1 injection, where the gate acts as a dynamic step size governing both the erasure of old information and the writing of new features. This unification empowers the network to explicitly control the spectrum of its layer-wise transition operator, enabling the modeling of complex, non-monotonic dynamics while preserving the stable training characteristics of gated residual architectures.


The Delta Residual Block

Standard residual networks approximate the ODE $\dot{\Xb} = \Fb(\Xb)$ via an additive update $\Xb_{l+1} = \Xb_l + \Fb(\Xb_l)$. DDL generalizes this by applying a rank-1 transformation to the hidden state matrix $\Xb \in \RR^{d \times d_v}$. The Delta-Res block update rule is defined as:

$$ \Xb_{l+1} = \underbrace{(\Ib - \beta_l \kb_l \kb_l^\top)}_{\text{Delta Operator } \Ab(\Xb)} \Xb_l + \beta_l \kb_l \vb_l^\top $$

Figure 1: The Deep Delta Residual Block. The architecture generalizes the standard residual connection. A learnable scalar gate $\beta$ controls a rank-1 geometric transformation.

The network learns the unit-norm reflection direction $\kb \in \RR^d$, the value vector $\vb \in \RR^{d_v}$, and the gate $\beta \in \RR$. This formulation couples the "erasure" of old information (via projection onto $\kb$) with the "writing" of new information (via injection of $\vb$), both scaled synchronously by the gate $\beta$.
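The update above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; the function and variable names are ours, and $\kb$, $\vb$, $\beta$ are taken as given rather than produced by learned sub-networks.

```python
import numpy as np

def delta_res_block(X, k, v, beta):
    """One Delta-Res update: X_{l+1} = (I - beta k k^T) X + beta k v^T.

    X    : (d, d_v) hidden-state matrix
    k    : (d,)     reflection direction (normalized inside)
    v    : (d_v,)   value vector
    beta : scalar gate
    """
    k = k / np.linalg.norm(k)            # reflection direction is unit-norm
    erase = beta * np.outer(k, k) @ X    # remove the component of X along k
    write = beta * np.outer(k, v)        # inject the new value along k
    return X - erase + write

rng = np.random.default_rng(0)
d, d_v = 4, 3
X = rng.standard_normal((d, d_v))
k = rng.standard_normal(d)
v = rng.standard_normal(d_v)

# beta -> 0 recovers the plain skip connection: X_{l+1} = X_l.
print(np.allclose(delta_res_block(X, k, v, beta=0.0), X))  # True
```

At $\beta = 1$ the block overwrites the $\kb$-subspace entirely: projecting the output back onto $\kb$ returns exactly $\vb^\top$, regardless of what $\Xb_l$ contained there.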

Spectral Analysis & Geometric Unification

The expressive power of DDL stems from the spectral properties of the Delta Operator $\Ab(\Xb)$, which are deterministically controlled by the gate $\beta$. For unit-norm $\kb$, the eigenvalues of $\Ab(\Xb)$ are $\{1, \dots, 1, 1-\beta\}$: the eigenvalue $1-\beta$ acts along $\kb$, while the remaining $d-1$ eigenvalues are fixed at $1$. This allows the network to interpolate between three fundamental linear transformations:

| Regime | $\beta$ Value | Spectrum | Interpretation |
| --- | --- | --- | --- |
| Identity | $\beta \to 0$ | $\{1\}$ | Skip connection: signal preservation for deep propagation; $\Xb_{l+1} \approx \Xb_l$. |
| Projection | $\beta \to 1$ | $\{0, 1\}$ | Forgetting: orthogonal projection onto the hyperplane $\kb^\perp$, erasing components parallel to $\kb$; $\det(\Ab) \to 0$. |
| Reflection | $\beta \to 2$ | $\{-1, 1\}$ | Householder reflection: inverts the state along $\kb$, introducing negative eigenvalues to model oscillatory/oppositional dynamics; $\det(\Ab) \to -1$. |
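The three regimes can be verified numerically. The sketch below (our own check, not from the paper) builds $\Ab = \Ib - \beta \kb \kb^\top$ for a random unit-norm $\kb$ and inspects its spectrum and determinant at $\beta \in \{0, 1, 2\}$.

```python
import numpy as np

# Spectrum of the Delta Operator A(beta) = I - beta * k k^T for unit-norm k.
# Eigenvalues: 1 with multiplicity d-1, and 1 - beta along the direction k.
rng = np.random.default_rng(0)
d = 5
k = rng.standard_normal(d)
k /= np.linalg.norm(k)

def A(beta):
    return np.eye(d) - beta * np.outer(k, k)

for beta in (0.0, 1.0, 2.0):
    eigs = np.sort(np.linalg.eigvalsh(A(beta)))  # A is symmetric
    det = np.linalg.det(A(beta))
    # beta=0: all eigenvalues 1, det 1 (identity)
    # beta=1: smallest eigenvalue 0, det 0 (projection onto k-perp)
    # beta=2: smallest eigenvalue -1, det -1 (Householder reflection)
    print(f"beta={beta}: eigs={np.round(eigs, 6)}, det={det:.6f}")
```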

Depth-Wise Delta Rule

DDL establishes a theoretical link to efficient sequence models like DeltaNet. While DeltaNet applies the "Delta Rule" over the time dimension, Deep Delta Learning applies it over the depth dimension. Expanding the DDL update reveals the classic Delta Rule structure:

$$ \Xb_{l+1} = \Xb_l + \beta_l \kb_l (\underbrace{\vb_l^\top}_{\text{Target}} - \underbrace{\kb_l^\top \Xb_l}_{\text{Current Projection}}) $$

This allows the network to selectively "clean" or "rewrite" specific feature subspaces layer-by-layer, preventing the accumulation of interference common in standard additive ResNets.
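The delta-rule form and the operator form of the update are algebraically identical, which a short numerical check makes concrete. This is an illustrative verification under our own variable names, with $\kb$ normalized to unit length as the spectral analysis assumes.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_v, beta = 6, 4, 0.7
X = rng.standard_normal((d, d_v))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                 # unit-norm reflection direction
v = rng.standard_normal(d_v)

# Operator form: X_{l+1} = (I - beta k k^T) X + beta k v^T
operator_form = (np.eye(d) - beta * np.outer(k, k)) @ X + beta * np.outer(k, v)

# Delta-rule form: X_{l+1} = X + beta k (v^T - k^T X)
# (v - X^T k) is the row-vector error between the target and current projection.
delta_rule_form = X + beta * np.outer(k, v - X.T @ k)

print(np.allclose(operator_form, delta_rule_form))  # True
```

The gate $\beta$ thus plays the role of a per-layer learning rate: at $\beta = 1$ the error term is applied in full and the $\kb$-subspace is rewritten exactly to the target $\vb^\top$.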

Citation

If you find this work useful, please cite:

@article{zhang2026deep,
   title   = {Deep Delta Learning},
   author  = {Zhang, Yifan and Liu, Yifeng and Wang, Mengdi and Gu, Quanquan},
   journal = {arXiv preprint arXiv:2601.00417},
   year    = {2026}
}