Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the c...

Source

http://arxiv.org/abs/2605.21292v1