More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying...
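
For context, here is a minimal sketch of the fixed-activation baselines the abstract mentions: a classic two-matrix FFN with ReLU or GELU, and a gated SwiGLU variant. It assumes bias-free projections and plain Python lists. The names `matvec`, `ffn`, and `swiglu_ffn` are illustrative and not taken from the paper, and the sketch covers only the baseline, not the paper's token-adaptive mixing.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x times the standard normal CDF at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # SiLU (Swish): x * sigmoid(x).
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(x, W_up, W_down, act):
    # Classic FFN: W_down @ act(W_up @ x).
    # The same activation `act` is applied to every token and every hidden unit.
    h = [act(v) for v in matvec(W_up, x)]
    return matvec(W_down, h)

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated FFN: W_down @ (silu(W_gate @ x) * (W_up @ x)).
    # The gate is input-dependent, but the activation (SiLU) is still fixed.
    g = matvec(W_gate, x)
    u = matvec(W_up, x)
    h = [silu(a) * b for a, b in zip(g, u)]
    return matvec(W_down, h)

# Toy example: model width 2, hidden width 3 (values chosen by hand).
x = [1.0, -2.0]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_gate = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
W_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

y_relu = ffn(x, W_up, W_down, relu)
y_gelu = ffn(x, W_up, W_down, gelu)
y_swiglu = swiglu_ffn(x, W_gate, W_up, W_down)
```

In both functions the nonlinearity is chosen once, at design time, and is identical for every input token; that is the sense of "single fixed activation function" in the abstract.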

Source

http://arxiv.org/abs/2605.26647v1