GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

Mixture-of-Experts (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and, at inference time, route each token to the GPUs that host its activated experts. They process tokens in...
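
To make the routing step concrete, here is a minimal sketch of how top-k expert selection and an expert-to-GPU placement together determine how much work each GPU receives. This is not the paper's GEM method: the round-robin placement, the function names, and the gate scores are all assumptions chosen for illustration.

```python
from collections import Counter


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def round_robin_placement(num_experts, num_gpus):
    """Hypothetical placement (an assumption): expert e lives on GPU e % num_gpus."""
    return {e: e % num_gpus for e in range(num_experts)}


def route_tokens(token_scores, placement, k):
    """Count how many (token, expert) assignments land on each GPU."""
    load = Counter()
    for scores in token_scores:
        for expert in top_k_experts(scores, k):
            load[placement[expert]] += 1
    return load


# Made-up gate scores: 3 tokens, 4 experts.
token_scores = [
    [0.10, 0.70, 0.05, 0.15],
    [0.60, 0.10, 0.20, 0.10],
    [0.05, 0.50, 0.05, 0.40],
]
placement = round_robin_placement(num_experts=4, num_gpus=2)
gpu_load = route_tokens(token_scores, placement, k=2)
print(dict(gpu_load))
```

With these made-up scores, GPU 1 receives twice as many assignments as GPU 0, which shows how the choice of placement shapes per-GPU load.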

Source

http://arxiv.org/abs/2605.19945v1