PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency through batching, scheduling, and parallelism, they largely treat GPU power as a static constraint ra...

Source

http://arxiv.org/abs/2605.21427v1