Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering the individual impor...

Source: http://arxiv.org/abs/2606.06302v1