Sequential Data Poisoning in LLM Post-Training

LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data po...

Source

http://arxiv.org/abs/2606.04929v1