Boundary-targeted Membership Inference Attacks on Safety Classifiers

Safety classifiers are essential safeguards within generative AI systems: they filter harmful content or identify at-risk users interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, ...

Source

http://arxiv.org/abs/2605.22373v1