DeepMind Gemini Safety lead; Foundation board member
One way this could happen is searching for jailbreaks in the space of paraphrases and synonyms of a benign prompt.
Why would this produce fake/unlikely jailbreaks? If the paraphrases and such are natural, isn't the nearness to a real(istic) prompt enough to suggest that the jailbreak found is also realistic? Of course you can adversarially generate wildly unrealistic things, but does that necessarily happen with paraphrasing-style attacks?
I think there are two paths, roughly, that RSPs could send us down.
If you think that Anthropic and other labs that adopt these are fundamentally well meaning and trying to do the right thing, you'll assume that we are by default heading down path #1. If you are more cynical about how companies are acting, then #2 may seem more plausible.
My feeling is that Anthropic et al are clearly trying to do the right thing, and that it's on us to do the work to ensure that we stay on the good path here, by working to deliver the concrete pieces we need, and to keep the pressure on AI labs to take these ideas seriously. And to ask regulators to also take concrete steps to make RSPs have teeth and enforce the right outcomes.
But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.
One tip for research of this kind is to measure not only recall but also precision. It's easy to block 100% of dangerous prompts by blocking 100% of prompts, but obviously that doesn't work in practice. The actual task labs are trying to solve is to block as many unsafe prompts as possible while rarely blocking safe ones; in other words, to look at both precision and recall.
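To make the point concrete, here's a minimal sketch (with made-up labels and a hypothetical block-everything classifier) showing why recall alone is a misleading metric:

```python
# Evaluate a prompt-blocking classifier on both precision and recall.
# True = "unsafe" (for labels) or "blocked" (for predictions).

def precision_recall(predictions, labels):
    """Compute (precision, recall) from parallel boolean lists."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical eval set: 2 unsafe prompts, 3 safe prompts.
labels = [True, False, False, False, True]

# A degenerate classifier that blocks everything scores perfect recall
# (no unsafe prompt gets through) but terrible precision (every safe
# prompt is refused too).
block_all = [True] * len(labels)
p, r = precision_recall(block_all, labels)
print(p, r)  # → 0.4 1.0
```

Reporting both numbers makes the degenerate "refuse everything" strategy visibly bad rather than letting it look like a 100%-recall success.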
Of course, with truly dangerous models and prompts you do want ~100% recall, and in that situation it's fair to say that nobody should ever be able to build a bioweapon. But in the world we currently live in, the amount of uplift you get from a frontier model on a prompt in your dataset isn't very much, so it's reasonable to trade off against losses from over-refusal.