For many cases, the answer will be “Yes, the property can be broken by a red-team specifically fine-tuning the model to break the property.” Then the question becomes “How much optimization effort did they have to put in to achieve this?” and “Is it plausible that the AI we are planning to actually deploy has had a similar or greater amount of optimization accidentally training it to break this property?”
The builder-breaker thing isn’t unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like “activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to constrained adversarial optimization”.
Correct. There are several other agendas that have this nice property too.