For many cases, the answer will be “Yes, the property can be broken by a red-team specifically fine-tuning the model to break the property.” Then the question becomes “How much optimization effort did they have to put in to achieve this?” and “Is it plausible that the AI we are planning to actually deploy has had a similar or greater amount of optimization accidentally training it to break this property?”
The builder-breaker thing isn’t unique to CoT though, right? My gloss on the recent Obfuscated Activations paper is something like “activation engineering is not robust to arbitrary adversarial optimization, and only somewhat robust to constrained adversarial optimization”.
Correct. There are several other agendas that have this nice property too.