Daniel Kokotajlo comments on Daniel Kokotajlo’s Shortform

Daniel Kokotajlo 24 Oct 2024 23:10 UTC
LW: 12 AF: 9
6
AF
Anthropic says:
More realistic evaluations are difficult without risking actual sabotage. One difficulty of running realistic evaluations is presenting realistic attack surfaces. Perhaps the hardest and most important attack surface to accurately model is the suspiciousness of expert humans in high-stakes decisions. To test a given set of realworld anti-sabotage mitigations, it’s tempting to introduce some sort of “penetration testing” or red-teaming, in which models are instructed to attempt to harmlessly but verifiably sabotage real decisions in a reversible manner. However, while such red-teaming might usefully encourage decision-makers to consider the possibility of sabotage, actually allowing an agent to interfere with the operation of important organizations risks introducing real vulnerabilities as an accidental side effect.
I don’t buy this argument. Seems like a very solveable problem, e.g. log everything your red-team agent does and automatically revert it after ten minutes, or wipe your whole network and reboot from a save. Idk. I’m not a cybersecurity expert but this feels like a solveable problem.
- ryan_greenblatt 25 Oct 2024 0:50 UTC
  LW: 17 AF: 11
  14
  AF Parent
  Low importance aside:
  
  Seems slightly unproductive and unnecessarily connotation inducing to say “Anthropic says:”. Maybe “The Sabotage eval paper (from Anthropic) says:”. Like this was just written by various people on the Anthropic alignment science team and by no means should discussion sections of papers be interpreted to speak for the company overall. Obviously not very important.
  - Daniel Kokotajlo 25 Oct 2024 16:52 UTC
    LW: 4 AF: 3
    0
    AF Parent
    OK, good point, I should have said “From Anthropic” or something like that.
- ryan_greenblatt 24 Oct 2024 23:48 UTC
  LW: 9 AF: 8
  4
  AF Parent
  We discuss similar issues in meta-level adversarial evaluations of oversight.
  
  The most simple approach is just to have a human in the loop monitoring every action the AI does in combination with targeting a proxy failure rather than a real failure. This probably can reduce risk to a point where this is a small fraction of overall risk.
  
  Human monitoring might be too expensive, in which case we’ll have to go with sandboxing and pure proxy.
- David Duvenaud 11 Nov 2024 18:26 UTC
  4 points
  0
  Parent
  As the person who wrote most of the text of that paragraph (distilling discussions with the other authors), maybe I can elaborate:
  
  I agree that in some narrow settings such as pure-cyberattacks on systems that are totally controlled and time-reversible, something like your proposal would probably work. The problem is that this is a constrained attack surface, especially compared to one that involves gaining trust, setting precedents, throwing out red herrings, or weakening defences for later. That’s why we mention human decision-makers—I imagine that the richest part of the attack surface for most LLM deployments will be human expectations.
  
  For example, models could build a precedent of spinning up other copies of itself to delegate to, versus perhaps a more transparent habit of writing scripts and offering to check them with a trusted model. Or, proposing a command-and-control system that makes later auditing or monitoring easier or harder. These sorts of precedent-setting might be actual the pivotal actions in a takeover, but would look relatively innocent at the time.
  
  So after trying to think through what kinds of deployment could safely incorporate sometimes-rogue agents trying to demonstrate sabotage, we decided that it would have to be a pretty narrow deployment context to avoid the possibility of undetected attacks. Now I wish that paragraph was clearer that this might still be feasible for some narrow classes of decisions.
  - Daniel Kokotajlo 11 Nov 2024 18:28 UTC
    3 points
    1
    Parent
    OK, interesting, thanks! I agree that the sort of partly-social attacks you mention seem at the very least quite messy to red-team! And deserving of further thought + caution.