ryan_greenblatt comments on AI Control: Improving Safety Despite Intentional Subversion

ryan_greenblatt 14 Dec 2023 0:06 UTC
7 points
2
See appendix K for discussion of comments:

We didn’t want to need to heavily optimize the red team’s commenting ability, so we stripped comments for monitoring. For trusted editing, we also strip comments, but we empirically find that it improves performance to get GPT-3.5 to re-comment the code prior to ~~using it for inspiration~~ editing it, so we have it do this.

Overall, I think it’s unlikely, but plausible that requiring comments would improve performance (assuming sufficient red team optimization).

For our current monitoring prompts used for T, retaining comments didn’t seem to immediately improve performance (we currently initially draft backdoored solutions with comments and then remove). But, these monitoring prompts aren’t well optimized to leverage signs that U is lying.

However, see the section called “Extending trusted monitoring with untrusted justification”. We found that asking U for justification didn’t improve performance after the red team did sufficient prompt optimization. I imagine this also applies to comments, though it isn’t obvious this would transfer. Thus, with sufficient red team optimization, requiring comments might not improve performance.

(Aside: to get GPT-4 to lie effectively in justification, we found that the best approach was to use extensive gaslighting^[1] to convince it of false claims and then get it to write justification. This might not be required for models which aren’t trained for honesty.)

Overall, my take is that requiring U to comment the code and optimizating this process extensively (for the both the blue and red teams) wouldn’t result in higher safety. Though, I’m not very confident.
1. ↩︎
  I think the ethics here are acceptable given the overall situtation, but it’s plausible I will feel differently about gaslighting future more powerful models.