If we can construct honeypots that are indistinguishable (to the policy) from real opportunities, and do this at scale, then that could suffice: we could drive the policy’s P(deployment | great opportunity) very low. (This depends on how many high-stakes honeypots we can create and how many real opportunities the model will see in deployment.)
Yes totally. (This is impractical if you think that the model has a very large number of opportunities to attack, e.g. if every time it’s programming is an opportunity to attack.)
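To make the scaling point concrete, here is a minimal sketch of the arithmetic. It assumes honeypots are perfectly indistinguishable from real opportunities, so the policy's best estimate of P(deployment | great opportunity) is just the fraction of opportunities that are real. The function name and the example counts are hypothetical, chosen only for illustration.

```python
def p_deployment_given_opportunity(n_real: int, n_honeypots: int) -> float:
    """Fraction of great-looking opportunities that are real.

    Assumption (not from the thread): honeypots are perfectly
    indistinguishable, so the policy can only use base rates.
    """
    if n_real < 0 or n_honeypots < 0 or n_real + n_honeypots == 0:
        raise ValueError("need non-negative counts and at least one opportunity")
    return n_real / (n_real + n_honeypots)


# Few real opportunities: a modest number of honeypots pushes the posterior low.
few_real = p_deployment_given_opportunity(n_real=10, n_honeypots=990)

# Many real opportunities (e.g. every programming task): the same honeypot
# budget barely moves the posterior.
many_real = p_deployment_given_opportunity(n_real=100_000, n_honeypots=990)

print(f"few real opportunities:  {few_real:.4f}")
print(f"many real opportunities: {many_real:.4f}")
```

Under these assumptions, the honeypot budget has to be large relative to the number of real opportunities, which is why the approach looks impractical when nearly every task is an opportunity to attack.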
If we can construct fake attacks that are indistinguishable (to the untrusted monitor) from real attacks, then that possibly suffices: we could drive the monitor’s prior P(deployment | attack) very low.
Indeed; as you know, this is probably easier than driving the policy’s P(deployment | great opportunity) sufficiently low, because untrusted monitors have to defect reliably.
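One way to see why needing to defect reliably hurts the monitor is a toy model. This is a sketch under strong assumptions that are not from the thread: the monitor cannot tell fake attacks from real ones, it stays silent on each attack independently with probability `s`, it is tested on `k` fake attacks before a single real attack, and staying silent on any fake attack gets it caught. All names here are hypothetical.

```python
def collusion_success(s: float, k: int) -> float:
    """P(monitor flags all k fake attacks AND stays silent on the real one).

    Toy model: silence on each attack is an independent coin flip with
    probability s, because fake and real attacks look identical.
    """
    if not 0.0 <= s <= 1.0 or k < 0:
        raise ValueError("s must be in [0, 1] and k must be non-negative")
    return (1.0 - s) ** k * s


def best_silence_rate(k: int) -> float:
    """Silence rate maximizing collusion_success; setting the derivative
    of (1 - s)^k * s to zero gives s = 1 / (k + 1)."""
    return 1.0 / (k + 1)


for k in (10, 100, 1000):
    s_star = best_silence_rate(k)
    print(f"k={k:5d}  best s={s_star:.5f}  success={collusion_success(s_star, k):.5f}")
```

In this toy model the monitor's best achievable success probability shrinks roughly in proportion to 1/k, so even a moderate number of convincing fake attacks makes reliable collusion hard. A policy, by contrast, can afford to attack rarely, which is part of why the monitor case looks easier for the defender.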
If you are saying that we don’t need either of these—i.e. that we can get by with honeypots/attacks that are noticeably fake—then that’s news to me, and I don’t see how you can do that.
What I was trying to say is that if you don’t have the ability to generate tasks that look real, you might still be able to honeypot your monitor by taking real trajectories and editing them to contain fake attacks (that you hope the monitor will be unable to distinguish from real attacks). I discuss this here.
Thanks, that clears it up. Indeed I agree that “take real trajectories, then edit the attacks in” is relatively promising, and didn’t track this when writing my first comment (oops) - thanks for flagging it.