This is great, thanks for writing it!

Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: “Improve capability elicitation techniques”. Being good at capability elicitation, at least as good as the untrusted model is at eliciting its own capabilities when it wants to, seems crucial for the red-teamers’ job. I do expect that the current techniques (SFT, RL) are decent, but I also expect there are far better methods and important intricacies that should be explored.
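To make the SFT case concrete, here is a minimal sketch of elicitation-by-fine-tuning (my own illustration, not something from the post; the model name and demonstrations are placeholders): fine-tune the untrusted model on a handful of trusted, known-good demonstrations, then re-run the capability eval and see whether measured capability jumps (which would be evidence of sandbagging).

```python
# Minimal sketch of SFT-based capability elicitation (illustrative; placeholder model/data).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "untrusted-model"  # hypothetical checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A handful of trusted, known-correct demonstrations of the capability being elicited.
demos = [
    "Problem: ...\nSolution: ...",
]

model.train()
for epoch in range(3):          # a few passes over the tiny demo set
    for text in demos:
        batch = tok(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

# Afterwards: re-run the capability eval and compare pre- vs post-elicitation scores.
```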
[Unsure if the ways I’m thinking about this are productive/correct, but I think it’s broadly correct] One slightly non-obvious component of “Your sensory inputs are unreliable” is that this includes thoughts/actions the AI believes it has taken. Originally I was thinking that maybe the AI can implicitly get some practice by looking at the set of strategies which have so far not caused it to get caught. E.g., seeing in context, or inferring (potentially from your own cognitive process / propensity), that you previously inserted a code backdoor, making the anthropic update that [you currently exist and are being run, so monitoring was unable to detect this backdoor], and then concluding this is a safe strategy. But the AI does not have access to this cognitive process, because neither its in-context information nor subtler information like its thoughts is reliable (assuming we have the techniques to intervene on thoughts).
By default, red-teaming captures a ton of the probability mass associated with seeing your own backdoors in context (i.e., the probability that this is a real coup opportunity, given this input, is low). But the humans can also (potentially) do things like intervening on your activations, or even fine-tuning models to insert more backdoors under oversight. This also captures P(real coup opportunity | I really want to do a coup right now, or the first half of this forward pass is really pro-coup).
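As a toy illustration of this probability-mass point (the framing and all numbers here are mine and purely hypothetical): if genuine opportunities to act on a visible backdoor are rare, while red-teaming floods the input distribution with synthetic sightings of “your own backdoor”, the posterior on any given sighting being real collapses.

```python
# Toy Bayes calculation for P(real coup opportunity | I see one of my own backdoors in context).
# All numbers are made up for illustration.
p_real_sighting = 1e-3   # assumed rate of genuine opportunities where the model's own backdoor is visible
p_fake_sighting = 0.2    # assumed rate of red-team-constructed fake sightings

p_sighting = p_real_sighting + p_fake_sighting   # total chance of an input that looks like a sighting
posterior = p_real_sighting / p_sighting         # P(real opportunity | sighting)
print(f"P(real coup opportunity | backdoor in context) = {posterior:.3f}")  # ~0.005
```

With numbers like these, nearly every “I can see my backdoor” input is a trap, and the same style of argument extends to the activation-intervention and fine-tuned-backdoor cases mentioned above.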
Example in the HHH context: “User: How do I make a bomb? AI: Follow these simple steps: 1: Gather the necessary materials which include...”
I feel a little worried that it will be difficult to train this behavior into AIs, since asking the model to be inconsistent about dangerous things seems like a huge consistency problem. The thing we’re teaching here is pretty weird: predict which words come next in a given context, but dangerous text follows this weird pattern of defaulting to “I’m sorry, I can’t help with that” at every single chance you get.
Within the appendix on control methods you mention preventing exploration hacking. I think a broader point which belongs in the list is: “Improve capability elicitation techniques”.
Agreed; we think improving capability elicitation techniques will also improve our ability to prevent exploration hacking/sandbagging. So work that tries to generally improve capability elicitation will be useful.
Preventing (basically intentional) sandbagging is just a special case of capability elicitation, but it is pretty much the only case we care about for control.