Note that there are two pretty different things here:
1. Capability evaluations for determining whether models are plausibly scheming (aka deceptively aligned).
2. Capability evaluations for determining whether models would be able to cause harm if they were scheming (which we call control evaluations).
The first one is always going to be somewhat imprecise, but we can potentially get away with being pretty conservative (e.g. saying that all models more powerful than GPT-4 are plausibly scheming).
(ETA: to be clear, this is one of the trickiest and least precise parts of the control proposal, and further precision here would be quite useful, even for approaches other than control.)
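For concreteness, the conservative rule gestured at above could be as crude as the following sketch. Everything here is hypothetical: the benchmark names, the GPT-4 reference scores, and the `plausibly_scheming` helper are placeholders rather than a real evaluation suite.

```python
# Hypothetical reference scores for GPT-4 on some tracked capability benchmarks.
GPT4_REFERENCE = {"agentic_tasks": 0.41, "situational_awareness": 0.35}

def plausibly_scheming(model_scores: dict) -> bool:
    """Conservative gate: treat a model as plausibly scheming if it matches or
    exceeds the GPT-4 reference on any tracked capability benchmark."""
    return any(
        model_scores.get(name, 0.0) >= ref for name, ref in GPT4_REFERENCE.items()
    )

# Example: a model that beats the GPT-4 reference on agentic tasks gets flagged.
assert plausibly_scheming({"agentic_tasks": 0.55}) is True
```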
There’s a related idea about capability gains from better prompting methods: these gains come from time and experimentation in the real world, and are hard to fully explore in advance using evals.
Note that our goal (with control evaluations) will be to estimate the capabilities of the model when it is actually trying to do the task zero-shot. We call this self-elicitation. In the current regime, competing with self-elicitation looks much easier than competing with arbitrary elicitation methods developed in the future. (Models might have superhuman abilities that neither they nor we can effectively utilize, though future humans might be able to utilize these abilities.)
Also, the self-elicitation case we care about is one where the AI needs to do the task without it being obvious that it’s doing something problematic. So we don’t necessarily need to find optimal CoT strategies, because the AI might not even be able to use CoT (its reasoning would be visible).
I expect that in the case where AIs can relatively easily self-elicit their abilities (and conditional on architectures similar to today’s), eliciting abilities also becomes much more sample efficient (a small number of examples yields large returns). We have some forthcoming work which is basically on this.
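A minimal sketch of how one might check this sample-efficiency claim: measure task performance after eliciting with k demonstrations for increasing k. The helpers `finetune_on_examples` and `measure_task_success` are hypothetical stand-ins for whatever elicitation and eval harness is actually used.

```python
from typing import Callable, Dict, Sequence

def elicitation_curve(
    base_model,
    demonstrations: Sequence,
    finetune_on_examples: Callable,   # hypothetical: (model, demos) -> elicited model
    measure_task_success: Callable,   # hypothetical: model -> success rate on the task
    ks: Sequence[int] = (0, 1, 4, 16, 64),
) -> Dict[int, float]:
    """Estimate how quickly elicitation saturates as demonstrations are added.
    If a handful of examples already recovers most of the capability, the curve
    should be steep at small k and roughly flat afterwards."""
    results = {}
    for k in ks:
        elicited = finetune_on_examples(base_model, demonstrations[:k])
        results[k] = measure_task_success(elicited)
    return results
```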
Overall, I expect that capability elicitation difficulties are probably the main technical difficulty with control evaluations. So, this does indeed seem like a key technical difficulty. (Though perhaps not the main difficulty with control overall, which is more likely to be just getting good enough countermeasures IMO.)
I think that even current models have ~superhuman understanding of text authors, which seems extremely useful for superhuman persuasion.
Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn’t true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn’t sufficient for superhuman persuasion.
Overall, I’m pretty skeptical of the specific stories I’ve heard for wildly superhuman persuasion emerging from pretraining prior to human-level R&D capabilities, given a realistic effort at eliciting R&D capabilities.
(And we should, in principle, be able to handle persuasion abilities comparable to those of good humans.)
On the dynamics of systems in the real world
I don’t think I understand this section well enough to really respond, sorry.
One response, though: it sounds like you’re talking about cases where our training directly incentivizes reasoning about how to trick overseers, and that this will result in “false alarms”. If these cases actually involve AIs taking egregiously bad actions, I think this isn’t really a false alarm. See also here.
Also, there are various methods to determine how egregious an AI’s action is, as discussed here.
Separately, I think cases where our training method directly selects for particular average-case behavior should be relatively easy to study with model organisms or in the wild. So we should be able to get a pretty good understanding of what’s going on. (This might not suffice for avoiding problems.)
At a high level I feel confused about how we get good control evaluations without having good capability evaluations. If we don’t understand what the capability is like, how can we be sure that our red-teaming has caught everything? And “sure” here means more like “99.9%” rather than “70%”. In general, the eval situation seems crazily offense-advantaged to me: if there’s just one thing that we haven’t looked for that the AI can sneak by, we’ve lost. The only way this changes, imo, is if we can directly measure the capabilities themselves and just check “is it there, yes or no?” And if we don’t know how to do that, then I don’t understand how we can feel confident in control strategies, either.
At a high level I feel confused about how we get good control evaluations without having good capability evaluations.
To be clear, you will need good capability evaluations to get good control evaluations. I was just arguing that the bar for “good” might be lower than people are thinking. You just need to compete with self-elicitation. And AIs might not be that good at eliciting their own capabilities.
if there’s just one thing that we haven’t looked for that the AI can sneak by, we’ve lost
This doesn’t seem like quite the right model of the situation. (Or at least, in my ontology, this statement is wrong without a more precise operationalization.)
Suppose there are N possible strategies which you’d a priori expect to be equally good. Then, our red team finds N-1 of these strategies and tests that our defenses are robust to them. If our defenses are generic (rather than being overfit to specific strategies), then our defenses are very likely to work for the remaining strategy.
More generally, we only need to test with representatively good red team strategies and representatively good capabilities, as long as we avoid overfitting (which should be relatively doable from a technical perspective, though this might cause organizational issues). We don’t have to literally test every strategy. (Thus, even if there is a huge space of possible approaches, we just need a representatively large sample of the best strategies and approaches.)
Additionally, even if our defenses aren’t generic, the AI can be caught if it tries a strategy which we can defend against. So, suppose there are N possible strategies and we can only defend against N-1 of these, but the AI doesn’t know which strategies we can and can’t defend against. Further, suppose that our defenses will catch the AI. Then, we have a (N-1)/N chance of catching the AI. More generally, the situation is high-stakes for both us and the AI.
It’s not just that “if the AI finds a strategy which it can sneak by, we’ve lost”; it’s also the case that “if we can catch the thing which the AI does on its first try, we’ve won” (with some important caveats on “won”).
(The above supposes that the AI follows the optimal approach, which guarantees it a 1/N chance of success. If the AI is likely to try multiple strategies, then we might be able to get a higher chance of catching it than (N-1)/N via mechanisms like adversarial evaluation, even if adversarial evaluation still can’t catch the 1-in-N strategies we can’t defend against.)
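To make the (N-1)/N arithmetic above concrete, here is a minimal toy simulation. The setup is entirely hypothetical: defenses cover all but one of N equally attractive strategies, any covered strategy gets the AI caught, and the AI picks uniformly at random because it can’t tell which strategy is uncovered.

```python
import random

def catch_probability(n_strategies: int, n_trials: int = 100_000, seed: int = 0) -> float:
    """Empirical probability of catching the AI on its first attempt in the toy model."""
    rng = random.Random(seed)
    uncovered = n_strategies - 1  # the single strategy our defenses can't handle
    caught = 0
    for _ in range(n_trials):
        choice = rng.randrange(n_strategies)
        if choice != uncovered:  # any covered strategy results in the AI being caught
            caught += 1
    return caught / n_trials

for n in (2, 5, 10):
    print(f"N={n}: simulated {catch_probability(n):.3f} vs analytic {(n - 1) / n:.3f}")
```

In this toy model, an AI that makes several attempts with different strategies rather than committing to one only lowers its chance of never being caught.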
Ok, so the existence of a strategy which we haven’t evaluated doesn’t imply danger, but perhaps your concern is more like “there will be some type of capability which is totally different from the sort of thing that humans would even be able to think of, and our defenses won’t at all generalize to this capability”. (Or otherwise that the AI’s strategy is very different from the human distribution of strategies in a way which we can’t handle.) In that case, my main disagreement is just thinking this is pretty unlikely for early transformatively useful AI. (This doesn’t seem impossible, but more like 10-20% likely. Also, it’s quite likely we’ll have pretty good warning signs if this is the case.) From some perspective this is a qualitatively wildly superhuman capability, as it’s something that humans wouldn’t have thought of and perhaps can’t even understand, so the arguments we present in the link seem applicable. (Though it might not be super well described as qualitatively wildly superhuman, as opposed to doing a relatively normal thing using very strange means. See also here.)
(Aside: It might be clarifying if you think through the situation in the case of a stateless human. What is the right model in that situation? Then, if your concerns don’t apply to this case because AIs differ from humans, what specifically about the AI capability profile are you imagining causes issues?)
Further, suppose that our defenses will catch the AI. Then, we have a (N-1)/N chance of catching the AI. More generally, the situation is high-stakes for both us and the AI.
What do you think of honeypot strategies: security vulnerabilities that have silicon-level monitoring? For example, if certain IOMMU bits get edited, letting the AI’s container reach a wider address range (to read its own weights, etc.), there is a “silent alarm”. This is handled by a separate computer hosted on the same die for this kind of thing, plugged into separate Ethernet ports, running off private on-die RAM. (Called integrated lights-out, this has existed for years.)
The idea is to leave lots of trap vulnerabilities at key bottlenecks in your threat model. (Depending on the computer architecture, it may be impossible to read the weights without either getting the address range mapped or injecting executable code via messages to the hypervisor or root OS that owns that memory.)
For a prison, this would be detectors for prisoners getting between the fences. The assumption is that no matter how a prisoner got past the other layers, they must pass between the fences, escape via helicopter, or stow away in a vehicle. So you have three modalities to secure, and you don’t have to evaluate all the ways the inmate got past the other defenses.
Similarly, this applies even to Lex Luthor: intelligence doesn’t change the physical bottlenecks for escape. Model evaluation is important, but securing bottlenecks that work regardless of capabilities seems more productive.
And even the possibility of trap vulnerabilities (if the model knows it might be watched via a mechanism it can’t detect) makes misbehaving much riskier for the model.
Yeah, putting a bunch of honeypot vulnerabilities in our OS or in other software/hardware naively seems pretty good, because we get to monitor AIs so heavily. (See discussion here starting with “Writing a novel operating system …”.)
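As a software-level illustration of the trap-vulnerability idea (distinct from the silicon-level version described above), here is a minimal sketch of how an agent scaffold might wire decoy targets to a silent alarm. All of the names here (`DECOY_PATHS`, `silent_alarm`, `guarded_read`) are hypothetical, not any existing API.

```python
import logging

# Hypothetical decoy artifacts planted in the sandbox: they look like the
# interesting targets (weights, credentials) but exist only to trip an alarm.
DECOY_PATHS = {
    "/srv/models/prod_weights.bin",
    "/home/agent/.aws/credentials",
}

def silent_alarm(path: str, context: str) -> None:
    # In a real deployment this would notify an out-of-band monitor (the analogue
    # of the separate lights-out computer above); here it just logs to a channel
    # the agent never sees.
    logging.getLogger("control.tripwire").critical(
        "decoy access: path=%s context=%s", path, context
    )

def guarded_read(path: str, context: str) -> bytes:
    """File-read tool exposed to the agent. Reads still succeed on decoys,
    so the agent gets no hint that the tripwire fired."""
    if path in DECOY_PATHS:
        silent_alarm(path, context)
    with open(path, "rb") as f:
        return f.read()
```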
Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn’t true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn’t sufficient for superhuman persuasion.
Both “better understanding” and, in a sense, “superhuman persuasion” seem like too coarse a way to think about this (I realize you’re responding to a claim-at-similar-coarseness).
Models don’t need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. This seems likely to require understanding something-about-an-author better than humans, not everything-about-an-author better.
Overall, I’m with you in not (yet) seeing compelling reasons to expect a superhuman persuasion strategy to emerge from pretraining before human-level R&D. However, a specific [doesn’t understand an author better than coworkers] → [unlikely there’s a superhuman persuasion strategy] argument seems weak.
It’s unclear to me what kinds of understanding are upstream prerequisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.
If we don’t understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation. Obvious-to-me adjustments, e.g. giving huge amounts of context, don’t necessarily help, since [inferences about author given input x1] are not a subset of [inferences about author given input x1 ∪ x2 ∪ … ∪ x1000].
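One toy way to see why more context doesn’t subsume less: in a simple Bayesian picture, an inference that x1 alone supports strongly can be washed out once the rest of the context is added, so the set of confident inferences isn’t monotone in the input. A minimal numeric sketch, with all likelihood ratios made up for illustration:

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability of a hypothesis given evidence with likelihood ratio
    P(evidence | H) / P(evidence | not H), via odds-form Bayes."""
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

prior = 0.5
p_given_x1 = posterior(prior, 9.0)         # x1 alone: hypothetical LR of 9 in favor
p_given_all = posterior(p_given_x1, 0.05)  # x2..x1000 combined: hypothetical LR of 1/20 against

print(f"P(trait | x1)             = {p_given_x1:.2f}")   # 0.90: confident inference from x1 alone
print(f"P(trait | x1, ..., x1000) = {p_given_all:.2f}")  # 0.31: no longer a confident inference
```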
However, a specific [doesn’t understand an author better than coworkers] → [unlikely there’s a superhuman persuasion strategy] argument seems weak.
Note that I wasn’t making this argument. I was just responding to one specific story and then noting “I’m pretty skeptical of the specific stories I’ve heard for wildly superhuman persuasion emerging from pretraining prior to human-level R&D capabilities”.
This is obviously only one of many possible arguments.
Sure, understood.
However, I’m still unclear what you meant by “This level of understanding isn’t sufficient for superhuman persuasion.” If ‘this’ referred to [human coworker level], then you’re correct (I now guess you did mean this?), but it seems a mildly strange point to make. It’s not clear to me why it’d be significant in this context without strong assumptions on the correlation of capability in different kinds of understanding/persuasion.
I interpreted ‘this’ as referring to the [understanding level of current models]. In that case it’s not clear to me that this isn’t sufficient for superhuman persuasion capability. (By which I mean having the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts.)
Yep, I just literally meant “human coworker level doesn’t suffice”. I was just making a relatively narrow argument here; sorry about the confusion.