A key question is to what extent setups of this type can in practice preserve nice features, both in alignment and in other capabilities, and how well those results will then generalize and survive out of distribution as the capabilities of the underlying systems scale. If we can get nice enough properties, we can do various forms of amplification, and the sky is the limit. I am deeply skeptical that we can get such properties where it matters. Some others are more hopeful.
A key hope I have for this type of research is that we can test our techniques on the actual powerful models we’re worried about, using domains very similar to the ones we care about. This can be done as long as we can find very similar domains (including similar in difficulty) where we happen to have ground truth (or some other held-out signal for validation). For instance, perhaps we can use string theory as a testbed for theoretical alignment work: we can potentially use held-out string theory experts to see if we can reproduce important results using our techniques as an analogy for alignment research. In practice, I’m reasonably optimistic about finding such domains which are quite similar up through moderately superhuman models.
(Note that my perspective here might be pretty different than the perspective of the OpenAI people and that I don’t work there.)
Alternatively, if we’re worried about measurement tampering type concerns, we can potentially hold out some sources of validation (additional measurements) on our actual domain of interest for testing.

These sorts of approaches are actually just cases of sandwiching, and we discuss this type of evaluation in more detail here. Also note that we can use the exact same testbeds to test scalable oversight techniques and generalization techniques (this link is to the same post as linked earlier, but not a specific section).
Thus, our final technique testing doesn’t need to generalize across model capabilities, just to generalize across (hopefully) very similar domains. And we should be able to use these similar domains to find hyperparameters and otherwise make relatively fine-grained choices. We can even potentially try to mostly train our model on domains where we have ground truth and then generalize to the domains we actually care about (though there are some complications with full training as we’ll discuss below). We’ll need to test for overfitting by using cross-validation across multiple such domains. (Cross-validation seems particularly important for very parameter-heavy approaches like training the entire model.)
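To make the cross-validation idea concrete, here is a minimal sketch of leave-one-domain-out evaluation for selecting techniques and hyperparameters. Everything here (the Domain container, fit_technique, score) is a hypothetical placeholder rather than a real benchmark or API; the point is just that a technique is fit on all-but-one domain and judged on the held-out domain’s validation signal.

```python
# Minimal sketch of leave-one-domain-out cross-validation for technique selection.
# All names here (Domain, fit_technique, score) are hypothetical placeholders.
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable


@dataclass
class Domain:
    name: str
    train_data: list      # examples we can label or weakly supervise
    heldout_labels: list  # ground truth (or some other held-out validation signal)


def leave_one_domain_out(
    domains: list[Domain],
    fit_technique: Callable[[list[Domain]], object],
    score: Callable[[object, Domain], float],
) -> dict[str, float]:
    """Fit the technique (and its hyperparameters) on all-but-one domain,
    then score it on the held-out domain's validation signal."""
    scores = {}
    for i, held_out in enumerate(domains):
        fit_domains = domains[:i] + domains[i + 1:]
        technique = fit_technique(fit_domains)
        scores[held_out.name] = score(technique, held_out)
    return scores


def summarize(scores: dict[str, float]) -> None:
    # With few domains the variance estimate is itself noisy, so report the
    # spread explicitly rather than trusting the mean alone.
    vals = list(scores.values())
    print(f"mean={mean(vals):.3f}  stdev={stdev(vals):.3f}  n_domains={len(vals)}")
```

With only a handful of domains, the cross-validation estimate is itself high variance, which is one reason to care about getting a reasonably large sample of similar domains (as discussed in the Q&A below).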
This still leaves us with some questions:
Q: When can we reasonably expect technique testing to generalize across very similar domains given that cross-validation across such domains looks good?
A: I think technique testing will probably work well unless our model is intentionally trying to sabotage our evaluations (perhaps due to scheming/deceptive alignment). Models will likely be able to figure out what sorts of held-out signals we might have on various domains, and thus scheming models could sabotage our evaluations. So, I don’t expect these sorts of methods to work for scheming AIs, but otherwise, they might work quite well. (We still need to ensure we get a large enough sample size of domains so that cross-validation is low variance, etc.) To handle scheming AIs, we’ll probably need a pretty different approach, at least for evaluation. (I’m optimistic about AI control here.)
Q: Same question as before, but what about actually fully training the model on some domains and then generalizing to the domain we care about?
A: I get somewhat concerned that even non-scheming models might naturally learn to predict “what would humans think the answer is based on all of the held-out validation they have” and then this will look like it generalizes well according to cross-validation while it actually fails. This seems especially likely if there are flaws in the ground truth labeling process which are predictable to the model. In theory, this concern could also apply to technique testing if “what would humans think the answer is based on the held-out validation they have” is extremely salient, but I think this difficulty should be avoidable in practice given how much less search is applied.
Q: Even if we can test how well training across domains works using cross-validation, how well will this sort of technique work in practice?
A: First, note that in all domains of interest, we probably still want to train on “easy” cases we know we can label, or otherwise do a bit of training; otherwise, we might not get any generalization at all. Even with this, it’s still unclear how flexibly models will actually generalize on the basis of (e.g.) instructions about a new domain. Another concern is that in many domains, it will be feasible to get some held-out validation which works for testing, but this validation will be extremely non-robust to training. For instance, in the context of measurement tampering detection, we can probably have a bunch of redundant measurements (e.g., 5) and then hold out some of our measurements as validation (we can even probably do this testing on our actual domain of interest, as noted earlier). But if we train against more of our measurements, then they will just also be tampered with. So, training on a bunch of different domains with easy-to-hack validation is unlikely to generalize well (in fact, it might make generalization worse than if those domains were omitted).
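As a concrete illustration of the held-out measurement idea (not the actual measurement tampering detection methods we use), here is a minimal sketch in which 5 redundant binary measurements are split into trained-on and held-out groups, and episodes that pass the trained-on measurements but fail the held-out ones get flagged as suspected tampering. The array layout, the split, and the threshold are all illustrative assumptions.

```python
# Minimal sketch of using held-out redundant measurements as a validation signal
# for measurement tampering. The data layout and threshold are illustrative
# assumptions, not taken from the post.
import numpy as np

N_MEASUREMENTS = 5               # e.g., 5 redundant sensors per episode
TRAIN_MEASUREMENTS = [0, 1, 2]   # measurements the policy/detector is trained against
HELDOUT_MEASUREMENTS = [3, 4]    # reserved purely for validation


def flag_suspected_tampering(measurements: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """measurements: (n_episodes, N_MEASUREMENTS) array of binary readings.

    Flag episodes where the trained-on measurements look good but the held-out
    measurements disagree -- a crude proxy for tampering. This only works while
    the held-out measurements are never optimized against; once we train on them
    too, they can be tampered with just like the others.
    """
    trained_ok = measurements[:, TRAIN_MEASUREMENTS].mean(axis=1) > threshold
    heldout_ok = measurements[:, HELDOUT_MEASUREMENTS].mean(axis=1) > threshold
    return trained_ok & ~heldout_ok


# Example: episode 0 looks fine on all sensors; episode 1 passes the trained-on
# sensors but fails both held-out ones, so it gets flagged.
episodes = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
])
print(flag_suspected_tampering(episodes))  # [False  True]
```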
Q: Sure, but at some capability level, we’ll no longer be able to find similar domains with ground truth (or good enough validation), right? (Regardless of the scheming concern.)
A: This seems mostly true. I think the exact capability level where this starts really biting is probably around “moderately superhuman” (perhaps >3 sigmas smarter than John von Neumann insofar as g makes some sense for AIs), though I’m unsure. In the case of (low-stakes) measurement tampering concerns, we should often be able to use held-out measurements for validation even for very smart models. A similar approach can be used to test for certain types of reward hacks in some other domains also.
Q: Sure, but will we even be able to find techniques that help much in the cases we care about?
A: I think there are good reasons to expect that some versions of scalable oversight style techniques like debate will be reasonably helpful in some important cases (though it’s very unclear if these approaches will be sufficient). (Recall that scalable oversight and W2SG are compatible approaches to the same problem and can be tested using the same methodology.) Various types of (supervised) consistency checks also seem pretty useful in some types of domains. It’s much less clear that we can use the internals of models to get better performance. If we made no structural assumptions about models’ internals and assumed the worst case about their internal structure, we wouldn’t be able to accomplish anything, so we will have to depend on some type of structure in the internals of models. In practice, I’m moderately optimistic that we can find domain-specific structure to leverage in many important cases, but far less optimistic about domain-agnostic approaches. For instance, see “Why solving measurement tampering may be notably easier than eliciting arbitrary knowledge”. We plan on writing a post which compares the structural hopes we have in the measurement tampering detection case to the hopes discussed in the OpenAI paper. Even if it’s not clear that we can get huge gains here, picking up the low-hanging fruit with internals-based methods still seems pretty good to me.
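As one toy example of the kind of supervised consistency check mentioned above, here is a minimal sketch that checks whether a model’s reported probability for a statement and for its negation sum to roughly 1. The p_true interface and the toy stand-in model are hypothetical; real consistency checks would cover many more relations than negation.

```python
# Minimal sketch of one simple consistency check: a model's probabilities for a
# statement and its negation should sum to roughly 1. `p_true` is a hypothetical
# stand-in for querying the model.
from typing import Callable


def negation_consistency_gap(
    p_true: Callable[[str], float],
    statement: str,
    negation: str,
) -> float:
    """Return |P(statement) + P(negation) - 1|; larger gaps suggest the model's
    reported beliefs are incoherent (or not being reported honestly)."""
    return abs(p_true(statement) + p_true(negation) - 1.0)


# Example with a toy stand-in model:
toy_probs = {"The reactor is shut down.": 0.9, "The reactor is not shut down.": 0.4}
gap = negation_consistency_gap(
    toy_probs.get, "The reactor is shut down.", "The reactor is not shut down."
)
print(f"consistency gap: {gap:.2f}")  # 0.30
```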