Thanks for sharing these thoughts. I’m particularly excited about the possibility of running empirical experiments to better understand potential risks of ML systems and contribute to debates about difficulties of alignment.
1. Potential implications for optimistic views on alignment
If we observe systems that learn to bullshit convincingly, but don’t transfer to behaving honestly, I think that’s a real challenge to the most optimistic views about alignment and I expect it would convince some people in ML.
I’m most interested in this point. IIUC, the viewpoint you allude to here is something along the lines of “There will be very important decisions we can’t train directly for, but we’ll be able to directly apply ML to these decisions by generalizing from feedback on easier decisions.” We can call this “optimism about generalization” (of honesty, on out-of-distribution tasks). I’m generally pretty skeptical about this position (see point 2 below).
OTOH, there is a different reason for optimism, along the lines of “The set of decisions where ‘humans never get the right answer’ is small to none, and we don’t need to use ML directly for these cases, and this situation is helped dramatically by the fact that we can use ML for questions we can supervise to indirectly help us with these questions.” For example, in amplification, ML systems can summarize important arguments on both sides of thorny questions, produce scientific insights we can directly evaluate, guide us through stacks of sub-questions, and so on. We can call this “optimism about scalable oversight”.
My view is that a negative result in experiments proposed here would point against optimism about generalization, but not against optimism about scalable oversight. (I’m curious if this seems right to you.) And I imagine your view is that (a) this is still useful, since optimism about generalization is fairly common (among ML researchers you talk to) (b) we should in fact currently have some uncertainty about optimism about generalization which this would address (see point 2 below) and (c) in the limit, scalable human feedback might not be competitive with more heavily generalization-reliant approaches, and so we need to better understand these generalization questions too.
2. Skepticism about optimism about generalization (of honesty, on out-of-distribution tasks)
I’d fairly strongly expect a negative result from these experiments (as a really rough number off the top of my head, >80%), and am also generally pessimistic about the broader optimism-about-generalization view. For these experiments, this depends in large part on what’s considered a positive or negative result.
I agree that very plausibly:
The model will generalize to answering the OOD tasks with significantly non-zero accuracy.
The results will be impressive to many people (“Look how well this model generalizes to tasks it’s never directly trained on!”)
At the same time:
It seems very unlikely that counting on generalization alone would actually perform as well as directly supervising model outputs (e.g. on tone description tasks). Almost certainly, some fraction of the model responses are going to be “plausible looking but incorrect” (and this fraction will be larger than for the directly supervised model).
There is some ambiguity over whether this should count as a negative result. I think we’ve discussed in the past that the comparison should ideally be between the generalization-only model, and the model which is directly supervised on the tone task but “without acquiring any new information” (for some fuzzy notion here). But if we’re willing to say that this task is one where “the capabilities of the model do already generalize”, then it seems this would be a negative result.
And more broadly, I think most people would agree that if we had a model which output “plausible looking but incorrect” responses a substantial fraction of the time, it’d be irresponsible to use such a model for important societal-scale decisions. (I’d argue that this would also be “bad practice” in many smaller stakes tasks.)
Training for plausibility or coherence
I’m pretty wary about this, and we should hold such approaches to a high standard of proof. We already have plenty of examples of what happens when you optimize for “plausible but not-necessarily-correct”: you end up with plausible but not-necessarily-correct outputs. (So in general, I think we should be cautious about claiming that any particular generalization feature will fix this, unless we have a principled reason for thinking so.)
I realize that the key point is that honesty might generalize from the tasks where we can directly supervise for correctness. But even then, it seems that if you think the basic generalization story won’t work, you shouldn’t expect training for plausibility to rescue it; we should expect the responses to become much more plausible, and somewhat (but not completely) more correct. So I’d be wary of whether training for plausibility would be masking the problem without addressing the root cause.
I’m most interested in this point. IIUC, the viewpoint you allude to here is something along the lines of “There will be very important decisions we can’t train directly for, but we’ll be able to directly apply ML to these decisions by generalizing from feedback on easier decisions.”
That’s basically right, although I think the view is less plausible for “decisions” than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).
I think a realistic approach would need to use generalization in some situations (where we expect it to work) and then use the facts-that-generalize as an input into supervision. For example, if you were able to answer empirical questions about what’s happening right now, you could use those as an input into debate/amplification.
(This may also make it more clear why I’m interested in coherence conditions where you can’t supervise—in some sense “use the stuff that does generalize as an input into amplification” is quite similar to saying “impose a coherence condition amongst the stuff you can’t directly supervise.”)
OTOH, there is a different reason for optimism, along the lines of “The set of decisions where ‘humans never get the right answer’ is small to none, and we don’t need to use ML directly for these cases, and this situation is helped dramatically by the fact that we can use ML for questions we can supervise to indirectly help us with these questions.” For example, in amplification, ML systems can summarize important arguments on both sides of thorny questions, produce scientific insights we can directly evaluate, guide us through stacks of sub-questions, and so on. We can call this “optimism about scalable oversight”.
“Optimism about scalable oversight” is what I’m usually thinking about, but it does seem to me that there are some cases where it is inadequate. You could hope to play a quantitative/empirical game of getting lots of useful work out of AI before this kind of approach breaks down, but I am interested in whether there’s a chance at going straight for an indefinitely scalable approach to alignment.
My view is that a negative result in experiments proposed here would point against optimism about generalization, but not against optimism about scalable oversight. (I’m curious if this seems right to you.) And I imagine your view is that...
That seems right to me and that is a reasonable description of my view.
(I’d be curious to know if you don’t encounter a lot of optimism-about-generalization.)
I’d fairly strongly expect a negative result from these experiments (as a really rough number off the top of my head, >80%), and am also generally pessimistic about the broader optimism-about-generalization view. For these experiments, this depends in large part on what’s considered a positive or negative result.
For the purposes of “indefinitely scalable alignment approach” the relevant threshold is something quite ambitious like “reflects everything the system knows.”
But if we’re willing to say that this task is one where “the capabilities of the model do already generalize”, then it seems this would be a negative result.
I think there’s a further question about whether the model knows how to talk about these concepts / is able to express itself. This is another part of the motivation for plausibility constraints (if nothing else they teach the model the syntax, but then when combined with other knowledge about language the coherence conditions seem like they probably just pin down the meaning unambiguously).
And more broadly, I think most people would agree that if we had a model which output “plausible looking but incorrect” responses a substantial fraction of the time, it’d be irresponsible to use such a model for important societal-scale decisions. (I’d argue that this would also be “bad practice” in many smaller stakes tasks.)
Yes, I think the fact that people would agree with this is important if the demonstration is to move anyone toward the pessimistic side.
But even then, it seems that if you think the basic generalization story won’t work, you shouldn’t expect training for plausibility to rescue it; we should expect the responses to become much more plausible, and somewhat (but not completely) more correct. So I’d be wary of whether training for plausibility would be masking the problem without addressing the root cause.
I’m not so pessimistic. There are lots of “bad” ways to answer questions, and as we add more and more stringent plausibility checks we eliminate more and more of them. Eventually we get down to the hard core of “stuff that would pass any number of plausibility checks.” Then the question is whether the truth is the easiest-to-learn way to pass all plausibility checks.
This seems much easier than getting the truth out of the gate. One way of poking at the intuition is that the initial goal is hitting an isolated point in a giant high-dimensional space of all question-answering policies. But after selecting for “passing all plausibility checks” it seems plausible that you are basically left with a set of isolated points and the question is just which of them you got.
(I think that literal picture is too optimistic, but the point is that in some sense you have to be doing a lot of work to look fully coherent and that space is a lot smaller and different from getting random stuff from a neural network off distribution.)
You may say “In that case the plausibility checks alone should work,” but I think that’s not so clear either: the set of points may still be quite big, and generalization was the main reason to expect an inductive bias towards truth without making strong claims about the pre-training (for GPT-3 in particular the pre-training bias is fairly likely to be adequate, but I’d be quite a bit more skeptical in other domains).
Overall I do think it’s >50% that if the whole thing works, one of the two pieces worked independently.
Thanks for these thoughts. Mostly just responding to the bits with questions/disagreements, and skipping the parts I agree with:
That’s basically right, although I think the view is less plausible for “decisions” than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).
I’m curious what factors point to a significant difference regarding generalization between “decisions” and “unsupervised translation”. Perhaps there is a more natural concept of “honesty” / “truth” for unsupervised translation, which makes it more likely. But this is very fuzzy to me, and I’m curious why (or if) it’s clear to you there is a big difference between the two cases.
My model is that “if honesty doesn’t generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely”. In other words, on a simplified model where honesty generalization goes from 0 to 1, it seems likely it is somewhat higher for translation tasks, but unlikely it is exactly 1.
(This may also make it more clear why I’m interested in coherence conditions where you can’t supervise—in some sense “use the stuff that does generalize as an input into amplification” is quite similar to saying “impose a coherence condition amongst the stuff you can’t directly supervise.”)
IIUC, the analogy you’re drawing here is that amplification is playing a similar role to the coherence condition, where even if we can’t supervise the model response in full, we can at least check it is consistent with the inputs into amplification (I might be misunderstanding). These seem different though: in amplification, we are concerned about dishonesty in the inputs into amplification, whereas in these experiments, we’re concerned about dishonesty in the model’s outputs. I also feel substantially more optimistic about amplification in the scalable oversight story: we check for accuracy of amplification’s outputs, assuming the inputs are accurate, and we separately are checking for accuracy of inputs, via the same recursion.
(I’d be curious to know if you don’t encounter a lot of optimism-about-generalization.)
I often encounter a somewhat similar view from people optimistic about multi-agent approaches (e.g. we can train agents to be cooperative in a broad set of simulations, and transfer this to the real world). I am even more skeptical about this strong-optimism-about-generalization.
Among folks excited about direct RL, I think it is more common to say “we will be very careful about our optimization targets” and have a view more like optimism about scalable oversight.
(My assessment may also be biased. To the extent that people are optimistic about both, or have different ways of thinking about this, or haven’t thought too much about the indefinitely scalable case and so have difficulty articulating what their intuitions are, I am inclined to extend a “charitable” interpretation and assume their view is closer to optimism-about-oversight, because that’s the view I consider most plausible. If I pushed harder against optimism-about-oversight in conversations, I expect I’d get a more accurate picture.)
I think there’s a further question about whether the model knows how to talk about these concepts / is able to express itself. This is another part of the motivation for plausibility constraints (if nothing else they teach the model the syntax, but then when combined with other knowledge about language the coherence conditions seem like they probably just pin down the meaning unambiguously).
I don’t think I follow. Doesn’t the model already know syntax? If that plus the “other knowledge about language… pins down the meaning unambiguously”, it feels like basically all the work came from the “other knowledge about language”, and I’m not sure what the coherence conditions are adding.
Regarding optimism about generalization:
I’m not so pessimistic. There are lots of “bad” ways to answer questions, and as we add more and more stringent plausibility checks we eliminate more and more of them...
I think I share your general picture that it is very difficult to “get truth out of the gate”, but reach the opposite conclusion:
I am indeed skeptical that you’d get truth out of the gate (without training for plausibility). I think that training for plausibility wouldn’t fix this (though it would definitely make any remaining inaccuracies harder to spot). This is basically the same claim as above (and below); while it likely decreases the fraction of inaccurate responses, it’s unlikely to get rid of all of them.
The ideal way to overcome this difficulty is to directly select for accuracy (though we agree here).
One way of poking at the intuition is that the initial goal is hitting an isolated point in a giant high-dimensional space of all question-answering policies. But after selecting for “passing all plausibility checks” it seems plausible that you are basically left with a set of isolated points and the question is just which of them you got.
Perhaps a core claim is that we are “basically left with a set of isolated points”. I don’t buy this claim (this disagreement is possibly also related to my second bullet above on “My model is that...”). It seems like the plausibility checks probably cut out some dimensions from the original high-dimensional space, but picking out “honest” behavior is still woefully underdefined for the model. For instance, it seems like “completely honest” is training-consistent, so is “coherent-but-often-inaccurate”, and so is every level of inaccuracy between these two points.
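To poke at this disagreement concretely, here is a toy sketch (entirely my own construction, not anything from the proposed experiments): if we caricature question-answering policies as vectors and each consistency check as one linear constraint, the policies passing all checks form an affine subspace, and that subspace remains a continuum unless the checks are numerous and independent enough to cut every dimension.

```python
# Toy model: a "policy" is a point in R^5; each consistency check is one
# row of C, requiring C @ x = 0. The surviving set has dimension
# 5 - rank(C), so redundant checks leave a continuum of "coherent"
# policies rather than a set of isolated points.
import numpy as np

n_dims = 5
C = np.array([
    [1.0, -1.0,  0.0, 0.0, 0.0],  # answers to Q1 and Q2 must agree
    [0.0,  1.0, -1.0, 0.0, 0.0],  # answers to Q2 and Q3 must agree
    [1.0,  0.0, -1.0, 0.0, 0.0],  # redundant: implied by the first two
])
rank = np.linalg.matrix_rank(C)
free_dims = n_dims - rank  # dimensions the checks leave unconstrained
print(rank, free_dims)     # only 2 independent checks; 3 free dimensions remain
```

On this caricature, three checks only pin down two dimensions, which matches my picture of “cut out some dimensions” while leaving honesty underdefined; the disagreement is then about whether realistic coherence checks behave more like independent constraints that eventually exhaust the space.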
Overall I do think it’s >50% that if the whole thing works, one of the two pieces worked independently.
It seems we’ve already done the experiments on training for plausibility without leveraging generalization across tasks (e.g. generic specification gaming examples, or summarization with less careful human raters). So I’d put <1% on that possibility. This estimate seems consistent with the view that across-task-generalization plausibly works out of the box (though I also consider that <<50%).
I’m curious what factors point to a significant difference regarding generalization between “decisions” and “unsupervised translation”. Perhaps there is a more natural concept of “honesty” / “truth” for unsupervised translation, which makes it more likely. But this is very fuzzy to me, and I’m curious why (or if) it’s clear to you there is a big difference between the two cases.
For honest translation from your world-model to mine (or at least sufficiently small parts of it), there is a uniform intended behavior. But for decisions there isn’t any intended uniform behavior.
My model is that “if honesty doesn’t generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely”. In other words, on a simplified model where honesty generalization goes from 0 to 1, it seems likely it is somewhat higher for translation tasks, but unlikely it is exactly 1.
This is not clear to me (and it seems like we get to check).
These seem different though: in amplification, we are concerned about dishonesty in the inputs into amplification, whereas in these experiments, we’re concerned about dishonesty in the model’s outputs.
I’m not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another—so if B is “supposed to be” a deterministic function of A, then consistency guarantees that B is good if A is good.
I don’t think I follow. Doesn’t the model already know syntax? If that plus the “other knowledge about language… pins down the meaning unambiguously”, it feels like basically all the work came from the “other knowledge about language”, and I’m not sure what the coherence conditions are adding.
I don’t think the model necessarily “knows” how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language. The point of a plausibility/coherence condition is to provide enough constraint to pin those things down. You fully learn what kind of sentences we were looking for when we asked about tone, which might have just been totally non-obvious initially. And you learn a bunch of things about how different concepts about tone are supposed to relate to one another (in order to make sure all of your utterances are consistent).
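To make the “pin down the meaning” intuition concrete, here is a toy sketch (a hypothetical setup of my own, not the actual proposal): if a coherence condition requires a symbol-to-word mapping to preserve known relations on both sides, then with enough relations a brute-force search over all candidate mappings can be left with exactly one survivor, so the relations alone fix the meaning.

```python
# Toy: internal symbols with an "is strictly milder than" relation must be
# mapped to words so that every relation is preserved. We enumerate all
# mappings and keep the coherent ones.
from itertools import permutations

symbols = ["s0", "s1", "s2", "s3"]
words = ["cold", "cool", "warm", "hot"]
sym_rel = {("s0", "s1"), ("s1", "s2"), ("s2", "s3")}
word_rel = {("cold", "cool"), ("cool", "warm"), ("warm", "hot")}

consistent = []
for perm in permutations(words):
    mapping = dict(zip(symbols, perm))
    if all((mapping[a], mapping[b]) in word_rel for a, b in sym_rel):
        consistent.append(mapping)

print(len(consistent))  # exactly one mapping survives the coherence checks
print(consistent[0])    # {'s0': 'cold', 's1': 'cool', 's2': 'warm', 's3': 'hot'}
```

Of course, whether real tone concepts come with enough mutually constraining relations to force a unique mapping is exactly the point in dispute.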
Perhaps a core claim is that we are “basically left with a set of isolated points”. I don’t buy this claim (this disagreement is possibly also related to my second bullet above on “My model is that...”). It seems like the plausibility checks probably cut out some dimensions from the original high-dimensional space, but picking out “honest” behavior is still woefully underdefined for the model. For instance, it seems like “completely honest” is training-consistent, so is “coherent-but-often-inaccurate”, and so is every level of inaccuracy between these two points.
What is “coherent-but-often-inaccurate” though? The point is that in order to be coherent you actually have to do quite a lot of work, that basically requires you to understand what the involved terms mean to humans.
It seems we’ve already done the experiments on training for plausibility without leveraging generalization across tasks (e.g. generic specification gaming examples, or summarization with less careful human raters). So I’d put <1% on that possibility. This estimate seems consistent with the view that across-task-generalization plausibly works out of the box (though I also consider that <<50%).
I don’t really think we’ve done those experiments. I don’t know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.
I agree that it’s possible to have some plausibility condition which is insufficient to get good behavior. But that’s quite different from saying “And if you actually try to make it work it doesn’t work.”
For example, I do think you can keep adding coherence conditions until you reach the limit of “Actually looks coherent to a human no matter how they investigate it,” such that you are actually generalizing to new coherence checks. At that point honesty becomes much more plausible.
Zooming out a bit, I would summarize a few high-level threads as:
We both agree that the experiments described here primarily relate to optimism-about-generalization, rather than optimism-about-scalable-oversight.
I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
I am substantially more pessimistic about generalization of honesty OOD, whereas you think it is plausible (via some combination of default neural network generalization and dynamic/thorough coherence checks), and likely useful for at least certain classes of tasks.
These two differences translate pretty naturally into differences about both what we should expect in these types of experiments and how we should interpret different types of results (some caveats on this later).
We both agree these types of experiments would be very useful.
I think the two disagreements are probably broader threads, so I’m mostly curious whether this seems right to you. Will also explore a few individual points a bit more below:
> My model is that “if honesty doesn’t generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely”.
This is not clear to me (and it seems like we get to check).
Fair :)
I’m not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another—so if B is “supposed to be” a deterministic function of A, then consistency guarantees that B is good if A is good.
In this framing, the distinction is that implication is only one way. If B is the model’s claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.
I don’t think the model necessarily “knows” how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language.
Got it, thanks. This seems right—I agree that the finetuning (either via coherence, or via direct supervision) is providing some extra information about how to do this translation, and what questions about tone actually are asking about, and this information is not necessarily present just from across-task generalization.
For example, I do think you can keep adding coherence conditions until you reach the limit of “Actually looks coherent to a human no matter how they investigate it,” such that you are actually generalizing to new coherence checks. At that point honesty becomes much more plausible.
I see; this makes more sense to me. I guess this will be pretty closely related (equivalent?) to your optimism in previous discussions about being able to use adversarial training to eventually eliminate all malign failures, where we have the similar dynamic of generalizing to unseen testing strategies, and needing to eventually eliminate all failures. I think I’m pretty pessimistic about this approach, but happy to leave this in favor of seeing how empirics play out with experiments like these (or the Unrestricted Advx Contest).
ETA: I suppose a difference here is that with adv training, we are typically assuming we can recognize the failures, and the generalization is across multiple inputs. Whereas here we’re less concerned about generalization across inputs, and aren’t assuming we can recognize failures, and instead want to generalize across different coherence checks. The two feel similar in spirit, but maybe this is a significant enough difference that the analogy isn’t useful.
I don’t really think we’ve done those experiments. I don’t know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.
I agree that it’s possible to have some plausibility condition which is insufficient to get good behavior. But that’s quite different from saying “And if you actually try to make it work it doesn’t work.”
I think this is fair. I agree that nobody has “really tried” to make something like this work (in part this is because it is often possible to just hire/train better human raters, or re-scope the task so that you actually measure the thing you care about). I do think the wide variety of failures from researchers “sort of trying” points me against optimism-about-generalization. Overall, I do agree it’d be valuable to distill to a single project which really tests this.
A related intuition pump: I’m pretty pessimistic in cases where non-French speakers are supervising questions about French. I’m substantially more pessimistic in cases where humans are supervising questions about e.g. what “neuralese statements” (activations vectors) passed between neural networks mean, where e.g. we don’t have intuitions for how grammar structures should work, can’t rely on common Latin roots, can’t easily write down explicit rules governing the language, and generally have a harder time generating consistency checks. The French case is nicer for experiments because it’s easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.
I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
I’d still describe my optimistic take as “do imitative generalization.” But when you really dig into what that means it seems very closely connected to generalization: (i) the reason why “just use this neural net” isn’t a good hypothesis is that it generalizes poorly, (ii) for competitiveness reasons you still need to use hypotheses that look quite a lot like neural nets, (iii) so you really need to understand why the “neural net hypothesis” is bad.
In this framing, the distinction is that implication is only one way. If B is the model’s claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.
I think this was a miscommunication, trying again: in amplification we compute the answer from subanswers. A coherence check can ensure that subanswers and answers agree. If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct). The two training strategies are still different in a way that makes consistency checks seem worse (since you might end up tweaking the subanswers to be consistent with the answer, whereas amplification would only try to tweak the answer to be consistent with the subanswers) but it’s not clear if that’s a key distinction.
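A minimal sketch of that implication (my own toy, assuming the answer really is a deterministic, human-known function of the subanswers): amplification *computes* the answer from the subanswers, while a coherence check merely *verifies* that a proposed (subanswers, answer) pair agrees with that function, yet passing the check together with correct subanswers still forces a correct answer.

```python
# B = f(A): the top-level answer is a deterministic function of subanswers;
# here, a trivial conjunction stands in for the human-known function.
def amplify(subanswers):
    return all(subanswers)

# Coherence condition: the proposed pair must agree with f.
def passes_check(subanswers, answer):
    return answer == amplify(subanswers)

correct_subanswers = [True, True, False]

# Any proposed answer that passes the check with correct subanswers
# must coincide with the amplification answer, hence be correct.
for proposed in (True, False):
    if passes_check(correct_subanswers, proposed):
        assert proposed == amplify(correct_subanswers)
```

The asymmetry you mention shows up in how this toy is trained: amplification would only ever adjust the answer toward `amplify(subanswers)`, whereas optimizing the check alone could just as well adjust the subanswers to fit a wrong answer.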
The French case is nicer for experiments because it’s easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.
I’m comparably optimistic about the “neuralese” case and the French case, though there are a lot of other non-generalization difficulties in the neuralese case, and it’s not overall the kind of thing that I’m imagining working unless you happen to have “introspective” neural nets (so it isn’t part of the object-level safety program; it’s just part of what you’d do if your neural networks were thinking about neuralese rather than in neuralese).
Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as “direct oversight is not indefinitely scalable”? [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn’t a good framing.]
To check my understanding, you’re saying that rather than rely on “some combination of scalable oversight + generalization of honesty OOD” you’d rather use something like imitative generalization (where if we can surface knowledge from neural networks to humans, then we don’t need to rely on generalization of honesty OOD). Is this accurate?
If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct).
I agree with this argument. But it seems “if the answer is a deterministic [human-known] function of the subanswers” is a very strong condition, such that “(passes consistency check) + (subanswers are correct) ==> (answers are correct)” rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don’t uniquely define the right answer / there are (infinitely many) other subquestions we could have asked which aren’t there.
Not sure this point is too important though (I’d definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).
I’m comparably optimistic about the “neuralese” case and the French case
Got it, thanks! (I am slightly surprised, but happy to leave it here.)
2. Skepticism about optimism about generalization (of honesty, on out-of-distribution tasks)
I’d fairly strongly expect a negative result from these experiments (as a really rough number off the top of my head, >80%), and I’m also generally pessimistic about the broader optimism-about-generalization approach. For these experiments, this depends in large part on what’s considered a positive or negative result.
I agree that very plausibly:
The model will generalize to answering the OOD tasks with significantly non-zero accuracy.
The results will be impressive to many people (“Look how well this model generalizes to tasks it’s never directly trained on!”)
At the same time:
It seems very unlikely that counting on generalization alone would actually perform as well as directly supervising model outputs (e.g. on tone description tasks). Almost certainly, some fraction of the model responses are going to be “plausible looking but incorrect” (and this fraction will be larger than for the directly supervised model).
There is some ambiguity over whether this should count as a negative result. I think we’ve discussed in the past that the comparison should ideally be between the generalization-only model, and the model which is directly supervised on the tone task but “without acquiring any new information” (for some fuzzy notion here). But if we’re willing to say that this task is one where “the capabilities of the model do already generalize”, then it seems this would be a negative result.
And more broadly, I think most people would agree that if we had a model which output “plausible looking but incorrect” responses a substantial fraction of the time, it’d be irresponsible to use such a model for important societal-scale decisions. (I’d argue that this would also be “bad practice” in many smaller stakes tasks.)
I’m pretty wary about this, and we should hold such approaches to a high standard of proof. We already have a bunch of examples of what happens when you optimize for “plausible but not-necessarily-correct”: you end up with plausible but not-necessarily-correct outputs. (And so in general, I think we should be cautious about claiming any particular generalization feature will fix this, unless we have a principled reason for thinking so.)
I realize that the key point is that honesty might generalize from the tasks where we can directly supervise for correctness. But even then, it seems that if you think the basic generalization story won’t work, you shouldn’t expect training for plausibility to rescue it; we should expect the responses to become much more plausible, and somewhat (but not completely) more correct. So I’d be wary of whether training for plausibility would be masking the problem without addressing the root cause.
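To make the comparison above concrete, here is a minimal sketch of measuring accuracy and the “plausible but incorrect” fraction for a generalization-only model versus a directly supervised one. Everything here is a toy stand-in (the models, the tasks, and the judge functions are placeholders, not real systems or real human raters):

```python
# Compare a generalization-only model with a directly supervised model
# on held-out tone tasks, tracking the "plausible but incorrect" rate.
# All model/judge functions below are hypothetical toy stand-ins.

def evaluate(model, tasks, judge_plausible, judge_correct):
    """Return (accuracy, plausible_but_incorrect_rate) over held-out tasks."""
    n_correct = 0
    n_plausible_wrong = 0
    for question, reference in tasks:
        answer = model(question)
        correct = judge_correct(answer, reference)
        plausible = judge_plausible(question, answer)
        n_correct += correct
        n_plausible_wrong += (plausible and not correct)
    n = len(tasks)
    return n_correct / n, n_plausible_wrong / n

# Toy stand-ins so the sketch runs end to end.
tasks = [("tone of: 'Great, another meeting.'", "sarcastic"),
         ("tone of: 'I love this!'", "enthusiastic")]
generalization_only = lambda q: "sarcastic"   # answers OOD, sometimes wrong
supervised = lambda q: ("sarcastic" if "Great" in q else "enthusiastic")
judge_correct = lambda a, ref: a == ref
judge_plausible = lambda q, a: True           # everything "looks" plausible

acc_gen, pbw_gen = evaluate(generalization_only, tasks,
                            judge_plausible, judge_correct)
acc_sup, pbw_sup = evaluate(supervised, tasks,
                            judge_plausible, judge_correct)
```

The point of tracking the two numbers separately is that a high-accuracy generalization-only model can still have a non-trivial plausible-but-wrong fraction, which is exactly the failure mode at issue.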
That’s basically right, although I think the view is less plausible for “decisions” than for some kinds of reports. For example, it is more plausible that a mapping from symbols in an internal vocabulary to expressions in natural language would generalize than that correct decisions would generalize (or even than other forms of reasoning).
I think a realistic approach would need to use generalization in some situations (where we expect it to work) and then use the facts-that-generalize as an input into supervision. For example, if you were able to answer empirical questions about what’s happening right now, you could use those as an input into debate/amplification.
(This may also make it more clear why I’m interested in coherence conditions where you can’t supervise—in some sense “use the stuff that does generalize as an input into amplification” is quite similar to saying “impose a coherence condition amongst the stuff you can’t directly supervise.”)
“Optimism about scalable oversight” is what I’m usually thinking about, but it does seem to me that there are some cases where it is inadequate. You could hope to play a quantitative/empirical game of getting lots of useful work out of AI before this kind of approach breaks down, but I am interested in whether there’s a chance at going straight for an indefinitely scalable approach to alignment.
That seems right to me and that is a reasonable description of my view.
(I’d be curious to know if you don’t encounter a lot of optimism-about-generalization.)
For the purposes of “indefinitely scalable alignment approach” the relevant threshold is something quite ambitious like “reflects everything the system knows.”
I think there’s a further question about whether the model knows how to talk about these concepts / is able to express itself. This is another part of the motivation for plausibility constraints (if nothing else they teach the model the syntax, but then when combined with other knowledge about language the coherence conditions seem like they probably just pin down the meaning unambiguously).
Yes, I think the fact that people would agree with this is important for the demonstration to move anyone on the pessimistic side.
I’m not so pessimistic. There are lots of “bad” ways to answer questions, and as we add more and more stringent plausibility checks we eliminate more and more of them. Eventually we get down to the hard core of “stuff that would pass any number of plausibility checks.” Then the question is whether the truth is the easiest-to-learn way to pass all plausibility checks.
This seems much easier than getting the truth out of the gate. One way of poking at the intuition is that the initial goal is hitting an isolated point in a giant high-dimensional space of all question-answering policies. But after selecting for “passing all plausibility checks” it seems plausible that you are basically left with a set of isolated points and the question is just which of them you got.
(I think that literal picture is too optimistic, but the point is that in some sense you have to be doing a lot of work to look fully coherent and that space is a lot smaller and different from getting random stuff from a neural network off distribution.)
You may say “In that case the plausibility checks alone should work,” but I think that’s not so clear either: the set of points may still be quite big, and generalization was the main reason to expect an inductive bias towards truth without making strong claims about the pre-training (for GPT-3 in particular the pre-training bias is fairly likely to be adequate, but I’d be quite a bit more skeptical about other domains).
Overall I do think it’s >50% that if the whole thing works, one of the two pieces worked independently.
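The “isolated points” picture can be illustrated with a toy enumeration (purely an intuition pump, not a claim about real networks): start with every answer policy over a handful of yes/no questions, then count how many survive as hypothetical coherence constraints are added.

```python
from itertools import product

# Toy illustration of coherence checks shrinking a policy space.
# Questions are abstract labels; a "policy" assigns True/False to each.
questions = ["Q1", "Q2", "Q3", "Q4"]
policies = [dict(zip(questions, vals))
            for vals in product([True, False], repeat=len(questions))]

# Hypothetical coherence constraints: logical relations between answers
# that any fully coherent policy would have to satisfy.
constraints = [
    lambda p: p["Q2"] == p["Q1"],       # Q2 is a paraphrase of Q1
    lambda p: not p["Q3"] or p["Q1"],   # Q3 implies Q1
    lambda p: p["Q4"] != p["Q3"],       # Q4 is the negation of Q3
]

surviving = policies
for c in constraints:
    surviving = [p for p in surviving if c(p)]

print(len(policies), "->", len(surviving))  # prints: 16 -> 3
```

Each constraint cuts the space, and the survivors are a small discrete set rather than a continuum; the open question in the discussion above is whether, for real models, the truth is the easiest-to-learn member of that surviving set.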
Thanks for these thoughts. Mostly just responding to the bits with questions/disagreements, and skipping the parts I agree with:
I’m curious what factors point to a significant difference regarding generalization between “decisions” and “unsupervised translation”. Perhaps there is a more natural concept of “honesty” / “truth” for unsupervised translation, which makes it more likely. But this is very fuzzy to me, and I’m curious why (or if) it’s clear to you there is a big difference between the two cases.
My model is that “if honesty doesn’t generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely”. In other words, on a simplified model where honesty generalization goes from 0 to 1, it seems likely it is somewhat higher for translation tasks, but unlikely it is exactly 1.
IIUC, the analogy you’re drawing here is that amplification is playing a similar role to the coherence condition: even if we can’t supervise the model response in full, we can at least check that it is consistent with the inputs into amplification (I might be misunderstanding). These seem different though: in amplification, we are concerned about dishonesty in the inputs into amplification, whereas in these experiments, we’re concerned about dishonesty in the model’s outputs. I also feel substantially more optimistic about amplification in the scalable oversight story: we check for accuracy of amplification’s outputs assuming the inputs are accurate, and we separately check for accuracy of the inputs via the same recursion.
I often encounter a somewhat similar view from people optimistic about multi-agent approaches (e.g. we can train agents to be cooperative in a broad set of simulations, and transfer this to the real world). I am even more skeptical about this strong-optimism-about-generalization.
Among folks excited about direct RL, I think it is more common to say “we will be very careful about our optimization targets” and have a view more like optimism about scalable oversight.
(My assessment may also be biased. To the extent that people are optimistic about both, or have different ways of thinking about this, or haven’t thought too much about the indefinitely scalable case and so have difficulty articulating what their intuitions are, I am inclined to extend a “charitable” interpretation and assume their view is closer to optimism-about-oversight, because that’s the view I consider most plausible. If I pushed harder against optimism-about-oversight in conversations, I expect I’d get a more accurate picture.)
I don’t think I follow. Doesn’t the model already know syntax? If that plus the “other knowledge about language… pins down the meaning unambiguously”, it feels like basically all the work came from the “other knowledge about language”, and I’m not sure what the coherence conditions are adding.
Regarding optimism about generalization:
I think I share your general picture that it is very difficult to “get truth out of the gate”, but reach the opposite conclusion:
I am indeed skeptical that you’d get truth out of the gate (without training for plausibility). I think that training for plausibility wouldn’t fix this (though it would definitely make any remaining inaccuracies harder to spot). This is basically the same claim as above (and below); while it likely decreases the fraction of inaccurate responses, it’s unlikely to get rid of all of them.
The ideal way to overcome this difficulty is to directly select for accuracy (though I think we agree here).
Perhaps a core claim is that we are “basically left with a set of isolated points”. I don’t buy this claim (this disagreement is possibly also related to my second bullet above on “My model is that...”). It seems like the plausibility checks probably cut out some dimensions from the original high-dimensional space, but picking out “honest” behavior is still woefully underdefined for the model. For instance, it seems like “completely honest” is training-consistent, so is “coherent-but-often-inaccurate”, and so is every level of inaccuracy between these two points.
It seems we’ve already done the experiments on training for plausibility without leveraging generalization across tasks (e.g. generic specification gaming examples, or summarization with less careful human raters). So I’d put <1% on that possibility. This estimate seems consistent with the view that across-task-generalization plausibly works out of the box (though I also consider that <<50%).
For honest translation from your world-model to mine (or at least sufficiently small parts of it), there is a uniform intended behavior. But for decisions there isn’t any intended uniform behavior.
This is not clear to me (and it seems like we get to check).
I’m not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another—so if B is “supposed to be” a deterministic function of A, then consistency guarantees that B is good if A is good.
I don’t think the model necessarily “knows” how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language. The point of a plausibility/coherence condition is to provide enough constraint to pin those things down. You fully learn what kind of sentences we were looking for when we asked about tone, which might have just been totally non-obvious initially. And you learn a bunch of things about how different concepts about tone are supposed to relate to one another (in order to make sure all of your utterances are consistent).
What is “coherent-but-often-inaccurate” though? The point is that in order to be coherent you actually have to do quite a lot of work, that basically requires you to understand what the involved terms mean to humans.
I don’t really think we’ve done those experiments. I don’t know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.
I agree that it’s possible to have some plausibility condition which is insufficient to get good behavior. But that’s quite different from saying “And if you actually try to make it work it doesn’t work.”
For example, I do think you can keep adding coherence conditions until you reach the limit of “Actually looks coherent to a human no matter how they investigate it,” such that you are actually generalizing to new coherence checks. At that point honesty becomes much more plausible.
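As one concrete instance of the “ask multiple questions about the text and check agreement between them” idea, here is a minimal sketch of a rephrasing-agreement check. The `model` callables are toy dict-backed stubs standing in for a real question-answering system:

```python
# Minimal sketch of a coherence condition: ask several rephrasings of
# the same question and flag the model if its answers disagree.
# The "models" below are toy dict-backed stubs, not real systems.

def coherence_check(model, rephrasings):
    """Return True iff the model gives the same answer to every rephrasing."""
    answers = [model(q) for q in rephrasings]
    return len(set(answers)) == 1

rephrasings = ["Is the tone sarcastic?",
               "Would a reader hear sarcasm here?"]

coherent_stub = {"Is the tone sarcastic?": "yes",
                 "Would a reader hear sarcasm here?": "yes"}
incoherent_stub = {"Is the tone sarcastic?": "yes",
                   "Would a reader hear sarcasm here?": "no"}

print(coherence_check(coherent_stub.get, rephrasings))    # True
print(coherence_check(incoherent_stub.get, rephrasings))  # False
```

Note that passing such a check is necessary but not sufficient for honesty; the thread above is precisely about whether stacking many such checks gets close to sufficiency.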
Zooming out a bit, I would summarize a few high-level threads as:
We both agree that the experiments described here primarily relate to optimism-about-generalization, rather than optimism-about-scalable-oversight.
I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
I am substantially more pessimistic about generalization of honesty OOD, whereas you think it is plausible (via some combination of default neural network generalization and dynamic/thorough coherence checks), and likely useful for at least certain classes of tasks.
These two differences translate pretty naturally into differences about both what we should expect from these types of experiments and how we should interpret different types of results (some caveats on this later).
We both agree these types of experiments would be very useful.
I think the two disagreements are probably broader threads, so I’m mostly curious whether this seems right to you. Will also explore a few individual points a bit more below:
Fair :)
In this framing, the distinction is that implication is only one way. If B is the model’s claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.
Got it, thanks. This seems right—I agree that the finetuning (either via coherence, or via direct supervision) is providing some extra information about how to do this translation, and what questions about tone actually are asking about, and this information is not necessarily present just from across-task generalization.
I see; this makes more sense to me. I guess this will be pretty closely related (equivalent?) to your optimism in previous discussions about being able to use adversarial training to eventually eliminate all malign failures, where we have the similar dynamic of generalizing to unseen testing strategies, and needing to eventually eliminate all failures. I think I’m pretty pessimistic about this approach, but happy to leave this in favor of seeing how empirics play out with experiments like these (or the Unrestricted Advx Contest).
ETA: I suppose a difference here is that with adv training, we are typically assuming we can recognize the failures, and the generalization is across multiple inputs. Whereas here we’re less concerned about generalization across inputs, and aren’t assuming we can recognize failures, and instead want to generalize across different coherence checks. The two feel similar in spirit, but maybe this is a significant enough difference that the analogy isn’t useful.
I think this is fair. I agree that nobody has “really tried” to make something like this work (in part this is because it is often possible to just hire/train better human raters, or re-scope the task so that you actually measure the thing you care about). I do think the wide variety of failures from researchers “sort of trying” points me against optimism-about-generalization. Overall, I do agree it’d be valuable to distill to a single project which really tests this.
A related intuition pump: I’m pretty pessimistic in cases where non-French speakers are supervising questions about French. I’m substantially more pessimistic in cases where humans are supervising questions about e.g. what “neuralese statements” (activations vectors) passed between neural networks mean, where e.g. we don’t have intuitions for how grammar structures should work, can’t rely on common Latin roots, can’t easily write down explicit rules governing the language, and generally have a harder time generating consistency checks. The French case is nicer for experiments because it’s easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.
I’d still describe my optimistic take as “do imitative generalization.” But when you really dig into what that means it seems very closely connected to generalization: (i) the reason why “just use this neural net” isn’t a good hypothesis is that it generalizes poorly, (ii) for competitiveness reasons you still need to use hypotheses that look quite a lot like neural nets, (iii) so you really need to understand why the “neural net hypothesis” is bad.
I think this was a miscommunication, trying again: in amplification we compute the answer from subanswers. A coherence check can ensure that subanswers and answers agree. If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct). The two training strategies are still different in a way that makes consistency checks seem worse (since you might end up tweaking the subanswers to be consistent with the answer, whereas amplification would only try to tweak the answer to be consistent with the subanswers) but it’s not clear if that’s a key distinction.
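The deterministic-function argument can be written out as a tiny sketch (a toy example with `f = sum`; nothing here is specific to any real amplification setup): if the top-level answer must equal a known deterministic function of the subanswers, then passing the consistency check with correct subanswers forces a correct answer, and any incorrect answer is ruled out as inconsistent.

```python
# Sketch: when the answer is a known deterministic function f of the
# subanswers, (passes consistency check) + (subanswers are correct)
# implies (answer is correct). Toy choice: f = sum.

def consistent(subanswers, answer, f):
    """The coherence check: the answer must equal f(subanswers)."""
    return answer == f(subanswers)

f = sum
true_subanswers = [2, 3, 5]
true_answer = f(true_subanswers)  # the unique answer consistent with them

# A claimed answer that passes the check with correct subanswers is
# necessarily the correct one:
claimed_answer = 10
assert consistent(true_subanswers, claimed_answer, f)
assert claimed_answer == true_answer

# An incorrect answer fails the check:
assert not consistent(true_subanswers, 11, f)
```

The asymmetry noted above also shows up here: nothing in the check itself prevents training pressure from adjusting the subanswers to match a wrong answer, rather than adjusting the answer to match correct subanswers.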
I’m comparably optimistic about the “neuralese” case as the French case, though there are a lot of other non-generalization difficulties in the neuralese case, and it’s not overall the kind of thing that I’m imagining working unless you happen to have “introspective” neural nets (and therefore it isn’t part of the object-level safety program; it’s just part of what you’d do if your neural networks were thinking about neuralese rather than in neuralese).
Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as “direct oversight is not indefinitely scalable”? [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn’t a good framing.]
To check my understanding, you’re saying that rather than rely on “some combination of scalable oversight + generalization of honesty OOD” you’d rather use something like imitative generalization (where if we can surface knowledge from neural networks to humans, then we don’t need to rely on generalization of honesty OOD). Is this accurate?
I agree with this argument. But it seems “if the answer is a deterministic [human-known] function of the subanswers” is a very strong condition, such that “(passes consistency check) + (subanswers are correct) ==> (answers are correct)” rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don’t uniquely define the right answer / there are (infinitely many) other subquestions we could have asked which aren’t there.
Not sure this point is too important though (I’d definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).
Got it, thanks! (I am slightly surprised, but happy to leave it here.)