Zooming out a bit, I would summarize a few high-level threads as:
- We both agree that the experiments described here primarily relate to optimism-about-generalization, rather than optimism-about-scalable-oversight.
- I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
- I am substantially more pessimistic about generalization of honesty OOD, whereas you think it is plausible (via some combination of default neural network generalization and dynamic/thorough coherence checks), and likely useful for at least certain classes of tasks.
- These two differences translate pretty naturally into differences about both what we should expect in these types of experiments, and what we should interpret from different types of results (some caveats on this later).
- We both agree these types of experiments would be very useful.
I think the two disagreements are probably broader threads, so I’m mostly curious whether this seems right to you. Will also explore a few individual points a bit more below:
> My model is that “if honesty doesn’t generalize in general, then it may generalize somewhat better within translation tasks, but is unlikely to generalize completely”.
> This is not clear to me (and it seems like we get to check).
Fair :)
> I’m not sure I understand the distinction here. Suppose that amplification lets us compute B from A, so that B is good if A is good (i.e. so that the learned model will end up answering top-level questions well if it answers leaves well). Whereas a coherence condition ensures that A and B are consistent with one another, so if B is “supposed to be” a deterministic function of A, then consistency guarantees that B is good if A is good.
In this framing, the distinction is that implication is only one way. If B is the model’s claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.
> I don’t think the model necessarily “knows” how to talk about tone (given only a question about tone in natural language with no examples), nor how to translate between its internal beliefs about tone and a description in natural language.
Got it, thanks. This seems right: I agree that the finetuning (either via coherence, or via direct supervision) is providing some extra information about how to do this translation, and about what questions about tone are actually asking, and this information is not necessarily present just from across-task generalization.
> For example, I do think you can keep adding coherence conditions until you reach the limit of “Actually looks coherent to a human no matter how they investigate it,” such that you are actually generalizing to new coherence checks. At that point honesty becomes much more plausible.
I see; this makes more sense to me. I guess this will be pretty closely related (equivalent?) to your optimism in previous discussions about being able to use adversarial training to eventually eliminate all malign failures, where we have a similar dynamic of generalizing to unseen testing strategies and needing to eventually eliminate all failures. I think I’m pretty pessimistic about this approach, but happy to leave this in favor of seeing how empirics play out with experiments like these (or the Unrestricted Advx Contest).
ETA: I suppose a difference here is that with adv training, we are typically assuming we can recognize the failures, and the generalization is across multiple inputs. Whereas here we’re less concerned about generalization across inputs, and aren’t assuming we can recognize failures, and instead want to generalize across different coherence checks. The two feel similar in spirit, but maybe this is a significant enough difference that the analogy isn’t useful.
> I don’t really think we’ve done those experiments. I don’t know what specification gaming examples you have in mind. And less careful raters are also less careful about coherence conditions. Not to mention that no effort was made to e.g. ask multiple questions about the text and check agreement between them.
> I agree that it’s possible to have some plausibility condition which is insufficient to get good behavior. But that’s quite different from saying “And if you actually try to make it work it doesn’t work.”
I think this is fair. I agree that nobody has “really tried” to make something like this work (in part this is because it is often possible to just hire/train better human raters, or re-scope the task so that you actually measure the thing you care about). I do think the wide variety of failures from researchers “sort of trying” points me against optimism-about-generalization. Overall, I do agree it’d be valuable to distill to a single project which really tests this.
A related intuition pump: I’m pretty pessimistic in cases where non-French speakers are supervising questions about French. I’m substantially more pessimistic in cases where humans are supervising questions about e.g. what “neuralese statements” (activation vectors) passed between neural networks mean, where e.g. we don’t have intuitions for how grammar structures should work, can’t rely on common Latin roots, can’t easily write down explicit rules governing the language, and generally have a harder time generating consistency checks. The French case is nicer for experiments because it’s easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.
> I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
I’d still describe my optimistic take as “do imitative generalization.” But when you really dig into what that means it seems very closely connected to generalization: (i) the reason why “just use this neural net” isn’t a good hypothesis is that it generalizes poorly, (ii) for competitiveness reasons you still need to use hypotheses that look quite a lot like neural nets, (iii) so you really need to understand why the “neural net hypothesis” is bad.
> In this framing, the distinction is that implication is only one way. If B is the model’s claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse of this. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning actually justify B.
I think this was a miscommunication, trying again: in amplification we compute the answer from subanswers. A coherence check can ensure that subanswers and answers agree. If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct). The two training strategies are still different in a way that makes consistency checks seem worse (since you might end up tweaking the subanswers to be consistent with the answer, whereas amplification would only try to tweak the answer to be consistent with the subanswers) but it’s not clear if that’s a key distinction.
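A minimal sketch of this argument in a toy setting. Everything here is an assumption made for illustration: boolean subanswers, a hypothetical human-known rule `combine` (majority vote), and made-up ground-truth subanswers. It enumerates every (subanswers, answer) pair that passes the consistency check while the answer is wrong, and confirms that each such pair has at least one wrong subanswer.

```python
from itertools import product


def combine(subanswers):
    """Hypothetical human-known rule: the answer is the majority vote of the subanswers."""
    return sum(subanswers) > len(subanswers) / 2


def passes_consistency_check(subanswers, answer):
    """The check only compares the claimed answer with the rule's output."""
    return combine(subanswers) == answer


def consistent_but_wrong(true_subanswers):
    """Every (subanswers, answer) pair that passes the check with a wrong answer."""
    true_answer = combine(true_subanswers)
    return [
        (subs, ans)
        for subs in product([False, True], repeat=len(true_subanswers))
        for ans in (False, True)
        if passes_consistency_check(subs, ans) and ans != true_answer
    ]


truth = (True, True, False)  # made-up ground-truth subanswers
wrong = consistent_but_wrong(truth)
# Every consistent-but-wrong pair has at least one wrong subanswer.
assert all(subs != truth for subs, _ in wrong)
```

The pairs collected in `wrong` correspond to the worry about the training dynamics: consistency can be restored by moving the subanswers toward a wrong answer, which the check alone cannot distinguish from moving the answer toward correct subanswers.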
> The French case is nicer for experiments because it’s easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.
I’m comparably optimistic about the “neuralese” case as the French case, though there are a lot of other non-generalization difficulties in the neuralese case and it’s not overall the kind of thing that I’m imagining working unless you happen to have “introspective” neural nets (and therefore isn’t part of the object-level safety program, it’s just part of what you’d do if your neural networks were thinking about neuralese rather than in neuralese).
Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as “direct oversight is not indefinitely scalable”? [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn’t a good framing.]
To check my understanding, you’re saying that rather than rely on “some combination of scalable oversight + generalization of honesty OOD” you’d rather use something like imitative generalization (where if we can surface knowledge from neural networks to humans, then we don’t need to rely on generalization of honesty OOD). Is this accurate?
> If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct).
I agree with this argument. But it seems “if the answer is a deterministic [human-known] function of the subanswers” is a very strong condition, such that “(passes consistency check) + (subanswers are correct) ==> (answers are correct)” rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don’t uniquely define the right answer / there are (infinitely many) other subquestions we could have asked which aren’t there.
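A toy sketch of the underdetermined case, under assumptions that are purely illustrative: the only human-known constraint is a hypothetical one-directional rule ("a False subanswer rules out a True answer"), and nothing is known about the answer when all subanswers are True.

```python
def passes_partial_check(subanswers, answer):
    """Hypothetical partial rule: any False subanswer rules out a True answer.

    Nothing else is assumed known, so all-True subanswers leave the answer open.
    """
    if not all(subanswers):
        return not answer
    return True


def answers_passing(subanswers):
    """All candidate answers that the partial check accepts for these subanswers."""
    return [a for a in (False, True) if passes_partial_check(subanswers, a)]


pinned_down = answers_passing((True, False))
left_open = answers_passing((True, True))
```

With these made-up inputs, `(True, False)` leaves only `False`, while `(True, True)` passes the check for both answers, so correct subanswers plus a passed check no longer guarantee a correct answer.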
Not sure this point is too important though (I’d definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).
> I’m comparably optimistic about the “neuralese” case as the French case
Got it, thanks! (I am slightly surprised, but happy to leave it here.)