I am substantially more optimistic about scalable oversight, whereas you think that (eventually) we will need to rely on some combination of scalable oversight + generalization of honesty OOD.
I’d still describe my optimistic take as “do imitative generalization.” But when you really dig into what that means it seems very closely connected to generalization: (i) the reason why “just use this neural net” isn’t a good hypothesis is that it generalizes poorly, (ii) for competitiveness reasons you still need to use hypotheses that look quite a lot like neural nets, (iii) so you really need to understand why the “neural net hypothesis” is bad.
In this framing, the distinction is that the implication only runs one way. If B is the model’s claim about tone, and A is a consistency check, or another fact serving as input to some consistency check, then A being bad implies B is bad, but not necessarily the converse. Whereas in amplification/debate, we would actually check that all the subquestion answers + subsequent reasoning justify B.
I think this was a miscommunication; trying again: in amplification we compute the answer from subanswers. A coherence check can ensure that subanswers and answers agree. If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct). The two training strategies are still different in a way that makes consistency checks seem worse (since you might end up tweaking the subanswers to be consistent with the answer, whereas amplification would only try to tweak the answer to be consistent with the subanswers), but it’s not clear whether that’s a key distinction.
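To make that asymmetry concrete, here is a toy sketch (purely illustrative: combine, distance, amplification_target, consistency_loss, and toy_model are all made-up names, and nothing below is meant as an actual training setup):

```python
# Toy illustration (hypothetical): amplification recomputes the answer from
# the subanswers, while a consistency check only penalizes disagreement.

def combine(subanswers):
    # Stand-in for a human-known deterministic function from subanswers to answer.
    return sum(subanswers)

def distance(a, b):
    return (a - b) ** 2

def amplification_target(model, subquestions):
    """Amplification: the training target for the top-level answer is computed
    *from* the model's subanswers, so only the answer gets pushed around."""
    return combine([model(q) for q in subquestions])

def consistency_loss(model, question, subquestions):
    """Consistency check: penalize disagreement between the model's answer and
    combine(subanswers). Minimizing this is symmetric -- it can fix the answer,
    or it can distort the subanswers until they agree with a wrong answer."""
    return distance(model(question), combine([model(q) for q in subquestions]))

# A "model" that gets the subanswers right but the top-level answer wrong:
toy_model = {"2+3?": 5, "4?": 4, "2+3+4?": 10}.get
print(amplification_target(toy_model, ["2+3?", "4?"]))        # 9: the corrected answer
print(consistency_loss(toy_model, "2+3+4?", ["2+3?", "4?"]))  # 1: could be driven to 0 either way
```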
The French case is nicer for experiments because it’s easier for us to check ground truth for evaluation, but hopefully the example also conveys some intuition for my pessimism in a substantially harder case.
I’m comparably optimistic about the “neuralese” case and the French case, though there are a lot of other non-generalization difficulties in the neuralese case, and it’s not overall the kind of thing that I’m imagining working unless you happen to have “introspective” neural nets (and therefore isn’t part of the object-level safety program; it’s just part of what you’d do if your neural networks were thinking about neuralese rather than in neuralese).
Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as “direct oversight is not indefinitely scalable”? [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn’t a good framing.]
To check my understanding, you’re saying that rather than rely on “some combination of scalable oversight + generalization of honesty OOD” you’d rather use something like imitative generalization (where if we can surface knowledge from neural networks to humans, then we don’t need to rely on generalization of honesty OOD). Is this accurate?
If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct).
I agree with this argument. But it seems like “if the answer is a deterministic [human-known] function of the subanswers” is a very strong condition, such that “(passes consistency check) + (subanswers are correct) ==> (answers are correct)” rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don’t uniquely define the right answer / there are (infinitely many) other subquestions we could have asked that aren’t there.
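For instance (a made-up toy, not a case from this thread; the subquestions and the consistent check below are hypothetical): two correct, mutually consistent subanswers can leave the top-level answer undetermined, so passing the check guarantees nothing.

```python
# Hypothetical toy: correct subanswers that don't pin down the answer.
# Top-level question: "Is this three-paragraph report accurate overall?"
subanswers = {
    "Is paragraph 1 accurate?": True,
    "Is paragraph 2 accurate?": True,
    # Paragraph 3 was never asked about.
}

def consistent(overall_answer, subanswers):
    # The only check a human can state from these subanswers alone: an
    # accurate report can't contain a paragraph already judged inaccurate.
    return all(subanswers.values()) if overall_answer else True

# Both top-level answers pass the check, even though the subanswers are correct:
print(consistent(True, subanswers), consistent(False, subanswers))  # True True
```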
Not sure this point is too important though (I’d definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).
I’m comparably optimistic about the “neuralese” case and the French case
Got it, thanks! (I am slightly surprised, but happy to leave it here.)