Is it fair to describe the reason you view imitative generalization as necessary at all (instead of just debate/amplification) as “direct oversight is not indefinitely scalable”? [ETA: It seems equally/more valid to frame imitative generalization as a way of scaling direct oversight to handle inaccessible info, so this isn’t a good framing.]
To check my understanding, you’re saying that rather than rely on “some combination of scalable oversight + generalization of honesty OOD” you’d rather use something like imitative generalization (where if we can surface knowledge from neural networks to humans, then we don’t need to rely on generalization of honesty OOD). Is this accurate?
If the answer is a deterministic function of the subanswers, then the consistency check can rule most possible combinations out as inconsistent, and therefore: (passes consistency check) + (subanswers are correct) ==> (answers are correct).
I agree with this argument. But it seems that “if the answer is a deterministic [human-known] function of the subanswers” is a very strong condition, such that “(passes consistency check) + (subanswers are correct) ==> (answers are correct)” rarely holds in practice. Maybe the most common case is that we have some subanswers, but they don’t uniquely determine the right answer / there are (infinitely many) other subquestions we could have asked which aren’t included.
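To make that distinction concrete, here’s a toy sketch (in Python, with made-up names; just an illustration, not anything from the original discussion). When the answer really is a known deterministic function of the subanswers, passing the check plus correct subanswers does force the answer; when the subanswers underdetermine the answer, many wrong answers pass too.

```python
# Toy illustration (hypothetical names) of the consistency-check argument above.

def consistency_check(answer, subanswers, f):
    """Pass iff the answer equals the human-known deterministic function of the subanswers."""
    return answer == f(subanswers)

# Case 1: the answer IS a deterministic, human-known function of the subanswers.
# If the subanswers are correct and the check passes, the answer must be correct.
correct_subanswers = [2, 3, 5]
assert consistency_check(10, correct_subanswers, sum)        # correct answer passes
assert not consistency_check(11, correct_subanswers, sum)    # wrong answers are ruled out

# Case 2 (the worry above): the subanswers underdetermine the answer, so a weaker
# check lets infinitely many answers through, and passing no longer implies correctness.
def weak_check(answer, subanswers):
    return answer >= sum(subanswers)

assert weak_check(10, correct_subanswers)         # the right answer passes...
assert weak_check(1_000_000, correct_subanswers)  # ...but so do arbitrarily wrong ones
```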
Not sure this point is too important though (I’d definitely want to pick it up again if I wanted to push on a direction relying on something like generalization-of-honesty).
I’m comparably optimistic about the “neuralese” case as about the French case.
Got it, thanks! (I am slightly surprised, but happy to leave it here.)