I think “deferring” was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you’d just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.
I said “defer” because one of the data sources the complex labeling process uses may be “what a human who was in the room said,” and this may sometimes be a really important source of evidence. But that really depends on how you set things up; if you have enough other signals then you would basically always just ignore that one.
(That said, I think probably amplification is the most important difference between the simple and complex labeling processes, because that’s the only scalable way to inject meaningful amounts of extra complexity into the complex labeling process: since the ML system can’t predict itself very well, amplification forces it to basically try to win a multiplayer game with copies of itself, and we hope that’s more complicated. And if that’s the case then the simple labeling process may as well use all of the data sources, and the difference is just how complex a judgment we are making using those inputs.)
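[Editor’s note: a minimal sketch of the agreement-filtering setup described in the comment above, for concreteness. The names `simple_label`, `complex_label`, `is_correct`, and `build_dataset` are illustrative stand-ins of mine, not anything specified in the thread.]

```python
def is_correct(simple, complex_):
    # The (simple, complex) pair only counts as a usable datapoint
    # when the two labels are equal.
    return simple == complex_

def build_dataset(inputs, simple_label, complex_label):
    """Label each input with both processes; keep only agreed-upon datapoints."""
    dataset = []
    for x in inputs:
        simple = simple_label(x)      # cheap, simple labeling process
        complex_ = complex_label(x)   # expensive / amplified labeling process
        # Always produce the (simple, complex) pair; include the datapoint
        # in training data only when the two processes agree.
        if is_correct(simple, complex_):
            dataset.append((x, simple))
    return dataset
```

On this reading, disagreements are simply dropped rather than resolved, so the model never trains on cases where the two processes pull apart.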
Ok, that all makes sense, thanks.
...and is-correct basically just tests whether they are equal.
So here “equal” would presumably be “essentially equal in the judgement of the complex process”, rather than verbatim equality of labels (the latter seems silly to me; if it’s not silly I must be missing something).
I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).
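[Editor’s note: one way to picture the “slam dunk” pairwise-judgment idea, under stated assumptions. The threshold value and the names `judge`, `pairwise_label`, and `labels_agree` are illustrative assumptions of mine, not details from the thread.]

```python
SLAM_DUNK_THRESHOLD = 0.99  # hypothetical cutoff for an "overwhelmingly clear" comparison

def pairwise_label(a, b, judge):
    """Compare two candidate outputs; return "a", "b", or None (abstain).

    `judge(a, b)` is assumed to return the probability that `a` is better than `b`.
    """
    p_a_better = judge(a, b)
    if p_a_better >= SLAM_DUNK_THRESHOLD:
        return "a"
    if p_a_better <= 1 - SLAM_DUNK_THRESHOLD:
        return "b"
    return None  # not a slam dunk: don't pass judgment on this pair

def labels_agree(simple_verdict, complex_verdict):
    # Exact equality of verdicts; abstentions never count as agreement.
    return simple_verdict is not None and simple_verdict == complex_verdict
```

Abstaining on anything short of a slam dunk is what keeps the simple labels from being “wrong but included,” which is the situation where the instrumental policy would pull ahead of the intended policy.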