paulfchristiano comments on Teaching ML to answer questions honestly instead of predicting human answers

paulfchristiano 1 Jun 2021 23:38 UTC
LW: 3 AF: 3
I think they need to be exactly equal. This is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy, since the instrumental policy will do the right thing when the simple labels are wrong.
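To make the "only pass judgment on a slam dunk" idea concrete, here is a rough sketch. It is purely illustrative and not taken from the post: the numeric scores, the `margin` threshold, and the function names `slam_dunk_judgment` and `build_training_labels` are all hypothetical stand-ins for however a labeler's confidence might be represented. The point is just that ambiguous comparisons produce no label at all, rather than a possibly wrong one.

```python
from typing import List, Optional, Tuple


def slam_dunk_judgment(score_a: float, score_b: float,
                       margin: float = 0.9) -> Optional[str]:
    """Return "A" or "B" only when one answer wins by at least `margin`.

    The scores and the margin are hypothetical: they stand in for the
    labeler's confidence in each answer. Anything short of a decisive
    gap yields None, meaning "abstain".
    """
    diff = score_a - score_b
    if diff >= margin:
        return "A"
    if diff <= -margin:
        return "B"
    return None  # not a slam dunk, so pass no judgment


def build_training_labels(
    pairs: List[Tuple[str, float, float]], margin: float = 0.9
) -> List[Tuple[str, str]]:
    """Keep only the comparisons where the judgment was decisive."""
    labeled = []
    for name, score_a, score_b in pairs:
        verdict = slam_dunk_judgment(score_a, score_b, margin)
        if verdict is not None:
            labeled.append((name, verdict))
    return labeled


if __name__ == "__main__":
    # Hypothetical comparisons: (question id, score for A, score for B).
    pairs = [("q1", 1.0, 0.0), ("q2", 0.55, 0.5), ("q3", 0.05, 0.99)]
    print(build_training_labels(pairs))
```

Under these assumptions, the close comparison `q2` is dropped entirely, so it can never contribute a wrong label that the instrumental policy could exploit.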