Most importantly, the success of the scheme relies on the correctness of the prior over helper models (or else the helper could just be another copy of GPT-Klingon).
I’m not sure I understand this. My reading of the worry: what if there’s some equilibrium where the model gives wrong explanations of meanings, but I can’t tell, because I’m relying on that same model to give me the meanings?
But it seems to me that having the human in the loop doing prediction helps a lot, even with the same prior: if the meanings are wrong, the human simply won’t predict the correct next word, and that failure is itself evidence against the helper. Maybe that isn’t enough corrective data, though? A toy sketch of what I have in mind is below.
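To make that intuition concrete, here’s a minimal toy sketch. Everything in it is invented for illustration (the tiny vocabulary, the “semantic regularity” the human is assumed to know, the honest vs. lying helpers); it isn’t part of the original proposal. The point is just that wrong explanations show up as a measurable drop in the human’s next-word prediction accuracy, which is the corrective signal I’m gesturing at.

```python
# Toy illustration: a helper explains the meanings of "GPT-Klingon" tokens,
# and a simulated human predicts the next token using only those explanations.
# If the explanations are wrong, prediction accuracy drops.

# True meanings of a tiny invented vocabulary.
TRUE_MEANING = {"qapla": "success", "yIn": "life", "ghoS": "approach", "Hegh": "death"}
# A semantic regularity the human already understands: which meaning follows which.
NEXT_MEANING = {"success": "life", "life": "approach", "approach": "success", "death": "success"}


def sample_corpus(length=300):
    """Generate text according to the true meanings."""
    token = "qapla"
    corpus = [token]
    for _ in range(length):
        next_m = NEXT_MEANING[TRUE_MEANING[token]]
        token = next(t for t, m in TRUE_MEANING.items() if m == next_m)
        corpus.append(token)
    return corpus


def human_predict(prev_token, explanations):
    """The human sees only the helper's explanation, reasons about meanings,
    and guesses the token whose *explained* meaning should come next."""
    next_m = NEXT_MEANING[explanations[prev_token]]
    return next(t for t, m in explanations.items() if m == next_m)


def accuracy(explanations, corpus):
    hits = sum(human_predict(a, explanations) == b for a, b in zip(corpus, corpus[1:]))
    return hits / (len(corpus) - 1)


corpus = sample_corpus()
honest = dict(TRUE_MEANING)
lying = dict(TRUE_MEANING, yIn="death", Hegh="life")  # a "wrong equilibrium" helper

print("prediction accuracy with honest helper:", accuracy(honest, corpus))  # 1.0
print("prediction accuracy with lying helper: ", accuracy(lying, corpus))   # ~0.33
```

In this toy setup the lying helper is internally consistent, yet the human’s predictions still fail, so the wrongness is detectable without ever asking the model to grade itself. Whether that signal is rich enough in the real setting is exactly my question.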