Hm, I think I get the issue you’re pointing at. The argument for the evaluator learning accurate human preferences in this proposal is that it can draw on an effectively unlimited supply of negative examples: outputs produced by the agent that reflect inaccurate human preferences. However, the argument against can be summed up in the following comment of Adam’s:
I get the impression that with Oversight Leagues, you don’t necessarily consider the possibility that there might be many different “limits” of the oversight process, that are coherent with the initial examples. And it’s not clear you have an argument that it’s going to pick one that we actually want.
Or in your terms:
Not just any model will do
I’m indeed not sure whether the agent’s pressure would force the evaluator all the way to accurate human preferences. The fact that GANs get significantly closer to the illegible distributions they model, and away from random noise, while following a legible objective feels like evidence for; the fact that they still produce artifacts feels like evidence against. I’m also not sure how GANs compare to purely generative models trained on the positive examples alone (e.g. VAEs); that comparison would be useful data on whether the adversarial regime actually helps point at the underlying distribution.
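To make the GAN analogy concrete, here is a minimal sketch of the kind of adversarial loop I have in mind: the evaluator plays the discriminator, trained on positive examples from a stand-in "target" distribution and on negative examples produced by the agent, while the agent is trained to fool the evaluator. Everything specific here (toy dimensions, toy data, optimizers) is an illustrative assumption, not part of the Oversight Leagues proposal itself.

```python
# Illustrative GAN-style loop: evaluator = discriminator, agent = generator.
# All shapes, data, and hyperparameters are toy placeholders.
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2

agent = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
evaluator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_agent = torch.optim.Adam(agent.parameters(), lr=1e-3)
opt_eval = torch.optim.Adam(evaluator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def sample_positive(batch):
    # Stand-in for the illegible target distribution (e.g. accurate human preferences).
    return torch.randn(batch, data_dim) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(1000):
    # Evaluator update: positives from the reference distribution,
    # an unlimited stream of negatives generated by the agent.
    pos = sample_positive(64)
    neg = agent(torch.randn(64, latent_dim)).detach()
    loss_eval = bce(evaluator(pos), torch.ones(64, 1)) + bce(evaluator(neg), torch.zeros(64, 1))
    opt_eval.zero_grad(); loss_eval.backward(); opt_eval.step()

    # Agent update: produce outputs the evaluator scores as positive.
    fake = agent(torch.randn(64, latent_dim))
    loss_agent = bce(evaluator(fake), torch.ones(64, 1))
    opt_agent.zero_grad(); loss_agent.backward(); opt_agent.step()
```

The open question above is whether this kind of pressure converges on the target distribution itself, or merely on one of many distributions consistent with the evaluator's finite positive data.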