Hm.
It’s easy to come up with a crappy proxy feedback signal—just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure.
Now I don’t understand why this will obviously fail horribly, if your argument doesn’t apply to reward signals. How does human approval fail horribly when used in RL training?
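To make "fail horribly under sufficient optimization pressure" concrete, here is a minimal toy sketch, with everything in it invented for illustration: `true_value` and `approval_proxy` are made-up functions, and best-of-n selection stands in for RL optimization pressure against a learned approval signal. The point is only that a proxy which tracks the true objective over ordinary behaviour can still be driven somewhere bad once the optimizer searches hard enough to find where the two come apart.

```python
import numpy as np

def true_value(x):
    # The outcome we actually care about: behaviour near x = 1 is genuinely good.
    return -(x - 1.0) ** 2

def approval_proxy(x):
    # Invented stand-in for "human approval": agrees with true_value almost
    # everywhere, plus a narrow spurious spike near x = 6 that earns high
    # approval despite being bad by the true measure.
    return true_value(x) + 40.0 * np.exp(-20.0 * (x - 6.0) ** 2)

def best_of_n(n, rng):
    # Sample n candidate behaviours from a broad prior and keep the one the
    # proxy rates highest: a crude stand-in for optimizing a policy against
    # the proxy, where larger n means more optimization pressure.
    candidates = rng.normal(loc=0.0, scale=3.0, size=n)
    return float(candidates[np.argmax(approval_proxy(candidates))])

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    x = best_of_n(n, rng)
    print(f"n={n:>7,}  chosen x={x:6.2f}  "
          f"proxy={approval_proxy(x):7.2f}  true={true_value(x):7.2f}")
# Typical result: at small n the proxy-best candidate sits near x = 1, where
# proxy and true value agree; with enough samples the argmax almost surely
# lands on the spurious spike near x = 6, where approval is high but true
# value has cratered.
```

The same qualitative picture is what people point to with reward-model overoptimization in RLHF: weak optimization against human approval looks fine, and the divergence only shows up once the policy is optimized hard against the proxy.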