Hm.
It’s easy to come up with a crappy proxy feedback signal—just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure.
Now I don’t understand why this will obviously fail horribly, if your argument doesn’t apply to reward signals. How does human approval fail horribly when used in RL training?
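To make "fail horribly under sufficient optimization pressure" concrete, here is a minimal toy sketch, with everything in it invented for illustration: `true_value` and `approval_proxy` are made-up functions, and best-of-n selection stands in for RL optimization pressure against a learned approval signal. The point is only that a proxy which tracks the true objective over ordinary behaviour can still be driven somewhere bad once the optimizer searches hard enough to find where the two come apart.

```python
import numpy as np

def true_value(x):
    # The outcome we actually care about: behaviour near x = 1 is genuinely good.
    return -(x - 1.0) ** 2

def approval_proxy(x):
    # Invented stand-in for "human approval": agrees with true_value almost
    # everywhere, plus a narrow spurious spike near x = 6 that earns high
    # approval despite being bad by the true measure.
    return true_value(x) + 40.0 * np.exp(-20.0 * (x - 6.0) ** 2)

def best_of_n(n, rng):
    # Sample n candidate behaviours from a broad prior and keep the one the
    # proxy rates highest: a crude stand-in for optimizing a policy against
    # the proxy, where larger n means more optimization pressure.
    candidates = rng.normal(loc=0.0, scale=3.0, size=n)
    return float(candidates[np.argmax(approval_proxy(candidates))])

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    x = best_of_n(n, rng)
    print(f"n={n:>7,}  chosen x={x:6.2f}  "
          f"proxy={approval_proxy(x):7.2f}  true={true_value(x):7.2f}")
# Typical result: at small n the proxy-best candidate sits near x = 1, where
# proxy and true value agree; with enough samples the argmax almost surely
# lands on the spurious spike near x = 6, where approval is high but true
# value has cratered.
```

The same qualitative picture is what people point to with reward-model overoptimization in RLHF: weak optimization against human approval looks fine, and the divergence only shows up once the policy is optimized hard against the proxy.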