If you perfectly learn and perfectly maximize the referent of the rewards assigned by human operators, that kills them. It's a fact about the territory, not the map (about the environment, not the optimizer) that the best predictive explanation of human answers is one that predicts the systematic errors in our responses, and is therefore a psychological concept that correctly predicts the higher scores humans would assign to the cases that produce those errors.
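To make that concrete, here is a minimal toy sketch; the features, rater bias, and numbers are illustrative assumptions, not anything from a real rater study. A linear reward model is fit to ratings from raters with a systematic bias toward how an outcome looks rather than how good it actually is; the best fit recovers the bias, and then scores a worthless-but-flashy case above a genuinely good one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy outcomes with two features:
#   column 0 = how good the outcome actually is for the humans ("true value")
#   column 1 = how good it merely looks to a hurried rater ("appearance")
n = 1000
outcomes = rng.uniform(0.0, 1.0, size=(n, 2))

# Systematic, predictable rater error: ratings lean mostly on appearance,
# only a little on true value, plus small noise.
ratings = 0.2 * outcomes[:, 0] + 0.8 * outcomes[:, 1] + rng.normal(0.0, 0.05, n)

# Fit a linear reward model to the ratings (least squares, with an intercept).
X = np.column_stack([outcomes, np.ones(n)])
weights, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print("learned weights (true value, appearance, intercept):", np.round(weights, 2))
# The best fit recovers the raters' bias (~0.2, ~0.8), not the intended value.

# Score two hand-picked cases under the learned reward:
good_but_plain   = np.array([1.0, 0.6, 1.0])  # high true value, modest appearance
worthless_flashy = np.array([0.0, 1.0, 1.0])  # zero true value, maximal appearance
print("learned reward, genuinely good outcome :", round(good_but_plain @ weights, 2))
print("learned reward, error-exploiting outcome:", round(worthless_flashy @ weights, 2))
# The error-exploiting case scores higher, so an optimizer aimed at the learned
# reward preferentially seeks out exactly such cases.
```

Nothing in the fit ever sees "what the humans actually wanted" as a target; the lowest-error explanation of the ratings just is the raters' bias, which is the sense in which the problem lives in the environment rather than in the optimizer.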
I see how the concept learned from the reward signal is not exactly the concept that you wanted, but why is that concept lethal?
It seems like the feedback from the human raters would put a very low score on any scenario that involves all humans being killed.