I think the way I’d fit that into my ontology is “the reward signal is not the relevant feedback signal (for purposes of this argument)”. The relevant feedback signal is whatever some human looks at, at the end of the day, to notice when there are problems or to tell how well the AI is doing by the human’s standards. It’s how we (human designers/operators) notice the problems on which to iterate. It’s whatever the designer is implicitly optimizing for, in the long run, by developing an AI via the particular process the designer is using.
If the human is just using the reward signal as a control interface for steering the AI’s internals, then the reward signal is not the feedback signal to which this argument applies.
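To make the distinction concrete, here is a toy sketch (purely illustrative; the agent, the reward functions, and the designer’s evaluation below are all made up for this example). The reward signal is whatever gets fed into the update rule to steer the AI’s internals; the feedback signal, in the sense I mean here, is whatever the designer looks at afterwards to decide whether the process worked and how to tweak it.

```python
# Toy sketch, purely illustrative. train_agent, designer_evaluation, and the
# reward functions below are invented for this example.
import random

def train_agent(reward_fn, steps=2000):
    """Hill-climb a single 'policy parameter' on the reward signal.
    Here the reward signal acts as a control interface: it directly steers the agent's internals."""
    theta = 0.0
    for _ in range(steps):
        candidate = theta + random.gauss(0, 0.1)
        if reward_fn(candidate) > reward_fn(theta):
            theta = candidate
    return theta

def designer_evaluation(theta):
    """Stand-in for whatever the human looks at, at the end of the day,
    to judge the trained agent by the human's own standards."""
    return -abs(theta - 1.0)

# The designer iterates on the *process* (here, the choice of reward function)
# based on the evaluation; that evaluation is the feedback signal this argument is about.
for reward_fn in (lambda t: -abs(t - 0.3), lambda t: -abs(t - 1.0)):
    print(designer_evaluation(train_agent(reward_fn)))
```

The number printed at the end, not the reward signal inside the training loop, is the thing I’m calling the feedback signal here.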
We discussed more in person. I ended up agreeing with (what I perceive to be) a substantially different claim than the one I read in your original comment. I agree that we can’t just figure out alignment by black-boxing AI cognition and seeing whether the AI does good things or not, nor can we just set up feedback loops on that (e.g. train a succession of agents and tweak the process based on how aligned they seem) without some substantial theoretical underpinnings with which to interpret the evidence.
However, I still don’t see how your original comment is a reasonable way to communicate this state of mind. For example, you wrote:
It’s easy to come up with a crappy proxy feedback signal—just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure.
What does this mean, if not using human approval as a reward signal? Can you briefly step me through a fictional scenario where the described failure obtains?
Hm.
Now I don’t understand why this will obviously fail horribly, if your argument doesn’t apply to reward signals. How does human approval fail horribly when used in RL training?
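To be concrete about the setup I’m asking about, here is a toy sketch of “human approval used as the reward signal in RL training” (purely illustrative; the rater function and the toy policy are stand-ins, not anyone’s actual proposal):

```python
# Toy sketch, purely illustrative. human_approval is a stand-in for an actual rater,
# and the "policy" is just a table of preference scores over three canned actions.
import random

def human_approval(action):
    """Stand-in for a human rater scoring the action (a proxy for what the human really wants)."""
    return 1.0 if action == "looks_good_to_rater" else 0.0

actions = ["looks_good_to_rater", "actually_good", "bad"]
preferences = {a: 0.0 for a in actions}

for _ in range(1000):
    action = random.choice(actions)          # explore
    reward = human_approval(action)          # the approval score *is* the reward
    preferences[action] += 0.1 * (reward - preferences[action])  # nudge the policy toward reward

print(max(preferences, key=preferences.get))  # ends up favoring whatever the rater approves of
```

That is roughly the shape of setup I have in mind when I ask the question above.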