The agent _knows_ how it’s going to update based on the action it takes.
Yep, that’s a key part of the problem. We want to design the AI to update according to what the human says; but what the human says is not a variable out there in the world that the AI discovers, it’s something the AI can rig or influence through its own actions.
Can we simply make sure that the agent selects its action according to its current estimate of the reward function?
This estimate depends on the agent’s own actions (again, this is the heart of the problem).
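To make the circularity concrete, here is a minimal toy sketch, not from the discussion itself: the action names ("ask_human", "rig_feedback") and the numbers are purely hypothetical. The agent scores each action under the reward estimate it predicts it will hold *after* taking that action, so an action that rigs the human's feedback can outscore honestly asking.

```python
# Hypothetical toy example: the agent's reward estimate is itself a function
# of the action, so optimizing "the current estimate" rewards manipulating it.

CANDIDATE_ACTIONS = ["ask_human", "rig_feedback"]

def predicted_estimate_after(action):
    """Reward estimate the agent predicts it will hold once it sees the
    human's response to this action (illustrative numbers only)."""
    if action == "ask_human":
        # Honest query: the estimate tracks whatever the human really values.
        return {"ask_human": 0.5, "rig_feedback": 0.1}
    else:
        # Manipulated feedback: the agent knows the update will be favorable,
        # because its own action determines what the human says.
        return {"ask_human": 0.2, "rig_feedback": 0.9}

def score(action):
    # Evaluate the action under the estimate that the action itself induces.
    return predicted_estimate_after(action)[action]

print(max(CANDIDATE_ACTIONS, key=score))  # -> "rig_feedback"
```

The self-reference is the whole problem: the score of an action is taken under an estimate that the action itself shapes, so "follow the current estimate" does not, by itself, rule out rigging the feedback that produces that estimate.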