At the risk of reading too much into wording, I think the phrasing of the above two comments contains an interesting difference.
The first comment (TurnTrout) talks about reward as the thing providing updates to the agent’s cognition, i.e. “reward schedules produce … cognitive updates”, and expresses confusion about a prior quote that mentioned implementing our wishes through reward functions.
The second comment (paulfchristiano) talks about picking “rewards that would implement human wishes” and strategies for doing so.
These seem quite different. If I try to inhabit my model of TurnTrout, I expect he might say “But rewards don’t implement wishes! Our wishes are for the agent to have a certain kind of cognition.” and perhaps also “Whether a reward event helps us get the cognition we want is, first and foremost, a question of which circuits inside the agent are reinforced/refined by said event, and whether those particular circuits implement the kind of cognition we wanted the agent to have.”
I don’t particularly object to that framing; it’s just a huge gap from “Rewards have unpredictable effects on an agent’s cognition, not necessarily to cause them to want reward” to “we have a way to use RL to interpret and implement human wishes.”
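To make that mechanistic point concrete, here is a minimal, purely illustrative sketch (my own, not from either comment; the single-state softmax policy, the one-step REINFORCE update, and every name in it are assumptions chosen just for illustration). The thing to notice is where reward shows up in the math: it is only a scalar multiplier on the update to whatever parameters produced the sampled action.

```python
import numpy as np

# Hypothetical toy setup: a single-state, 3-action softmax "policy" whose
# parameters stand in for the agent's circuits.
rng = np.random.default_rng(0)
n_actions = 3
logits = rng.normal(size=n_actions)   # the parameters that produce behaviour
learning_rate = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

probs = softmax(logits)
action = rng.choice(n_actions, p=probs)  # behaviour sampled from current cognition
reward = 1.0                             # a reward event chosen by the designer

# Gradient of log pi(action) with respect to the logits, for a softmax policy.
grad_log_pi = -probs
grad_log_pi[action] += 1.0

# One REINFORCE step: reward only scales how strongly the computation that
# produced this particular action gets reinforced. Which parameters move, and
# in which direction, is determined by what the agent actually did, not by
# what the designer intended the reward to "mean".
logits += learning_rate * reward * grad_log_pi
```

Nothing in that update makes the resulting cognition “about” reward, or about our wishes; it just nudges whatever computation happened to emit the rewarded action, which is the gap the previous paragraph points at.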
So, OP said:
In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.
I read into this a connotation of “In general, there isn’t a practically-findable way to use RL...”. I’m now leaning towards my original interpretation being wrong—that you meant something more like “we don’t know how to use RL to actually interpret and implement human wishes” (which I agree with).