So, shouldn’t wisely-selected reward schedules produce good cognitive updates, which produce a mind that implements human wishes?
I don’t think we know how to pick rewards that would implement human wishes. It’s great if people want to propose wise strategies and then argue or demonstrate that they have that effect.
On the other side: if you have an easily measurable reward function, then you can find a policy that gets a high reward by using RL (essentially gradient descent on “how much reward does the policy get.”) I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.
At the risk of reading too much into wording, I think the phrasing of the above two comments contains an interesting difference.
The first comment (TurnTrout) talks about reward as the thing providing updates to the agent’s cognition, i.e. “reward schedules produce … cognitive updates”, and expresses confusion about a prior quote that mentioned implementing our wishes through reward functions.
The second comment (paulfchristiano) talks about picking “rewards that would implement human wishes” and strategies for doing so.
These seem quite different. If I try to inhabit my model of TurnTrout, I expect he might say “But rewards don’t implement wishes! Our wishes are for the agent to have a certain kind of cognition.” and perhaps also “Whether a reward event helps us get the cognition we want is, first and foremost, a question of which circuits inside the agent are reinforced/refined by said event, and whether those particular circuits implement the kind of cognition we wanted the agent to have.”
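To make that framing concrete, here’s a minimal, purely illustrative REINFORCE-style sketch on a two-armed bandit (my own toy example, not anything either commenter wrote). The reward only ever appears as a scalar weighting the update to whatever parameters produced the sampled action; nothing in the loop requires the resulting policy to internally represent or “want” reward; it just ends up collecting a lot of it on the training distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                       # logits for actions 0 and 1 (the "circuits" being updated)
reward_per_action = np.array([0.0, 1.0])  # an easily measurable reward: action 1 pays off, action 0 doesn't
lr = 0.5

for step in range(200):
    probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy over the two actions
    action = rng.choice(2, p=probs)               # sample an action from the current policy
    reward = reward_per_action[action]

    # Gradient of log pi(action | theta) for a softmax over logits.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # REINFORCE update: the reward is just a scalar scaling the reinforcement
    # of whatever parameters produced the sampled action.
    theta += lr * reward * grad_log_pi

# After training, the policy puts most of its probability on the rewarded action,
# i.e. it gets a lot of reward on-distribution without ever representing reward.
print(np.exp(theta) / np.exp(theta).sum())
```

In this toy case the reward we picked happens to match what we wanted, but the same update rule would just as readily reinforce whichever parameters produced a high-reward action we didn’t want.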
I don’t particularly object to that framing; it’s just a huge gap from “Rewards have unpredictable effects on an agent’s cognition, not necessarily causing it to want reward” to “we have a way to use RL to interpret and implement human wishes.”
it’s just a huge gap from “Rewards have unpredictable effects on an agent’s cognition, not necessarily causing it to want reward” to “we have a way to use RL to interpret and implement human wishes.”
So, OP said
In general, we have no way to use RL to actually interpret and implement human wishes, rather than to optimize some concrete and easily-calculated reward signal.
I read into this a connotation of “In general, there isn’t a practically-findable way to use RL...”. I’m now leaning towards my original interpretation being wrong—that you meant something more like “we don’t know how to use RL to actually interpret and implement human wishes” (which I agree with).
I agree that this need not produce a policy that is trying to get reward, just one that in fact gets a lot of reward on distribution.
I think this tells us relatively little about the internal cognition, and so is a fairly non-actionable fact (which you probably agree with?). But I want to sort out my thoughts more here before laying out more of my intuitions on that.
Related clarifying question:
To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit[...]
The shareholders of such a DAO may be able to capture the value it creates as long as they are able to retain effective control over its computing hardware / reward signal.
Do you remember why you thought it important that the shareholders retain control over the reward signal? I can think of several interpretations, but am curious what your take was.