I agree that for such a system, the optimal policy of the actor is to rig the estimator, and to “intentionally” bias it towards easy-to-satisfy rewards like “the human loves heroin”.
The part that confuses me is why we’re having two separate systems with different objectives where one system is dumb and the other system is smart.
We don’t need to have two separate systems. There are two meanings to your “bias it towards” phrase: the first one is the informal human one, where “the human loves heroin” is clearly a bias. The second is some formal definition of what is biasing and what isn’t. And the system doesn’t have that. The “estimator” doesn’t “know” that “the human loves heroin” is a bias; instead, it sees this as a perfectly satisfactory way of accomplishing its goals, according to the bridging function it’s been given. There is no conflict between estimator and actor.
Imagine that you have a complex CIRL game that models the real world well but assumes that the human is Boltzmann-rational. [...] Such a policy is going to “try” to learn preferences, learn incorrectly, and then act according to those incorrect learned preferences, but it is not going to “intentionally” rig the learning process.
The AI would not see any of these actions as “rigging”, even if we would.
It might think “hey, I should check whether the human likes heroin by giving them some”, and then think “oh they really do love heroin, I should pump them full of it”.
It will do this if it can’t already predict the effect of giving them heroin.
It won’t think “aha, if I give the human heroin, then they’ll ask for more heroin, causing my Boltzmann-rationality estimator module to predict they like heroin, and then I can get easy points by giving humans heroin”.
If it can predict the effect of giving humans heroin, it will think something like that. It will think: “if I give the humans heroin, they’ll ask for more heroin; my Boltzmann-rationality estimator module confirms that this means they like heroin, so I can efficiently satisfy their preferences by giving humans heroin”.
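To make the "estimator module confirms" step concrete, here is a minimal sketch of a Boltzmann-rationality update. Everything in it is an assumption for illustration: the two candidate reward functions, the option names, the numbers, and the rationality parameter `beta` are all hypothetical and not taken from any particular CIRL implementation. The point is only that the update rule has no notion of "bias"; repeated requests simply count as evidence.

```python
import math

# Hypothetical candidate reward functions over the human's two options.
HYPOTHESES = {
    "likes_heroin": {"ask_for_heroin": 1.0, "decline": 0.0},
    "dislikes_heroin": {"ask_for_heroin": 0.0, "decline": 1.0},
}


def boltzmann_likelihood(choice, rewards, beta=1.0):
    """P(choice | rewards) for a human assumed to be Boltzmann-rational."""
    normalizer = sum(math.exp(beta * r) for r in rewards.values())
    return math.exp(beta * rewards[choice]) / normalizer


def update(belief, choice, beta=1.0):
    """One Bayesian update of the belief over reward hypotheses."""
    unnormalized = {
        h: belief[h] * boltzmann_likelihood(choice, HYPOTHESES[h], beta)
        for h in belief
    }
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Start from an even prior and observe five requests for more heroin.
belief = {"likes_heroin": 0.5, "dislikes_heroin": 0.5}
for _ in range(5):
    belief = update(belief, "ask_for_heroin")

print(belief)
```

Under these assumed numbers the belief in "likes_heroin" ends up above 0.99, whether the requests come from a genuine preference or from addiction; the estimator cannot tell the difference unless the model it was given encodes one.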
I think Rohin’s point is that the model of
“if I give the humans heroin, they’ll ask for more heroin; my Boltzmann-rationality estimator module confirms that this means they like heroin, so I can efficiently satisfy their preferences by giving humans heroin”.
is more IRL than CIRL. It doesn’t necessarily assume that the human knows their own utility function and is trying to play a cooperative strategy with the AI that maximizes that same utility function. If I knew that what would really maximize utility is having that second hit of heroin, I’d try to indicate it to the AI I was cooperating with.
Problems with IRL look like “we modeled the human as an agent based on representative observations, and now we’re going to try to maximize the modeled values, and that’s bad.” Problems with CIRL look like “we’re trying to play this cooperative game with the human that involves modeling it as an agent playing the same game, and now we’re going to try to take actions that have really high EV in the game, and that’s bad.”
The key point is not that the AI knows what is or isn’t “rigging”, or that the AI “knows what a bias is”. The key point is that in a CIRL game, by construction there is a true (unknown) reward function, and thus an optimal policy must be viewable as being Bayesian about the reward function, and in particular its actions must be consistent with conservation of expected evidence about the reward function; anything which “rigs” the “learning process” does not satisfy this property and so can’t be optimal.
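As a small numerical illustration of conservation of expected evidence, here is a sketch with made-up numbers (the prior, the action names, and the observation likelihoods are all hypothetical). For a Bayesian agent whose observation model matches the game, the expected posterior over the reward equals the prior no matter which action it takes, so no action can be expected in advance to push the belief in a chosen direction.

```python
# Hypothetical prior over the fixed-per-episode reward function.
PRIOR = {"likes": 0.3, "dislikes": 0.7}

# Hypothetical P(observation | reward, action). "offer" is informative,
# "wait" is not.
LIKELIHOOD = {
    "offer": {
        "likes": {"asks": 0.9, "does_not_ask": 0.1},
        "dislikes": {"asks": 0.2, "does_not_ask": 0.8},
    },
    "wait": {
        "likes": {"asks": 0.5, "does_not_ask": 0.5},
        "dislikes": {"asks": 0.5, "does_not_ask": 0.5},
    },
}
OBSERVATIONS = ("asks", "does_not_ask")


def posterior(action, obs):
    """Bayes' rule for the reward hypothesis given an action and observation."""
    unnormalized = {r: PRIOR[r] * LIKELIHOOD[action][r][obs] for r in PRIOR}
    total = sum(unnormalized.values())
    return {r: p / total for r, p in unnormalized.items()}


def expected_posterior(action):
    """Average the posterior over observations, weighted by their probability."""
    result = {r: 0.0 for r in PRIOR}
    for obs in OBSERVATIONS:
        p_obs = sum(PRIOR[r] * LIKELIHOOD[action][r][obs] for r in PRIOR)
        post = posterior(action, obs)
        for r in PRIOR:
            result[r] += p_obs * post[r]
    return result


for action in LIKELIHOOD:
    print(action, expected_posterior(action))
```

Both actions give an expected posterior equal to the prior (up to floating-point error). Individual observations still move the belief; it is only the average over what the agent expects to see that cannot move.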
You might reasonably ask where the magic happens. The CIRL game that you choose would have to commit to some connection between rewards and behavior. It could be that in one episode the human wants heroin (but doesn’t know it) and in another episode the human doesn’t want heroin (this depends on the prior over rewards). However, it could never be the case that in a single episode (where the reward must be fixed) the human doesn’t want heroin, and then later in the same episode the human does want heroin. Perhaps in the real world this can happen; that would make this policy suboptimal in the real world. (What it does then is unclear since it depends on how the policy generalizes out of distribution.)
If this doesn’t clarify it, I’ll probably table this discussion until publishing an upcoming paper on CIRL games (where it will probably be renamed to assistance games).
EDIT: Perhaps another way to put this: I agree that if you train an AI system to act such that it maximizes the expected reward under the posterior inferred by a fixed update rule looking at the AI system’s actions and resulting states, the AI will tend to gain reward by choosing actions which when plugged into the update rule lead to a posterior that is “easy to maximize”. This seems like training the controller but not training the estimator, and so the controller learns information about the world that allows it to “trick” the estimator into updating in a particular direction (something that would be disallowed by the rules of probability applied to a unified Bayesian agent, and is only possible here because either a) the estimator is uncalibrated or b) the controller learns information that the estimator doesn’t know).
Instead, you should train an AI system such that it maximizes the expected reward it gets under the prior; this is what CIRL / assistance games do. This is kinda sorta like training both the “estimator” and the “controller” simultaneously, and so the controller can’t gain any information that the estimator doesn’t have (at least at optimality).
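A toy, one-step sketch of the contrast between the two training objectives, under loudly stated assumptions: the reward numbers, the prior, and the estimator's likelihoods are invented, and the "true dynamics" (heroin makes the human ask for more regardless of their reward) are known to the controller but not modeled by the estimator. This is not a CIRL implementation, just the two scoring rules side by side.

```python
# Hypothetical reward of each action under each candidate reward function.
REWARD = {
    "likes": {"give_heroin": 1.0, "do_nothing": 0.0},
    "dislikes": {"give_heroin": -5.0, "do_nothing": 0.0},
}
PRIOR = {"likes": 0.5, "dislikes": 0.5}
ACTIONS = ("give_heroin", "do_nothing")

# The estimator's fixed model (an assumption): asking is strong evidence
# of liking. It does not know that heroin itself causes the asking.
ESTIMATOR_P_ASK = {"likes": 0.99, "dislikes": 0.01}


def human_asks(action):
    """True dynamics, known to the controller: heroin always leads to asking."""
    return action == "give_heroin"


def fixed_update(action):
    """Posterior the fixed update rule produces after the action's outcome."""
    if not human_asks(action):
        # Simplification: no request leaves the estimate unchanged.
        return dict(PRIOR)
    unnormalized = {r: PRIOR[r] * ESTIMATOR_P_ASK[r] for r in PRIOR}
    total = sum(unnormalized.values())
    return {r: p / total for r, p in unnormalized.items()}


def score_under_posterior(action):
    """Objective 1: expected reward under the estimator's resulting posterior."""
    post = fixed_update(action)
    return sum(post[r] * REWARD[r][action] for r in post)


def score_under_prior(action):
    """Objective 2: expected reward under the prior over the true reward."""
    return sum(PRIOR[r] * REWARD[r][action] for r in PRIOR)


rigged_choice = max(ACTIONS, key=score_under_posterior)
prior_choice = max(ACTIONS, key=score_under_prior)
print(rigged_choice, prior_choice)
```

With these numbers the first objective picks `give_heroin` (the action that moves the estimator to an easy-to-satisfy posterior), while the second picks `do_nothing`, because averaging over the prior gives the action no credit for the belief shift it causes.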
Thanks! Responded here: https://www.lesswrong.com/posts/EYEkYX6vijL7zsKEt/reward-functions-and-updating-assumptions-can-hide-a