I recently had occasion to write up quick thoughts about the role of assistance games (CIRL) in AI alignment, and how they relate to the problem of fully updated deference. I thought I’d crosspost here as a reference.
Assistance games / CIRL are a similar sort of thing to CEV. Just as CEV is English poetry about what we want, assistance games are math poetry about what we want. In particular, neither CEV nor assistance games tell you how to build a friendly AGI. You need to know something about how the capabilities arise for that.
One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
Well-specified assistive agents (i.e. ones where you got the observation model and reward space exactly correct) do many of the other nice things corrigible agents do, like the 5 bullet points at the top of this post. Obviously we don’t know how to correctly specify the observation model and reward space, so this is not a solution to alignment, which is why it is “math poetry about what we want”.
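Since “observation model” and “reward space” are doing a lot of work in that claim, here is my rough sketch of what they refer to in the standard CIRL-style setup (the notation is my gloss and varies across papers; Boltzmann rationality is just one common choice of observation model):

```latex
% Sketch of an assistance game / CIRL game (roughly following the original
% CIRL formulation): a two-player game with identical payoffs between a
% human H and a robot R.
\[
  M \;=\; \big\langle\, S,\ \{A^H, A^R\},\ T,\ \Theta,\ R,\ P_0,\ \gamma \,\big\rangle
\]
% The shared reward R(s, a^H, a^R; \theta) depends on a parameter
% \theta \in \Theta (the "reward space"), drawn from the prior P_0 and
% observed only by the human. Both players maximize the same expected
% discounted reward, so the robot must infer \theta from the human's
% behavior, using an assumed model of how the human acts given \theta
% (the "observation model"), e.g. Boltzmann rationality:
\[
  \pi^H(a^H \mid s, \theta) \;\propto\; \exp\!\big(\beta\, Q_\theta(s, a^H)\big),
\]
% where Q_\theta is the human's action value under \theta and \beta is a
% rationality coefficient.
```

“Well-specified” then means the robot’s reward space actually contains (something close to) the true human reward, and its model of the human’s policy matches how humans actually behave.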
Another objection: ultimately an assistive agent becomes equivalent to optimizing a fixed reward, aren’t things that optimize a fixed reward bad? Again, I think this seems totally fine; the intuition that “optimizing a fixed reward is bad” comes from our expectation that we’ll get the fixed reward wrong, because there’s so much information that has to be in that fixed reward. An assistive agent will spend a long time gaining all the information about the reward—it really should get it correct (barring misspecification)! If we imagine the superintelligent CIRL sovereign, it has billions of years to optimize the universe! It would be worth it to spend a thousand years to learn a single bit about the reward function if that has more than a 1 in a million chance of doubling the resulting utility (and obviously going from existential catastrophe to not-that seems like a huge increase in utility).
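The back-of-the-envelope version of that last claim, under my simplifying assumption that achievable value accrues roughly linearly over the remaining time (so a delay of T years out of a horizon of N years forfeits about a T/N fraction of it):

```latex
% Let U be the utility achievable with the remaining time, p the chance that
% the learned bit doubles it, T = 10^3 years of learning, N = 10^9 years of
% optimization. Learning the bit is worth it when
\[
  \underbrace{p \cdot U}_{\text{expected gain from doubling}}
  \;>\;
  \underbrace{\tfrac{T}{N} \cdot U}_{\text{cost of the delay}}
  \quad\Longleftrightarrow\quad
  p \;>\; \frac{T}{N} \;=\; \frac{10^{3}}{10^{9}} \;=\; 10^{-6},
\]
% i.e. a better-than-one-in-a-million chance of doubling the outcome justifies
% the thousand-year detour.
```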
I don’t personally work on assistance-game-like algorithms because they rely on having explicit probability distributions over high-dimensional reward spaces, which we don’t have great techniques for, and I think we will probably get AGI before we have great techniques for that. But this is more about what I expect drives AGI capabilities than about some fundamental “safety problems” with assistance games.
Another point against assistance games is that they might have very narrow “safety margins”, i.e. if you get the observation model slightly wrong, maybe you get a slightly wrong reward function, and that still leads to an existential catastrophe because value is fragile. (Though this isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!) If this were the only point against assistance (i.e. the previous bullet point somehow didn’t apply), I’d still be keen for a large fraction of the field to push forward the assistance games approach, while the rest look for approaches with wider safety margins.

(I made some of these points before in my summary of Human Compatible.)
> One objection: an assistive agent doesn’t let you turn it off, how could that be what we want? This just seems totally fine to me — if a toddler in a fit of anger wishes that its parents were dead, I don’t think the maximally-toddler-aligned parents would then commit suicide, that just seems obviously bad for the toddler.
I think this is way more worrying in the case where you’re implementing an assistance game solver, where this lack of off-switchability means your margins for safety are much narrower.
> Though [the claim that slightly wrong observation model ⇒ doom] isn’t totally clear, e.g. is it really that easy to mess up the observation model such that it leads to a reward function that’s fine with murdering humans? It seems like there’s a lot of evidence that humans don’t want to be murdered!
I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem. In those cases, the way you know how ‘human preferences’ rank future hyperslavery vs wireheaded rat tiling vs humane utopia is by how human actions affect the likelihood of those possible worlds. But that’s probably not well-modelled by Boltzmann rationality (e.g. the thing I’m most likely to do today is not to write a short computer program that implements humane utopia), and it seems like your inference is going to be very sensitive to plausible variations in the observation model.
> I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem
It’s also not super clear what you would algorithmically do instead—words are kind of vague, and trajectory comparisons depend crucially on getting the right info about the trajectory, which is hard, as per the ELK document.

That’s what future research is for!
I agree the lack of off-switchability is bad for safety margins (that was part of the intuition driving my last point).
> I think it’s more concerning in cases where you’re getting all of your info from goal-oriented behaviour and solving the inverse planning problem
I agree Boltzmann rationality (over the action space of, say, “muscle movements”) is going to be pretty bad, but any realistic version of this is going to include a bunch of sources of info, including “things that humans say”, and the human can just tell you that hyperslavery is really bad. Obviously you can’t trust everything that humans say, but it seems plausible that, if we spent a bunch of time figuring out a good observation model, that would then lead to okay outcomes. (There’s a toy sketch of this below.)
(Ideally you’d figure out how you were getting AGI capabilities, and then leverage those capabilities towards the task of “getting a good observation model” while you still have the ability to turn off the model. It’s hard to say exactly what that would look like since I don’t have a great sense of how you get AGI capabilities under the non-ML story.)
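To make the two points above concrete, here is a toy sketch of my own (none of it is from the original discussion): a single observed human choice, a few made-up reward hypotheses, and a Boltzmann-rational observation model. Varying the rationality coefficient beta stands in for “plausible variations in the observation model”, and a crude “things humans say” channel shows how extra evidence sources get folded in.

```python
# Toy sketch (my construction, not from the post): Bayesian inference over a
# handful of made-up reward hypotheses, given one observed human action, under
# a Boltzmann-rational observation model. The point: the posterior is quite
# sensitive to the assumed rationality coefficient beta, and adding a second
# evidence channel ("things humans say") changes it again.
import numpy as np

ACTIONS = ["write_utopia_program", "go_to_work", "do_nothing"]

# Hypothetical reward hypotheses: the utility each assigns to each action.
REWARD_HYPOTHESES = {
    "wants_utopia":     np.array([10.0, 1.0, 0.0]),
    "wants_status_quo": np.array([0.0, 2.0, 1.0]),
    "indifferent":      np.array([1.0, 1.0, 1.0]),
}
PRIOR = {name: 1.0 / len(REWARD_HYPOTHESES) for name in REWARD_HYPOTHESES}


def boltzmann_likelihood(action_idx, utilities, beta):
    """P(action | hypothesis): softmax over utilities with rationality beta."""
    logits = beta * utilities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action_idx]


def reward_posterior(action_idx, beta, use_statement=False):
    """Posterior over hypotheses after one observed action (and optionally a statement)."""
    unnorm = {}
    for name, utilities in REWARD_HYPOTHESES.items():
        p = PRIOR[name] * boltzmann_likelihood(action_idx, utilities, beta)
        if use_statement:
            # Crude model of the human saying "utopia would be great": more
            # likely under hypotheses that rank the utopia action (joint) highest.
            p *= 0.9 if utilities[0] == utilities.max() else 0.2
        unnorm[name] = p
    z = sum(unnorm.values())
    return {name: p / z for name, p in unnorm.items()}


observed = ACTIONS.index("go_to_work")  # what the human actually did today

# "Plausible variations in the observation model": same data, different beta.
for beta in (0.1, 1.0, 5.0):
    post = reward_posterior(observed, beta)
    print(f"beta={beta}: " + ", ".join(f"{k}={v:.2f}" for k, v in post.items()))

# Folding in a verbal-report channel alongside the action.
post = reward_posterior(observed, beta=1.0, use_statement=True)
print("with statement: " + ", ".join(f"{k}={v:.2f}" for k, v in post.items()))
```

Running it, the posterior over reward hypotheses shifts substantially across the different beta values, and shifts again once the statement channel is added; the real versions of these sensitivities are what the safety-margin worry is about.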
I mentioned above that I’m not that keen on assistance games because they don’t seem like a great fit for the specific ways we’re getting capabilities now. A more direct comment on this point that I recently wrote:
I broadly agree that assistance games are a pretty great framework. The main reason I don’t work on them is that it doesn’t seem like it works as a solution if you expect AGI via scaled up deep learning. (Whereas I’d be pretty excited about pushing forward on it if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)
The main difference in the deep learning case is that it looks like you are doing a search over programs for a program that performs well on your loss function, and the intelligent thing is the learned program, as opposed to the search that found it. If you wanted assistance-style safety, then the learned program would need to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down that uncertainty by observing human behavior).
But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)
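A deliberately silly toy example of my own (nothing to do with real neural networks) to make the underdetermination point concrete:

```python
# Two "learned programs" that are indistinguishable on the training inputs but
# behave very differently off-distribution: training only constrains behavior
# on the inputs we actually checked.

def program_a(x):
    """Generalizes the way we intended."""
    return 2 * x

def program_b(x):
    """Agrees with program_a on every training input, then does something else."""
    return 2 * x if x < 100 else -1

TRAIN_INPUTS = range(100)
assert all(program_a(x) == program_b(x) for x in TRAIN_INPUTS)
print(program_a(1000), program_b(1000))  # 2000 vs -1: the training data never decided this
```

The analogue here is that “reasons in a CIRL-like way” versus “doesn’t” is exactly the kind of property that agreement on the training distribution fails to pin down.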