One nice point that this post makes (which I suppose was also prominent in the talk, but I can only guess, not being there myself) is that there’s a kind of progression we can draw (simplifying a little):
- Human specifies what to do (Classical software)
- Human specifies what to achieve (RL)
- Machine infers a specification of what to achieve (IRL)
- Machine collaborates with human to infer and achieve what the human wants (Assistance games)
Towards the end, this post describes an extrapolation of this trend:
- Machine and human collaboratively figure out what the human even wants to do in the first place.
‘Helping humans figure out what they want’ is a deep, complex and interesting problem, and I’d love it if more folks were thinking through what solutions to it ought to look like. This seems particularly urgent because human motivations can be affected even by algorithms that were not designed to solve this problem, and which therefore aren’t doing what we’d want them to do—think of recommender systems shaping their users’ habits.
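To make the contrast between those rungs a bit more concrete, here is a minimal toy sketch of my own (a five-cell corridor world; none of the names, rewards or details come from the post or the talk): a scripted policy stands in for ‘specify what to do’, a brute-force planner handed a reward function stands in for ‘specify what to achieve’ (real RL would learn this by trial and error rather than exhaustive search), and picking the goal that best explains a demonstration is a crude stand-in for IRL. An assistance game would go one step further and act usefully while still uncertain about the inferred goal.

```python
# Toy illustration only -- a five-cell corridor world of my own invention.
from itertools import product

CELLS = range(5)      # positions 0..4
ACTIONS = (-1, +1)    # step left / step right

def step(pos, action):
    """Move one cell, clamped to the ends of the corridor."""
    return min(max(pos + action, 0), 4)

# 1. Classical software: the human specifies what to do.
#    The behaviour itself is the specification.
def scripted_policy(pos):
    return +1  # "always move right"

# 2. RL-style: the human specifies what to achieve (a reward function),
#    and a generic optimiser finds the behaviour. Here the 'optimiser' is
#    just brute-force search over short action sequences.
def plan(reward, start=0, horizon=4):
    best_seq, best_ret = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        pos, ret = start, 0.0
        for a in seq:
            pos = step(pos, a)
            ret += reward(pos)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

def reward_reach(goal):
    """Reward 1 for every step spent at the goal cell."""
    return lambda pos: 1.0 if pos == goal else 0.0

# 3. IRL-style: the machine never sees a reward function, only a human
#    demonstration, and infers which goal would best explain it.
def infer_goal(demonstration, start=0):
    def explains(goal):
        return plan(reward_reach(goal), start, len(demonstration)) == tuple(demonstration)
    return [g for g in CELLS if explains(g)]

if __name__ == "__main__":
    print(plan(reward_reach(3)))   # behaviour found by optimising a given reward
    demo = (+1, +1, -1, +1)        # a human demonstration hovering around cell 2
    print(infer_goal(demo))        # -> [2]: the goal that explains the demo
    # An assistance game would go further still: keep the uncertainty over the
    # goal and choose actions (or questions) that are useful under it.
```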
---
Another nice point is the connection between ML algorithm design and HCI. I’ve been meaning to write something looking at RL as a ‘technique for communicating and achieving human intent’ (and, as a corollary, at AI safety as a kind of human-centred algorithm design), but it seems that I’ve been scooped by Michael :)
I note that not everyone sees RL through this frame. Some RL researchers view it as a way of understanding intelligence in the abstract, without connecting reward to human values.
---
One thing I’m a little less sure of is the conclusion you draw from your examples of changing intentions. While the examples convince me that the AI ought to have some sophistication about the human’s intentions—for example, being aware that human intentions can change—it’s not obvious that the right move is to ‘pop out’ further and assume there is something ‘bigger’ that the human’s intentions should be aligned with. Could you elaborate on what you have in mind there?

Thanks for a great post.

---
Thank you for the kind words.
Well, it would definitely be a mistake to build an AI system that extracts human intentions at some fixed point in time and treats them as fixed forever, yes? So it seems to me that it would be better to build systems predicated on the underlying generator of the trajectory of human intentions. When I say “something bigger that the human’s intentions should be aligned with” I don’t mean “physically bigger”; I mean “prior to” or “the cause of”.
For example, the work concerning corrigibility is about building AI systems that can be modified later, yes? But why is it good to have AI systems that can be modified later? I would say that the implicit claim underlying corrigibility research is that we humans have the capacity to, over time, slowly and with many detours, align our own intentions with that which is actually good. So we believe that if we align AI systems with human intentions in a way that is not locked in, then we will be aligning AI systems with that which is actually good. I’m not claiming this is true, just that this is a premise of corrigibility being good.
Another way of looking at it:
Suppose we look at a whole universe with a single human embedded in it, and we ask: where in this system should we look in order to discover the trajectory of this human’s intentions as they evolve through time? We might draw a boundary around the human’s left foot and ask: can we discover the trajectory of this human’s intentions by examining the configuration of this part of the world? We might draw a boundary around the human’s head and ask the same question, and I think some would say in this case that the answer is yes, we can discover the human’s intentions by examining the configuration of the head. But this is a remarkably strong claim: it asserts that there is no information crucial to tracing the trajectory of the human’s intentions over time in any part of the system outside the head.

If we draw a boundary around the entire human, this is still an incredibly strong claim. We have a big physical system with constant interactions between the regions inside and outside this boundary, and we can see that every part of the physical configuration of the region inside the boundary is affected over time by the physical configuration of the region outside it. It is not impossible that all the information relevant to discovering the trajectory of intentions is inside the boundary, but it is a very strong claim to make. On what basis might we make such a claim?
One way to defend the claim that the trajectory of intentions can be discovered by looking just at the head, or just at the whole human, is to postulate that intentions are fixed. In that case we could extract the human’s current intentions from the physical configuration of their head, which does seem highly plausible, and then the trajectory of intentions over time would just be a constant. But I do not think it is plausible that intentions are fixed like this.