To be clear, this post was not arguing that CIRL is not goal-directed—you’ll notice that CIRL is not on my list of potential non-goal-directed models above.
I think CIRL is in this weird in-between place where it is kind of sort of goal-directed. You can think of three different kinds of AI systems:
An agent optimizing a known, definite utility function
An agent optimizing a utility function that it is uncertain about, that it gets information about from humans
A system that isn’t maximizing any simple utility function at all
I claim the first is clearly goal-directed, and the last is not goal-directed. CIRL is in the second category, where it’s not totally clear: its actions are driven by a goal, but that goal comes from another agent (a human). (This is also the case with imitation learning, and that case is also not clear—see this thread.)
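To make the first two categories concrete, here's a rough Python sketch of a fixed-reward agent next to an agent that is uncertain about the reward and updates on human choices. The one-shot feature setup, the Boltzmann model of the human, and the class names are all my own illustrative assumptions, not the actual CIRL formalism:

```python
import numpy as np

# Toy one-shot setting: three possible actions, each described by a feature vector.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5]])


class FixedRewardAgent:
    """Category 1: the utility function (reward weights) is known and definite."""

    def __init__(self, weights):
        self.weights = np.asarray(weights)

    def act(self):
        # Just maximize the known utility.
        return int(np.argmax(features @ self.weights))


class UncertainRewardAgent:
    """Category 2 (CIRL-flavored): the agent keeps a belief over reward weights
    and updates it from observed human choices before acting."""

    def __init__(self, candidate_weights):
        self.candidates = np.asarray(candidate_weights)   # hypotheses about the reward
        self.belief = np.full(len(self.candidates), 1.0 / len(self.candidates))

    def observe_human_choice(self, chosen_action, rationality=5.0):
        # Assume the human chooses Boltzmann-rationally under the true reward.
        utilities = features @ self.candidates.T           # shape: (actions, hypotheses)
        likelihood = np.exp(rationality * utilities)
        likelihood /= likelihood.sum(axis=0)                # P(action | hypothesis)
        self.belief *= likelihood[chosen_action]
        self.belief /= self.belief.sum()

    def act(self):
        # Maximize expected utility under the current belief about the reward.
        expected_weights = self.belief @ self.candidates
        return int(np.argmax(features @ expected_weights))


# The known-reward agent pursues its fixed goal directly.
print(FixedRewardAgent([1.0, 0.0]).act())                   # -> 0

# The uncertain-reward agent's goal gets pinned down by human behavior.
agent = UncertainRewardAgent([[1.0, 0.0], [0.0, 1.0]])
agent.observe_human_choice(chosen_action=1)                 # human picks the action favored by [0, 1]
print(agent.belief, agent.act())                            # belief shifts toward [0, 1]; acts accordingly
```

The point is just that the second agent still acts to maximize something, but what that something is gets pinned down by the human's behavior rather than being baked in.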
I did in fact keep thinking to myself: “How do we make sure that the AI really has the goal of ‘the human gets what they want’, as opposed to a proxy for it that will diverge out-of-distribution?”
I think this is a reasonable critique to have. In the context of Stuart’s book, this is essentially a quibble with principle 3:
3. The ultimate source of information about human preferences is human behavior.
The goal learned by the AI system depends on how it maps human behavior (or sensory data) into (beliefs about) human preferences. If that mapping is not accurate (quite likely), then it will in fact learn some other goal, which could be catastrophic.
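As a toy illustration of that failure mode (everything here, including the Boltzmann observation model and the bias term, is an assumption of mine rather than anything from the book): if the AI assumes the human is an unbiased noisy optimizer, but the human's behavior is also shaped by a systematic bias, the inferred preferences can confidently land on the wrong hypothesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two reward hypotheses over two actions; hypothesis 0 is the human's true preference.
# utilities[action, hypothesis]
utilities = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
TRUE_HYPOTHESIS = 0


def human_choice(bias=1.5, rationality=5.0):
    """The real human: Boltzmann-noisy, plus a systematic pull toward action 1
    (habit, convenience, ...) that is not part of their actual preferences."""
    utils = utilities[:, TRUE_HYPOTHESIS].copy()
    utils[1] += bias                                 # the part the AI's model leaves out
    probs = np.exp(rationality * utils)
    probs /= probs.sum()
    return rng.choice(2, p=probs)


def infer_preferences(observations, rationality=5.0):
    """The AI's behavior-to-preferences mapping: assumes the human is an unbiased
    Boltzmann-rational optimizer of one of the hypothesized rewards."""
    belief = np.full(utilities.shape[1], 1.0 / utilities.shape[1])
    likelihood = np.exp(rationality * utilities)
    likelihood /= likelihood.sum(axis=0)             # P(action | hypothesis)
    for action in observations:
        belief *= likelihood[action]
        belief /= belief.sum()
    return belief


observations = [human_choice() for _ in range(100)]
print(infer_preferences(observations))  # concentrates on hypothesis 1, not the true preference
```

With enough observations the posterior concentrates on hypothesis 1 even though the human's true preference is hypothesis 0, purely because the behavior-to-preferences mapping is misspecified.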
Thanks! Pulling on that thread a bit more, compare:
My goal is that the human overseer achieves her goals. To accomplish this, I need to observe and interact with the human to understand her better—what kind of food she likes, how she responds to different experiences, etc. etc.
My goal is to maximize the speed of this racecar. To accomplish this, I need to observe and interact with the racecar to understand it better—how its engine responds to different octane fuels, how its tires respond to different weather conditions, etc. etc.
To me, they don’t seem that different on a fundamental level. But they do have the super-important practical difference that the first one doesn’t seem to have problematic instrumental subgoals.
(I think I’m just agreeing with your comment here?)
Yeah, I think that’s basically right.