Rohin, I really like the distinction you draw between “build[ing] an AI system that could maximize an arbitrary function, and then [trying] to program in the utility function we care about” versus “build[ing] systems in such a way that these properties are inherent in the way that they reason.” That was helpful.
However, it seems to me—and please correct me if I’m wrong!—that most or all CIRL papers are framing the problem in terms of understanding a generic goal-seeking system whose goal is “the human gets what they want”. Then papers like The Off-Switch Game show that the goal of “the human gets what they want” leads to nice instrumental goals like not disabling off-switches. Do you agree?
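(For concreteness, here's a tiny numerical sketch of that off-switch intuition as I understand it. It's my own toy version with a made-up belief over the action's utility and an assumed perfectly rational human, not the paper's formal model.)

```python
# Toy sketch of the Off-Switch Game intuition: the robot is uncertain about
# the human utility U of its proposed action; the (assumed rational) human
# will press the off switch exactly when U < 0.

import numpy as np

rng = np.random.default_rng(0)

# Made-up belief over U: the action is probably good, but might be harmful.
utilities = rng.normal(loc=0.3, scale=1.0, size=100_000)  # samples of U

value_act = max(utilities.mean(), 0.0)           # act now, or switch off if E[U] < 0
value_defer = np.maximum(utilities, 0.0).mean()  # wait: the human blocks the U < 0 cases

print(f"acting unilaterally: {value_act:.3f}")
print(f"deferring to human : {value_defer:.3f}")
# Deferring is (weakly) better whenever the robot is uncertain about U, which
# is why the goal "the human gets what they want" makes leaving the off switch
# enabled instrumentally useful.
```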
So when I was reading CIRL papers, or reading Stuart Russell’s new book, I did in fact keep thinking to myself, “How do we make sure that the AI really has the goal of ‘the human gets what they want’, as opposed to a proxy for it that will diverge out-of-distribution?”
IDA / “act-based corrigibility” seems like more of an attempt to break out of the goal-seeking paradigm altogether, although I still haven’t convinced myself that it succeeds.
To be clear, this post was not arguing that CIRL is not goal-directed—you’ll notice that CIRL is not on my list of potential non-goal-directed models above.
I think CIRL is in this weird in-between place where it is kind of sort of goal-directed. You can think of three different kinds of AI systems:
An agent optimizing a known, definite utility function
An agent optimizing a utility function that it is uncertain about, that it gets information about from humans
A system that isn’t maximizing any simple utility function at all
I claim the first is clearly goal-directed, and the last is not goal-directed. CIRL is in the second set, where it’s not totally clear: its actions are driven by a goal, but that goal comes from another agent (a human). (This is also the case with imitation learning, and that case is also not clear—see this thread.)
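To make the second kind more concrete, here is a rough sketch using toy classes I'm making up for this comment (the Boltzmann-rational human model and all the parameters are assumptions, not anything from the CIRL papers). The point is just that the second agent's "goal" is whatever its posterior over utility functions currently says:

```python
# Toy contrast between the first two kinds of systems listed above.

from dataclasses import dataclass
import numpy as np

@dataclass
class FixedUtilityAgent:
    """Kind 1: optimizes a known, definite utility function."""
    utility: np.ndarray  # utility[action]

    def act(self) -> int:
        return int(np.argmax(self.utility))

@dataclass
class UncertainUtilityAgent:
    """Kind 2 (CIRL-flavored): uncertain over candidate utility functions,
    updated from observations of human choices."""
    candidate_utilities: np.ndarray  # shape (n_hypotheses, n_actions)
    posterior: np.ndarray            # shape (n_hypotheses,)

    def update(self, human_choice: int, beta: float = 5.0) -> None:
        # Assume the human picks actions Boltzmann-rationally under the true utility.
        likelihood = np.exp(beta * self.candidate_utilities[:, human_choice])
        likelihood /= np.exp(beta * self.candidate_utilities).sum(axis=1)
        self.posterior = self.posterior * likelihood
        self.posterior /= self.posterior.sum()

    def act(self) -> int:
        # Maximize expected utility under the current belief about the human's goal.
        expected = self.posterior @ self.candidate_utilities
        return int(np.argmax(expected))

# Example: the agent starts unsure whether the human wants action 0 or 1,
# watches the human pick action 1, and its own choice shifts accordingly.
agent = UncertainUtilityAgent(
    candidate_utilities=np.array([[1.0, 0.0], [0.0, 1.0]]),
    posterior=np.array([0.5, 0.5]),
)
agent.update(human_choice=1)
print(agent.posterior, agent.act())
```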
I did in fact keep thinking to myself, “How do we make sure that the AI really has the goal of ‘the human gets what they want’, as opposed to a proxy for it that will diverge out-of-distribution?”
I think this is a reasonable critique to have. In the context of Stuart’s book, this is essentially a quibble with principle 3:
3. The ultimate source of information about human preferences is human behavior.
The goal learned by the AI system depends on how it maps human behavior (or sensory data) into (beliefs about) human preferences. If that mapping is not accurate (quite likely), then it will in fact learn some other goal, which could be catastrophic.
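To illustrate (this is my own toy construction, not anything from the book): suppose the human actually chooses quite noisily, but the robot's likelihood model assumes near-optimal choices. A slightly unlucky run of observations can then make it nearly certain of the wrong utility function.

```python
# Toy sketch of a misspecified behavior-to-preference mapping learning the wrong goal.

import numpy as np

def boltzmann(utilities, beta):
    """P(choice) under a Boltzmann-rational model with rationality parameter beta."""
    p = np.exp(beta * np.asarray(utilities))
    return p / p.sum()

# Two candidate utility functions over three options; hypothesis 0 is the truth.
candidates = np.array([[1.0, 0.0, 0.2],
                       [0.0, 1.0, 0.2]])

# A hand-picked but plausible run of choices from a genuinely noisy human
# (true beta around 0.5, so option 1 gets picked fairly often by chance).
observed_choices = [0, 1, 0, 1, 1, 2, 1, 0, 1, 2]

def infer(assumed_beta):
    posterior = np.array([0.5, 0.5])
    for c in observed_choices:
        likelihood = np.array([boltzmann(u, assumed_beta)[c] for u in candidates])
        posterior = posterior * likelihood
        posterior /= posterior.sum()
    return posterior

print("assuming noisy human (beta=0.5)       :", np.round(infer(0.5), 3))
print("assuming near-optimal human (beta=10) :", np.round(infer(10.0), 3))
# The well-specified model leans the wrong way but stays fairly uncertain;
# the misspecified one becomes almost certain the human's goal is hypothesis 1,
# i.e. it confidently learns a goal other than "the human gets what they want."
```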
Thanks! Pulling on that thread a bit more, compare:
My goal is that the human overseer achieves her goals. To accomplish this, I need to observe and interact with the human to understand her better—what kind of food she likes, how she responds to different experiences, etc. etc.
My goal is to maximize the speed of this racecar. To accomplish this, I need to observe and interact with the racecar to understand it better—how its engine responds to different octane fuels, how its tires respond to different weather conditions, etc. etc.
To me, they don’t seem that different on a fundamental level. But they do have the super-important practical difference that the first one doesn’t seem to have problematic instrumental subgoals.
(I think I’m just agreeing with your comment here?)
Yeah, I think that’s basically right.