Thanks for writing this up. Quick question re: “Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans.” What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: “An agent is aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.” But you don’t say explicitly what it is for an objective to be aligned: I’m curious if you have a preferred formulation.
Is it something like: “the behavioral objective is such that, when the agent does ‘well’ on this objective, the agent doesn’t act in a way we would view as bad/problematic/dangerous/catastrophic”? If so, it seems like a lot might depend on exactly how “well” the agent does, and what opportunities it has in a given context. That is, an “aligned” agent might not stay aligned if it becomes more powerful, but continues optimizing for the same objective (for example, a weak robot optimizing for beating me at chess might be “aligned” because it only focuses on making good chess moves, but a stronger one might not be, because it figures out how to drug my tea). Is that an implication you’d endorse?
Or is the thought something like: “the behavioral objective is such that, no matter how powerfully the agent optimizes for it, and no matter its opportunities for action, it doesn’t take actions we would view as bad/problematic/dangerous/catastrophic”? My sense is that something like this is often the idea people have in mind, especially in the context of anticipating things like intelligence explosions. If this is what you have in mind, though, it may be worth saying so explicitly, since intent alignment in this sense seems like a different constraint than intent alignment in the sense of e.g. “the agent’s pursuit of its behavioral objective does not in fact give rise to bad actions, given the abilities/contexts/constraints that will in fact be relevant to its behavior.”
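To put the contrast schematically (this is just my notation; $\mathrm{Acts}$ and $\mathrm{Bad}$ are stand-ins I’m introducing here, not anything in your post), the two readings are roughly:

$$\text{Aligned}_{\text{in-practice}}(r, M) \;:\iff\; \forall a \in \mathrm{Acts}(M, r):\ \neg\mathrm{Bad}(a)$$

$$\text{Aligned}_{\text{robust}}(r) \;:\iff\; \forall M',\ \forall a \in \mathrm{Acts}(M', r):\ \neg\mathrm{Bad}(a)$$

where $\mathrm{Acts}(M, r)$ is the set of actions $M$ would in fact take in pursuit of $r$ given its actual abilities, contexts, and constraints; $\mathrm{Bad}(a)$ means we would judge $a$ bad/problematic/dangerous/catastrophic; and $M'$ in the second line ranges over arbitrarily capable agents in arbitrary contexts. On the first reading, the chess robot can go from aligned to misaligned just by becoming more capable; on the second, $r$ itself has to rule that out.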
Maybe the best thing to use here is just the same definition as I gave for outer alignment—I’ll change it to reference that instead.
Aren’t they now defined in terms of each other?
“Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.
Outer alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.”
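Written out schematically (my notation, not yours), the loop is:

$$\text{IntentAligned}(M) \;\iff\; \text{OuterAligned}(\mathrm{obj}(M))$$

$$\text{OuterAligned}(r) \;\iff\; \forall M^{*} \in \mathrm{Opt}(r):\ \text{IntentAligned}(M^{*})$$

where $\mathrm{obj}(M)$ is $M$’s behavioral objective and $\mathrm{Opt}(r)$ is the set of models that perform optimally on $r$ in the limit of perfect training and infinite data, so each definition bottoms out in the other.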
Good point—and I think that the reference to intent alignment is an important part of outer alignment, so I don’t want to change that definition. I further tweaked the intent alignment definition a bit to just reference optimal policies rather than outer alignment.
Cool (though FWIW, if you’re going to lean on the notion of policies being aligned with humans, I’d be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I’m assuming you have in mind something like “a policy is aligned with humans if an agent implementing that policy is aligned with humans.”).
Regardless, sounds like your definition is pretty similar to: “An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn’t act in ways that humans judge bad”? If you see it as importantly different from this, I’d be curious.