Good point—and I think that the reference to intent alignment is an important part of outer alignment, so I don’t want to change that definition. I further tweaked the intent alignment definition a bit to just reference optimal policies rather than outer alignment.
Cool (though FWIW, if you’re going to lean on the notion of policies being aligned with humans, I’d be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I’m assuming you have in mind something like “a policy is aligned with humans if an agent implementing that policy is aligned with humans.”).
Regardless, sounds like your definition is pretty similar to: “An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn’t act in ways that humans judge bad”? If you see it as importantly different from this, I’d be curious.