Joe Carlsmith comments on Clarifying inner alignment terminology

Joe Carlsmith 19 Feb 2021 21:33 UTC
5 points
Cool (though FWIW, if you’re going to lean on the notion of policies being aligned with humans, I’d be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I’m assuming you have in mind something like “a policy is aligned with humans if an agent implementing that policy is aligned with humans”).
Regardless, sounds like your definition is pretty similar to: “An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn’t act in ways that humans judge bad”? If you see it as importantly different from this, I’d be curious to hear how.