This seems similar to the natural impact regularization/bounded agency things I’ve been bouncing around. (Though my frame expects it to happen “by default” to a greater extent?) I like your way of describing/framing it.
Let’s make it concrete: we cannot just ask a powerful corrigible AGI to “solve alignment” for us. There is no corrigible way to perform a task which the user is confused about; tools don’t do that.
Strongly agree with this.