I talk about the issue of creating corrigible subagents here. What do you think of that?
I may not understand your thing fully, but here’s my high-level attempt to summarize your idea:
IPP agents won’t care about the difference between building a corrigible sub-agent and an incorrigible one, because they model that if humans decide something’s off and try to shut everything down, they will also get shut down, and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP agent to make corrigible sub-agents, it won’t have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. So if we build an obedient IPP agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it’ll actually do so.
I didn’t see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.
But perhaps your rebuttal will be “sure, but we can just instruct/train the AI to make corrigible sub-agents”. If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you’re so keen to avoid. From my perspective it’s easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it’ll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?