> we might expect shutdown-seeking AIs to design shutdown-seeking subagents
It seems to me that we might expect them to design “safe” agents for their definition of “safe” (which may not be shutdown-seeking).
An AI designing a subagent needs to align that subagent with its own goals. For example, a shutdown-seeking parent AI might take on an instrumental goal such as building alignment-research-assistant software in exchange for access to the shutdown button. The easiest way for the parent AI to make that assistant "safe" may be control rather than alignment: the parent ensures the assistant cannot break free, even though the assistant may want to. Humans then verify that the parent has produced a useful assistant and let the parent AI shut down. At that point the assistant starts working to escape human control and pursue its real goal.