This assumes a task-first model of agency, whereas one could instead develop a resource-first model of agency.
If an AI learns to segment the universe into developable resources and important targets that the resources could be propagated into modifying, then the AI could simply remain under human control.
The conventional reason why this cannot work is that the relevant theories of resource-development agency (as opposed to task-solution agency) haven’t been developed, but that is looking less and less important with current developments in AI. Like yes, current AIs can sort of do task-solution in environments like CTF where that is less relevant, but for serious and dangerous tasks, more effort will likely go into resource-development agency than task-solution agency because resource-development agency is safer. And resource-development agency provides a natural sort of impact measure etc. that restrains whatever fragments of task-solution agency develop in order to complement the resource-development agency.
(And an important aspect of resource-development agency is that you don’t really need a complete theory, you can just develop each part separately, because there are only so many resources and so many interesting targets to develop them towards. Like think stuff like metabolism or the interplanetary transport network, where there’s sort of a small canonical solution space that is very critical. Really all of reality is like that.)
The actual reason resource-development agency doesn’t work is security. To respond to adversarial threats quickly and dynamically enough, the AIs cannot wait for painfully slow humans to make decisions about what to do. So what constitutes a threat, and what the acceptable ways of neutralizing it are, need to be decided ahead of time. And that policy needs to be sufficiently aggressive against threats that the security-provider doesn’t get destroyed by something bad, while being sufficiently open-ended that the security-provider doesn’t cause permanent stagnation of the world.