If we ignore subagents and imagine a Cartesian boundary, "turned off" can easily be defined as "all future outputs are 0".
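To spell that out (a minimal sketch in my own notation, assuming a discrete-time agent whose only interaction with the world is an output channel $o_t$):

$$\text{off after time } T \iff \forall t > T:\ o_t = 0$$

Under a Cartesian boundary this is a statement purely about the AI's own output stream, which is why it is easy; subagents are exactly what break that assumption.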
I also doubt that an AI working as fast as possible is safe in any meaningful sense. Of course, you can move all the magic into "the human judges the world to be ok". If you make lambda large enough, your AI is safe and useless.
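For concreteness, the kind of objective I have in mind is something like (my notation, not necessarily the exact formulation under discussion):

$$U(w) = V(w) - \lambda\, R(w)$$

where $V$ is the task utility, $R$ is the "world doesn't look ok to the human judge" penalty, and $\lambda$ is its weight. Crank $\lambda$ up far enough and the agent just minimizes $R$, i.e. does nothing useful.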
Suppose the utility function is 1 if a widget exists, else 0, where a widget is an easily buildable object that does not currently exist.
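Written out (my phrasing of the same utility):

$$U(w) = \begin{cases} 1 & \text{if a widget exists in } w \\ 0 & \text{otherwise} \end{cases}$$

Nothing in this utility cares how the widget gets made or what else happens to the world, which is what the scenarios below exploit.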
Suppose that ordering the parts through normal channels would take a few weeks. If the AI hacks the nukes and holds the world to ransom, everyone at the widget factory will work nonstop and then drop dead of exhaustion.
Alternatively, it might be able to bootstrap self-replicating nanotech in less time. The AI has no reason to care whether the nanotech that makes the widget is highly toxic, and no reason to care whether it has a shutoff switch or grey-goos the earth after the widget is produced.
"The world looks ok at time T" is not enough: you could still get something bad arising from the way seemingly innocuous pieces were set up at time T. Being switched off and having no subagents in the conventional sense isn't enough either. What if the AI changed some physics data in such a way that humans would collapse the quantum vacuum state, believing the experiment they were doing was safe? Building a subagent is just a special case of having unwanted influence.