Thanks for the pointer! (I only now got to fully process it)
Both of the problems presented there seem to occur in this setup.
Regarding the problem of preventing resistance, you wrote:
> Hypothetically, maybe you can tell the humans not to use bombs if the button fails to depress. But this solution seems both brittle and unrealistic.
I'd just like to share an idea for how this might be tackled. It isn't a great solution, but it does not rely on humans adhering to precommitments.
Suppose our plan is to execute the agent program only for a short while (minutes/hours). We lock a (willing) human overseer in an isolated "control room" where they have access to the button, and only then is the agent invoked. The control room receives input from outside (e.g. it has monitors showing the lab with the robot that the agent controls), but is otherwise completely isolated. The person in the control room cannot communicate with the outside world or leave the control room for the entire duration in which the agent program is supposed to run.
This idea requires that no humans outside of the control room can see what the agent is doing (so the agent has no incentive to conceal or manipulate).
See also: Two problems with causal-counterfactual utility indifference.