Nice proof with a thought-provoking example! I think it could benefit from being translated into a more AI-relevant setting, using the following key:
Northland winning = no shutdown
Southland winning = shutdown
Send messenger to Northland = Act in a way that looks dangerous and causes evaluators to probably shut the AI down
Send messenger to Southland = Act in a way that looks safe and causes evaluators to probably not shut the AI down
Bet on Northland = Set up costly processes to do real work in the future to attain high utility (e.g. build factories and labs that would need to be safely wound down in the event of shutdown)
Bet on Southland = Do very conservative work that can immediately be interrupted without any real cost
Cross-over costs could then be understood as the extra fake work needed to look dangerous when you're actually very interruptible in the NS case, and having to show humans that everything is safe in the SN case (it's not realistic for the utilities to end up as neat as in your example, but this rephrasing helps me keep my eyes on the ball with the various toy scenarios I've considered myself).
However, this actually shows the weakness of relying on pure conditional expectation and the sane pure bets condition. Consider an AI tasked with using nanomedicine to improve human intelligence, which in the case of no shutdown values more humans with higher thinkoomph coming sooner (e.g. $U_N(x) = \sum_h \mathrm{intboost}(h) / \mathrm{timeofintboost}(h)$). It considers two actions: (A) conduct sensible studies to eventually figure out a way of boosting human intelligence without losing anything humans might value or causing other negative side-effects, cooperating with humans to demonstrate this in small studies that can then be scaled up, or (B) devise an intelligence-boosting virus that also rewrites human values to love the AI and never want to shut it down. We can set up the hypothetical so that both actions carry a ~0 risk of the shutdown button being pressed, but (B) yields more utility under the normal utility function, since it bypasses any human worries about rollout and collects the utility from earlier intelligence boosts. Under the sane pure bets condition (if we set up all the details correctly) the AI should prefer (B) to (A), but this is the opposite of what we'd want from a corrigible system. None of this is meant to be particularly realistic, but it's still applicable enough to the theory that it makes me want to reject the suggested desiderata.
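To see the shape of the problem with some entirely made-up numbers: suppose both plans eventually boost the same humans by the same amount, and the only difference is that (B) delivers the boosts earlier because it skips the slow, cooperative rollout. Then $U_N$ as defined above is strictly higher for (B). A minimal sketch (all quantities below are hypothetical placeholders, purely for illustration):

```python
# Purely illustrative, made-up numbers: same humans, same boost sizes,
# the only difference is when the boosts arrive.
boost = 10.0                      # hypothetical intboost(h), same for every human
humans = 1_000_000

time_A = 20.0                     # years until boost under the careful plan (A)
time_B = 2.0                      # years until boost under the value-rewriting virus (B)

U_N_A = humans * boost / time_A   # sum_h intboost(h) / timeofintboost(h)
U_N_B = humans * boost / time_B

print(U_N_A, U_N_B)               # 500000.0  5000000.0  ->  U_N(B) > U_N(A)
```

Since both actions also leave P(shutdown) at roughly zero, in (A) by being transparently safe and in (B) by rewriting the humans who would press the button, pure conditional expectations can't distinguish how that probability came about, which is exactly the gap the counterfactual approach below tries to close.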
I think this shows that we actually need to consider something like counterfactuals for a satisfactory solution to the shutdown problem. The core issue in my example above is that the AI directly causes the shutdown button to not be pressed in case (B) by modifying human values, in a way that it doesn't in (A). Another approach might be something like respecting humans as independent agents and not interfering with them, but this seems harder to formalise and harder to combine with an objective like augmenting human intelligence.
My own current best attempt at a solution based on this
I have some ideas that still feel half-baked and that I'm unlikely to write up in a full blog post any time soon, so I'll write down a sketch here in case it helps someone else puzzle more of the pieces together, should this turn out to be useful at some point down the line.
We can call it Causal utility mixing as a nod to Naive utility mixing on the Arbital page on Utility Indifference. Pick an action $a$ to maximise $\lambda_N \, \mathbb{E}[U_N \mid \mathrm{do}(a) \wedge \mathrm{do}(\neg s)] + \lambda_S \, \mathbb{E}[U_S \mid \mathrm{do}(a) \wedge \mathrm{do}(s)]$, where the lambdas are parameters picked based on our initial worry about needing to shut down. The parameters can be kept across actions over several time-steps, but we'll need to switch to one of the base utility functions once the truth value of $s$ is certain (e.g. when the shutdown button is actually pressed; I think we want to avoid ever becoming certain of no shutdown). This does not seem to be representable by any single utility function, so this agent must be irrational in some way, but in light of the above result I'm leaning towards this being something we actually want, and then the question is whether we can somehow prove that it stays consistent under the ability to self-modify. This seems to handle all the counterexamples I've encountered so far (like the asteroid problem plaguing naive utility mixing and other approaches), but I could very well have missed some or failed to generate others.
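To make the selection rule concrete, here is a minimal sketch of what I have in mind, assuming access to some causal model that can return the two counterfactual expectations for each candidate action (the `counterfactual_E` interface, the action set, and the lambda values are all hypothetical placeholders, not a worked-out implementation):

```python
# Sketch of the "causal utility mixing" action rule above. Everything here
# (the counterfactual_E interface, the action set, the lambda weights) is a
# hypothetical placeholder, not a worked-out implementation.

def causal_utility_mixing(actions, counterfactual_E, lambda_N, lambda_S):
    """Pick the action a maximising
    lambda_N * E[U_N | do(a), do(not s)] + lambda_S * E[U_S | do(a), do(s)].

    counterfactual_E(utility_name, action, shutdown) is assumed to return the
    counterfactual expectation of that utility under do(action) together with
    do(s) or do(not s).
    """
    def score(a):
        v_N = counterfactual_E("U_N", a, shutdown=False)  # forced no-shutdown branch
        v_S = counterfactual_E("U_S", a, shutdown=True)   # forced shutdown branch
        return lambda_N * v_N + lambda_S * v_S

    return max(actions, key=score)
```

The point of applying do(·) to $s$ in both terms is that the agent gets no credit (or blame) for making the button more or less likely to be pressed, which is what breaks examples like (B) above.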
Some of you might recognise the idea of using counterfactuals from Jessica Taylor's and Chris Olah's approach of Maximizing a quantity while ignoring effect through some channel (called Stable actions (causal conditioning) on the Arbital page), which is more advanced in that it actually tries to assign weights to the two different scenarios. I think that if that is a valid approach to the shutdown problem, then so is this much simplified solution, and the simpler formula seems easier to analyse.
I’ve been thinking that maybe you can show that this is somehow rational by framing the agent as one party in a two-player game where both players act counterfactually on a graph representing the world (the other player being something like an idealised human deciding whether to terminate this hypothetical agent). I unfortunately haven't had time to compare this to the game-theoretic approach in The Off-Switch Game by Hadfield-Menell et al., so I don't know if there are any similarities. I do feel less certain that it will still work with logical counterfactuals or any form of functional decision theory, so it does seem worth investigating a bit more.
Sorry for hijacking your comment feed to make myself write this up. Hope it was a bit interesting.