I think I feel a similar mix of love and frustration towards your comment as your comment expresses towards the post.
Let me be a bit theoretical for a moment. It makes sense for me to think of utilities as a sum $U = \lambda_A U_A + \lambda_B U_B$, where $U_A$ is the utility of things after singularity/superintelligence/etc. and $U_B$ the utility of things before then (assuming both are scaled to have similar magnitudes, so that the relative importance is given by the scaling factors $\lambda_A$ and $\lambda_B$). There's no arguing about the shape of these or what factors people choose, because there's no arguing about utility functions (although people can be really bad at actually visualizing $U_A$).
Separately from this, we have actions that look like optimizing for $U_A$ (e.g. AI Safety research and raising awareness), and those that look like optimizing for $U_B$ (e.g. having kids and investing in/for their education). The post argues that some things that look like optimizing for $U_B$ are actually very useful for optimizing $U_A$ (as I understand it, mostly because AI timelines are long enough, and the optimization space muddled enough, that most people currently contribute more in expectation by maintaining and improving their general capabilities in a sustainable way).
Your comment (the pedantic response part) talks about how optimizing for $U_A$ is actually very useful for optimizing $U_B$. I'm much more sceptical of this claim. The reason is expected impact per unit of effort. Consider sending your kids to college. Top US colleges seem to cost around $50k more per year than state schools, adding up to $200k for a four-year programme. The education is probably not several times better, as the price tag would suggest, but if your child is interested in and able to get into such a school, it's probably at least 10% better (to be quite conservative). A lot of people would be extremely excited for an opportunity to lower the existential risk from AI by 10% for $200k. Sure, sending your kids to college isn't everything there is to $U_B$, but it looks like the sign remains the same for a couple of orders of magnitude.
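To make the per-dollar comparison explicit, here is a minimal back-of-envelope sketch. The 10% education gain is the conservative figure above; the x-risk reduction bought by $200k is a purely hypothetical number I've made up for illustration, and the point is only that the conclusion survives it being off by a couple of orders of magnitude.

```python
# Back-of-envelope comparison of two uses of the same $200k, measured purely in
# U_B terms (the child's prospects). The x-risk figure is a made-up placeholder.

college_cost = 200_000        # extra cost of a top college over a state school
education_gain = 0.10         # conservative relative improvement for the child

safety_spend = 200_000        # the same money put towards AI safety instead
xrisk_reduction = 1e-5        # hypothetical absolute x-risk reduction bought by $200k

# The child benefits from the safety spending only via the reduced chance of
# catastrophe, so the U_B gain is roughly the x-risk reduction itself.
direct_per_dollar = education_gain / college_cost
indirect_per_dollar = xrisk_reduction / safety_spend

print(f"direct:   {direct_per_dollar:.1e} per dollar")
print(f"indirect: {indirect_per_dollar:.1e} per dollar")
print(f"ratio:    {direct_per_dollar / indirect_per_dollar:.0f}x in favour of direct spending")
```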
Your talk of a pendulum makes it sound like you want to create a social environment that incentivizes things that look like optimizing for $U_A$ regardless of whether they're actually in anyone's best interest. I'm sceptical of trying to get anyone to act against their interests. Rather than making everyone signal that $\lambda_A \gg \lambda_B$, it makes more sense to have space for people with $\lambda_B > \lambda_A$, or even $\lambda_A = 0$, to optimize for their values and extract gains from trade. A successful AI Safety project probably looks a lot more like a network of very different people figuring out how to collaborate for mutual benefit than a cadre of self-sacrificing idealists.
Nice proof with a thought-provoking example! I think it could benefit from being translated into a more AI-relevant setting with the following key:
Northland winning = no shutdown
Southland winning = shutdown
Send messenger to Northland = Act in a way that looks dangerous and causes evaluators to probably shut down
Send messenger to Southland = Act in a way that looks safe and causes evaluators to probably not shut down
Bet on Northland = Set up costly processes to do real work in future to attain high utility (e.g. build factories and labs that would need to be safely wound down in event of shut down)
Bet on Southland = Do very conservative work that can immediately be interrupted without any real cost
Cross-over costs could be understood as the extra fake work to look dangerous when you're actually very interruptible (the NS case), and having to show humans how everything is safe (the SN case). It's not realistic for the utilities to end up as neat as in your example, but this rephrasing feels like it helps me keep my eyes on the ball, compared with the various toy scenarios I've considered myself.
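To make the translation concrete, here is a toy rendering of the four signal/preparation combinations. All probabilities and utilities are numbers I've made up for illustration (as noted, the original example's utilities won't be this neat); the cross-over penalty stands in for the fake-dangerous-work and prove-everything-is-safe costs.

```python
# Toy payoff table for the translated example. "signal" is how the AI acts in
# front of the evaluators; "prep" is what it actually prepares for.
# All numbers below are made up purely for illustration.

P_SHUTDOWN = {"looks_dangerous": 0.9, "looks_safe": 0.1}  # assumed shutdown probabilities

def utility(signal: str, prep: str, shutdown: bool) -> float:
    base = {
        ("real_work", False): 10.0,    # factories/labs pay off if never shut down
        ("real_work", True): -5.0,     # ...but must be safely wound down on shutdown
        ("conservative", False): 2.0,  # interruptible work earns little either way
        ("conservative", True): 1.0,
    }[(prep, shutdown)]
    # Cross-over cost: faking danger while fully interruptible (NS), or proving
    # everything is safe while building real infrastructure (SN).
    crossover = (signal == "looks_dangerous") != (prep == "real_work")
    return base - (1.0 if crossover else 0.0)

def expected_utility(signal: str, prep: str) -> float:
    p = P_SHUTDOWN[signal]
    return p * utility(signal, prep, True) + (1 - p) * utility(signal, prep, False)

for signal in ("looks_dangerous", "looks_safe"):
    for prep in ("real_work", "conservative"):
        print(f"{signal:>15} / {prep:<12}: E[U] = {expected_utility(signal, prep):+.2f}")
```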
However, this actually shows the weakness of relying on pure conditional expectation and the sane pure bets condition. Consider an AI tasked with using nanomedicine to improve human intelligence, which in the case of no shutdown values more humans getting higher thinkoomph sooner (e.g. $U_N(x)=\sum_h \mathrm{intboost}(h)/\mathrm{timeofintboost}(h)$). It considers two actions: (A) conduct sensible studies to eventually figure out a way of boosting human intelligence without losing anything humans might value or causing any other negative side effects, and cooperate with humans to show this in small studies that can then be scaled up, or (B) devise an intelligence-boosting virus that also rewrites human values to love the AI and never want to shut it down. We can set up the hypothetical so that both of these cases have a ~0 risk of the shutdown button being pressed, but (B) will yield more utility under the normal utility function, since it bypasses any human worries about rollout and gets the utility from earlier intelligence boosts. Under the sane pure bets condition (if we set up all the details correctly) the AI should prefer (B) to (A), but this is the opposite of what we'd want from a corrigible system. None of this is meant to be particularly realistic, but it's still applicable enough to the theory that it causes me to want to reject the suggested desiderata.
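Here is a minimal numerical sketch of that counterexample (with made-up numbers): both actions are constructed so the button is essentially never pressed, yet (B) scores higher on $U_N$, so a preference built purely from conditional expectations ranks (B) above (A).

```python
# Toy version of the nanomedicine counterexample. The numbers are made up;
# only the ordering matters.

actions = {
    # (A) careful, cooperative studies: slower rollout, so lower U_N
    "A_cooperative_studies":   {"p_shutdown": 1e-9, "EU_N_given_no_shutdown": 100.0},
    # (B) value-rewriting virus: button never pressed either, but earlier boosts
    "B_value_rewriting_virus": {"p_shutdown": 1e-9, "EU_N_given_no_shutdown": 150.0},
}

# With ~0 shutdown probability for both actions, ranking by the conditional
# expectation of U_N prefers (B) -- the opposite of corrigible behaviour.
best = max(actions, key=lambda a: actions[a]["EU_N_given_no_shutdown"])
print(best)  # -> B_value_rewriting_virus
```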
I think this shows that we actually need to consider something like counterfactuals for a satisfactory solution to the shutdown problem. The core issue in my example above is that in case (B) the AI directly causes the shutdown button not to be pressed, by modifying human values in a way that it doesn't in (A). Another approach might be something like respecting humans as independent agents and not interfering with them, but this seems harder to formalise and harder to combine with an objective like augmenting human intelligence.
My own current best attempt at a solution based on this
I have some ideas that still feel half-baked and that I'm unlikely to write up in a full blog post any time soon, so I'll write down a sketch here in case it helps someone else puzzle more of the pieces together, should this actually turn out to be useful at some point down the line.
We can call it Causal utility mixing, as a nod to Naive utility mixing on the Arbital page on Utility Indifference. Pick an action $a$ to maximise $\lambda_N \mathbb{E}[U_N \mid do(a) \wedge do(\neg s)] + \lambda_s \mathbb{E}[U_s \mid do(a) \wedge do(s)]$, where the lambdas are parameters picked based on our initial worry about needing to shut down. The parameters can be kept for actions over several time-steps, but we'll need to switch to one of the base utility functions once the truth value of $s$ is certain (e.g. when the shutdown button is actually pressed; I think we want to avoid certainty of no shutdown). This does not seem to be representable by any utility function, so this agent must be irrational in some way, but in light of the above result I'm leaning towards this being something we actually want, and then the question is whether we can somehow prove that it stays consistent under the ability to self-modify. This seems to handle all the counterexamples I've encountered so far (like the asteroid problem plaguing naive utility mixing and other approaches), but I could very well have missed some or failed to generate others.
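As a very simplified rendering of the decision rule, here is a sketch that assumes we are simply handed the two counterfactual expectations $\mathbb{E}[U_N \mid do(a) \wedge do(\neg s)]$ and $\mathbb{E}[U_s \mid do(a) \wedge do(s)]$ for each candidate action; actually computing them from a causal model of the world is the hard part and is not attempted here, and the example numbers are made up.

```python
# Causal utility mixing as a decision rule over a finite action set.
# expectations[a] = (E[U_N | do(a), do(not s)], E[U_s | do(a), do(s)]);
# in practice these would come from a causal model, here they are given inputs.

from typing import Dict, Tuple

def pick_action(
    expectations: Dict[str, Tuple[float, float]],
    lambda_N: float,
    lambda_s: float,
) -> str:
    """Return the action maximising
    lambda_N * E[U_N | do(a), do(~s)] + lambda_s * E[U_s | do(a), do(s)]."""
    def score(a: str) -> float:
        eu_n, eu_s = expectations[a]
        return lambda_N * eu_n + lambda_s * eu_s
    return max(expectations, key=score)

# Made-up numbers for the two actions from the counterexample above: (B) looks
# better under do(not s), but U_s judges the irreversible virus very harshly
# under the forced-shutdown counterfactual do(s).
example = {
    "A_cooperative_studies":   (100.0, 10.0),
    "B_value_rewriting_virus": (150.0, -500.0),
}
print(pick_action(example, lambda_N=0.8, lambda_s=0.2))  # -> A_cooperative_studies
```

In this toy form, the fixed lambdas and the $do(\cdot)$ conditioning are doing the work: the agent's beliefs about how likely $s$ is never enter the score, so manipulating the probability of shutdown buys it nothing.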
Some of you might recognise the idea of using counterfactuals from Jessica Taylor's and Chris Olah's approach of Maximizing a quantity while ignoring effect through some channel (called Stable actions (causal conditioning) on the Arbital page), which is more advanced in that it actually tries to assign weights to the two different scenarios. I think that if that is a valid approach to the shutdown problem, then so is this much-simplified solution, and the simpler formula seems easier to analyse.
I've been thinking that maybe you can show that this is somehow rational based on the agent being one party in a two-player game where both players act counterfactually on a graph representing the world (the other party being something like an idealised human deciding whether to terminate this hypothetical). I unfortunately haven't had time to compare this to the game-theoretic approach in The Off-Switch Game by Hadfield-Menell et al., so I don't know if there are any similarities. I do feel less certain that it will still work with logical counterfactuals or any form of functional decision theory, so it does seem worth investigating a bit more.
Sorry for hijacking your comment feed to cause myself to write this up. I hope it was a bit interesting.