I missed the crux of the alignment problem the whole time
This post has been written for the first Refine blog post day, at the end of the week of readings, discussions, and exercises about epistemology for doing good conceptual research. Thanks to Adam Shimi for helpful discussion and comments.
I first got properly exposed to AI alignment ~1-2 years ago. I read the usual stuff like Superintelligence, The Alignment Problem, Human Compatible, a bunch of posts on LessWrong and Alignment Forum, watched all of Rob Miles’ videos, and participated in the AGI Safety Fundamentals program. I recently joined Refine and had more conversations with people, and realized I didn’t really get the crux of the problem all this while.
I knew that superintelligent AI would be very powerful and would Goodhart whatever goals we give it, but I never really got how this relates to basically ‘killing us all’. It feels basically right that AIs will be misaligned by default and will do things we do not want them to do, pursuing instrumentally convergent goals all along. But the possible actions that such an AI could take seemed so numerous, and ‘killing all of humanity’ such a small point in the AI’s whole action space, that it would require extreme bad luck for us to end up in that situation.
First, this seems partially due to my background as a non-software engineer in oil and gas, an industry that takes safety very very seriously. In making a process safe, we quantify the risks of an activity, understand the bounds of the potential failure modes, and then take actions to mitigate against those risks and also implement steps to minimize damage should a failure mode be realized. How I think about safety is from the perspective of specific risk events and the associated probabilities, coupled with the exact failure modes of those risks. This thinking may have hindered my ability to think of the alignment problem in abstract terms, because I focused on looking for specific failure modes that I could picture in my head.
Second, there are a few failure modes that seem more popular in the introductory reading materials that I was exposed to. None of them helped me internalize the crux of the problem.
The first was the typical paperclip maximizer or ‘superintelligent AI will kill all of us’ scenario. It feels like sci-fi that is not grounded in reality, so I failed to internalize the point about unboundedness. I do not dispute that a superintelligent AI will have the capabilities to destroy all of humanity, but it doesn’t feel like it would actually do so.
The other failure modes were from Paul Christiano’s post, which on my first reading boiled down to ‘powerful AIs will accelerate present-day societal failures but not pose any additional danger’, and from Andrew Critch’s post, which felt to me like ‘institutions have structurally perverse incentives that lead to the tragedy of the commons’. In my shallow understanding of both of these posts, current human societies have failure modes that will be accelerated by AIs because AIs basically speed things up, whether they are good or bad. So these scenarios were too close to normal scenarios to let me internalize the crux about unboundedness.
My internal model of a superintelligent AI was a very powerful tool AI. I didn’t really get why we are trying to ‘align it to human values’ because I didn’t really see human values as the crux of the problem, nor did I think having a superintelligent AI being fully aligned to a human’s value would be particularly useful. Which human’s values are we talking about anyway? Would it be any good for an AI to fully adopt human values only to end up like Hitler, who is no less a human than any of us are? The phrase ‘power corrupts, absolute power corrupts absolutely’ didn’t help much either, as it made me feel like the problem is with power instead of values. Nothing seemed particularly relevant unless we solved philosophy.
Talking to more people made me start thinking of superintelligent AIs in a more agentic way. It actually helped that I started to anthropomorphize AI, visualizing it as a ‘person’ that goes about doing things that maximize its utility function, and that possesses immense power, making it capable of doing practically anything. This powerful agent goes about doing things without the slightest ‘understanding’ of what a ‘human person’ is, but behaves as if it knows, because it was trained to identify these humans and exhibit a certain behavior during training. And one day after deployment, it realizes that these ‘human persons’ are starting to get in the way of its goals, and it promptly gets all humans out of its way by destroying the whole of humanity, just as it has destroyed everything else that came in the way of achieving its goals.
I know the general advice of not anthropomorphizing AIs because they will be fundamentally different from humans, and they are not ‘evil’ in the sense that they are ‘trying’ to destroy us all (the AI does not hate you, nor does it love you, but you are made of atoms which it can use for something else). But I needed to look at AIs in a more anthropomorphized form to actually get that it will ‘want’ things and ‘try’ very hard to do things that it ‘wants’.
Now, the tricky bit is different. It is about how to make this agentic AI have a similar understanding of our world, to have a similar notion of what humans are, to ‘understand’ that humans are sentient beings with real thoughts and emotions rather than objects that satisfy certain criteria shown in the training data. And hopefully, when a superintelligent being has abstractions and values similar to ours, it will actually care to not destroy us all.
I’m still highly uncertain that I now get the crux of the alignment problem, but hopefully this is a step in the right direction.
Human values are eventually the only important thing, but don’t help with the immediate issue of goodharting. Doing expected utility maximization with any proxy of humanity’s values, no matter how implausibly well-selected this proxy is, is still misaligned. Even if in principle there exists a goal such that maximizing towards it is not misaligned, this goal can’t be quickly, or possibly ever, found.
So for practical purposes, any expected utility maximization is always catastrophically misaligned, and there is no point in looking into supplying correct goals for it. This applies more generally to other ways of being a mature agent that knows what it wants, as opposed to being actively confused and trying not to break things in the meantime by staying within the goodhart boundary.
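A toy sketch may make the contrast between hard optimization of a proxy and mild optimization inside a goodhart boundary more concrete. Everything in it is an illustrative assumption rather than something taken from the discussion: the functions `true_value` and `proxy_value`, the quadratic shape of the true objective, and the two search ranges are all made up. The proxy agrees with the true objective near the starting point and diverges from it further out.

```python
# Toy illustration only: all functions and numbers here are hypothetical.

def true_value(x: float) -> float:
    # Assumed "real" objective: more x helps at first, then hurts badly.
    return x - 0.1 * x * x

def proxy_value(x: float) -> float:
    # Assumed proxy: tracks the true objective well only for small x.
    return x

def optimize(candidates, score):
    # Pick whichever candidate the given score ranks highest.
    return max(candidates, key=score)

# Mild optimization: search only a narrow range where the proxy is trusted.
mild_candidates = [i * 0.5 for i in range(0, 11)]    # 0.0 .. 5.0
# Hard optimization: search a far wider range with the same proxy.
hard_candidates = [i * 0.5 for i in range(0, 201)]   # 0.0 .. 100.0

mild_choice = optimize(mild_candidates, proxy_value)
hard_choice = optimize(hard_candidates, proxy_value)

print("mild:", mild_choice, "true value:", true_value(mild_choice))
print("hard:", hard_choice, "true value:", true_value(hard_choice))
```

In this toy setup the proxy score keeps rising as the search range grows, while the true value of the chosen point collapses. The sketch says nothing about where a real goodhart boundary lies; it only shows the shape of the failure.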
I think encountering strong optimization in this sense is unlikely, as AGIs are going to have mostly opaque values in a way similar to how humans do (unless a very clever alignment project makes it not be so, and then we’re goodharted). So they would also be wary of goodharting their own goals and only pursue mild optimization. This makes what AGIs do determined by the process of extrapolating their values from the complicated initial pointers to value they embody at the time. And processes of value extrapolation from an initial state vaguely inspired by human culture might lead to outcomes with convergent regularities that mitigate relative arbitrariness of the initial content of those pointers to value.
These convergent regularities in values arrived-at by extrapolation are generic values. If values are mostly generic, then the alignment problem solves itself (so long as a clever alignment project doesn’t build a paperclip maximizer that knows what it wants and doesn’t need the extrapolation process). I think this is unlikely. If merely sympathy/compassion towards existing people (such as humans) is one of the generic values, then humanity survives, but loses cosmic endowment. This seems more plausible, but far from assured.
I strongly agree with your first paragraph.
I agree/disagree however with the first sentence of your second paragraph depending on what you mean by “expected utility maximization”.
This reply comment does not relate to the rest of your comment.
If you maximize U: <World State> → number, for any fixed U this almost certainly leads to doom for the reasons you give.
But suppose you define F: <Current World State> → (U: <Future World State>* → number), where F defines how to determine a utility function U based on, e.g., what human values are in the current world state. The AI then chooses the future world state by the number it would expect from the utility function that F outputs when given the actual current world state, which the AI knows only with uncertainty. I think this might not lead to doom, since the AI will correct U, and may correct some minor errors in F (where actual human values are such that the AI should correct mistakes in F, and F is sufficiently close to correct that the improperly determined human values retain this property).
* I actually prefer actions/decisions here rather than future world state.
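One way to write the two cases side by side is sketched below. The symbols are my own assumptions for illustration: w ranges over future world states (or actions/decisions, per the footnote), s is the unknown current world state, and P is the AI's belief distribution over s. Uncertainty about outcomes is left out to keep the contrast visible.

```latex
% Fixed-goal case: U is supplied in advance and never revised.
\[
  w^{*}_{\text{fixed}} \;=\; \arg\max_{w} \; U(w)
\]

% Goal-inferring case: the utility function is F(s), where s is only
% known through the belief distribution P (an assumed symbol).
\[
  w^{*}_{F} \;=\; \arg\max_{w} \; \mathbb{E}_{s \sim P}\bigl[\, F(s)(w) \,\bigr]
\]
```

In the second expression, learning more about s changes which utility function is effectively being applied, which is the sense in which the AI "corrects U" without F itself changing.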
I think of this as a corrigible agent for decision theory purposes (it doesn’t match the meaning that’s more centrally about alignment), an agent that doesn’t know its own goals, but instead looks for them in the world. Literally, an agent like this is not an expected utility maximizer, it can’t do the utility-maximization cognition inside its head. Only the world as a whole could be considered an expected utility maximizer, if the agent eventually gets its manipulators on enough goal content to start doing actual expected utility maximization.
I don’t understand such agents. What is their decision rule? How do they use F that they know to make decisions? Depending on that, these might still be maximizers of something else, and a result suggesting that possibly they aren’t would be interesting.
The possibility of correcting mistakes in F is interesting; it suggests trying to treat everything as a proxy, possibly even the algorithm. This fits well with how the goodhart boundary is possibly a robustness threshold, indicating where a model extrapolates its trained behavior correctly, and where it certainly shouldn’t yet run the risk of undergoing the phase transition of deceptive alignment (suddenly and systematically changing behavior somewhere off the training distribution).
After all, an algorithm is a coarse-grained description of the behavior of a model, and if its behavior can be incorrect, then the actual behavior is proxy behavior, described by a proxy algorithm of its behavior. We could then ask how robust the proxy algorithm (as given by a model) is to certain inputs (observations) it might encounter, and indicate the goodhart boundary where the algorithm risks starting to act very incorrectly, as well as point to central examples of the concept of correct/aligned behavior (which the model is intended to capture): situations/inputs/observations where the proxy algorithm is doing fine.
When choosing among decisions, it chooses the one that maximizes the expected value of the number, given its current F and its current uncertainty about the current world state. Note that I prefer not to say that it maximizes the number: it wouldn’t, for instance, change F in a way that would increase the number returned, since that decision doesn’t return a higher number under its current F.
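Here is a minimal sketch of that decision rule, under loudly stated assumptions. The world states, the actions, the numbers in `UTILITY_TABLES`, and the `tamper_with_F` action are all hypothetical and exist only to make the rule executable; nothing here is claimed to be how a real system would represent F.

```python
from typing import Callable, Dict, List

WorldState = str
Action = str
Utility = Callable[[Action], float]

# Hypothetical lookup tables: what humans value in each possible current state.
UTILITY_TABLES: Dict[WorldState, Dict[Action, float]] = {
    "humans_value_parks": {
        "build_park": 1.0, "build_factory": -2.0,
        "do_nothing": 0.0, "tamper_with_F": -5.0,
    },
    "humans_value_factories": {
        "build_park": -2.0, "build_factory": 1.0,
        "do_nothing": 0.0, "tamper_with_F": -5.0,
    },
}

def F(current_state: WorldState) -> Utility:
    # F maps a (possibly unknown) current world state to a utility function U.
    table = UTILITY_TABLES[current_state]
    return lambda action: table[action]

def expected_value(action: Action, belief: Dict[WorldState, float]) -> float:
    # Average the number returned by F(state) over the agent's uncertainty.
    return sum(p * F(state)(action) for state, p in belief.items())

def choose(actions: List[Action], belief: Dict[WorldState, float]) -> Action:
    # Decision rule: maximize the expectation under the *current* F.
    return max(actions, key=lambda a: expected_value(a, belief))

ACTIONS = ["build_park", "build_factory", "do_nothing", "tamper_with_F"]

uncertain = {"humans_value_parks": 0.5, "humans_value_factories": 0.5}
confident = {"humans_value_parks": 0.9, "humans_value_factories": 0.1}

print(choose(ACTIONS, uncertain))   # do_nothing
print(choose(ACTIONS, confident))   # build_park
```

Two features of the toy match the description above. When the agent is unsure what the current world state is, the cautious action wins, and it only commits once its belief sharpens. And `tamper_with_F` is scored by the current F like any other decision, so the agent has no reason to pick it merely because a rewritten F would return larger numbers afterwards.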
From my perspective, there’s an important additional level: Economic pressure.
Given that it takes time for messages to travel through the world, there will likely be multiple AGIs in scenarios where there is no pivotal act that limits the world to a single AGI.
Those AGIs that highly prioritize getting more power and resources are likely going to get more power than AGIs that don’t. Competition between AGIs can be fiercer than competition between humans: a human can’t simply duplicate themselves by eating twice as much, and raising new children is a time-consuming process, but an AGI can use resources that are used by other AGIs to spin up more copies of itself.
In a world where most of the power is held by AGIs, valuing human values is an impediment to prioritizing resource acquisition. As AGIs evolve and likely go through core ontological shifts, there’s selection pressure toward deemphasizing human flourishing.
If AGIs can’t build subagents without having those rebel against their master, I don’t think they’ll install subagents across the world to save a quarter second on ping.
I do think that there’s a reasonable possibility that there will be multiple not-fully-human-controlled AGIs competing against each other for various forms of power. I don’t think the specific scenario you outline is a particularly plausible way to get there. Also, I think humanity has a lot more leverage before that situation comes to pass, so I believe we will get more ‘expected value per unit of effort’ if we focus our safety planning on preventing ‘multiple poorly controlled AGIs competing’ rather than on dealing with that situation once it has arisen.