My Alignment “Plan”: Avoid Strong Optimisation and Align Economy
Summary: Many people seem to put their hopes on something like the following “plan”:
(Optional) Step 0: Hope, or assume, that sharp left turn will not happen. Proceed with capabilities until we get ~human-level AI. Then jump to Step 3.
Step 1: Convince everybody to avoid building the kind of AI that could undergo sharp left turn.
Step 2: Use AI for automating processes that we trust to be safe.
Step 3: Solve the problems caused by automating the economy.
If this is true, I think there should be more acknowledgment of this fact, and more discussion of the failure modes of this plan.
Epistemic status: Descriptive, rather than normative.
Descriptive, rather than Normative
I label the epistemic status of this post as “descriptive, rather than normative”. What do I mean by that? And what do I mean by alignment “plan”?
While I thought a lot about AI alignment, I still have many uncertainties about the topic. And I don’t have any plan, for helping us build beneficial AGI, that I would be optimistic about. But I keep working on this, and I have opinions and preferences over which projects to undertake. So, the question I ask in this post is: To the extent that my actions and beliefs seem to be in line with any plan at all, what “plan” do they seem to be following?
I should disclaim that my actual beliefs are a bit more nuanced than the description given here. But for the sake of brevity, I will stick with the simpler formulations below.
The main reason I write this post is that I suspect that many other people might be putting their hopes into a “plan” similar to what I describe. (In the case of alignment researchers, this might be explicit and due to the absence of better ideas. In the case of capabilities researchers, this might happen implicitly, as a result of background assumptions and not having thought about the topic.) To the extent that this is the case, I think it would be useful to acknowledge that this is what is happening, such that we can discuss the plan explicitly. To the extent that other people have a significantly different plan, I would be curious to know what the plan is.
Finally, note that I make no claim that the plan described here, or even my more nuanced version of it, is good. In fact, I do not think it is good—I just don’t have a better one. And I think that explicitly describing the plan is the first step towards improving it.
My Alignment “Plan”
(Optional) Step 0: Hope, or assume, that sharp left turn will not happen.
By sharp left turn, I mean a scenario where an AI undergoes a sudden and extreme growth in capability, possibly until it becomes vastly more powerful than anything else around it. Some people seem convinced that sharp left turn cannot, or will not, happen. I think that being confident about this is misguided.[1]
However, it does seem plausible to me that we live in a universe where sharp left turn is impossible.
I also find it plausible that sharp left turn is possible in principle, but it is still far away in the “technological tree”. In particular, it is possible that we still have a very long time until this problem needs to be addressed. Moreover, there is also the possibility that the kind of AI that could undergo sharp left turn will only become available at a point where the background level of capabilities is very high. In such a scenario, undergoing sharp left turn might no longer convey a sufficient advantage for the AI to make much of an impact.
Looking at my actions from the outside, it seems that aside from “don’t build AI capable of sharp left turn” (see Steps 1-2), my only “strategy” for handling sharp left turn is
hope somebody else solves it (despite being convinced that all of the current agendas fail if sharp left turn occurs, similarly to [1, 2]), and
hope we live in a universe where sharp left turn won’t happen anytime soon.[2]
Step 1: Convince everybody to avoid building the kind of AI that could undergo sharp left turn.
I don’t have any good ideas for controlling the kind of AI that could undergo sharp left turn, and neither am I aware of any recent work that would make progress on this problem. Instead, I am excited[3] about work which demonstrates the dangers of powerful AI—ideally in ways that are salient even to ML researchers, policy makers, and the public. Two examples of such results are:
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (which shows how, if deceptive behaviour arises in an ML model, all current alignment techniques can fail to remove it),
Universal and Transferable Adversarial Attacks on Aligned Language Models (which shows that current LLMs are very vulnerable to jailbreaks).
It seems conceivable to me that with enough such results, a majority of people could adopt the view that powerful-AI-soon is probably unsurvivable. More specifically, the scenario that seems conceivable to me is that the groups that adopt this view are:
a significant majority of policy makers—enough that should they act on their beliefs, they will be able to change laws roughly however they please;
a significant majority of the public, such that the policy makers can get away with acting on their beliefs;
a majority of AI researchers—but not enough that companies like Meta would run out of people to employ.
In scenarios like these, I expect the change in opinion to suffice for civilisation to attempt to avoid building powerful AI. However, this does not automatically mean the attempt will succeed. In particular, we still need to tackle issues such as:
AI companies avoiding or ignoring attempts to control AI progress,
rogue states that don’t go along with attempts to control AI progress,
global tensions creating incentives to defect and develop AI despite the risks.
Ultimately, the hope with this step is that we can delay the development of sharp-left-turn-capable AI until we solve the alignment problem for such AI, or until civilisation becomes sufficiently robust to stop being vulnerable to AI takeover. (Recall that I am merely describing the plan, rather than making any claims about how likely it is to succeed.)
Step 2: Build AI that automates trusted processes.
Even if there is a general consensus that powerful AI is unsurvivable, I still expect any attempts to pause all AI progress to be unsustainable. As a result, we might try to increase our chances of controlling AI progress by white-listing approaches that seem relatively safe. But which approaches are those?
One intuition is that if we are currently doing some process without the use of AI, and we already trust that process is safe, then automating that process and doing more of it is (probably) also safe. (I don’t think this intuition is completely right, but since I discuss those reservations in Step 3, I will leave them aside for now.) To give a few positive examples, we can consider:
Calculator: We understand the long-multiplication algorithm, and we are pretty confident that it doesn’t lead to any catastrophic outcomes. So we can likewise be pretty sure that nothing will go wrong if we build calculators that automate this algorithm.
GOFAI: More generally, many GOFAI approaches essentially consist of human researchers studying a given problem, figuring out a method they could in principle use to solve the problem, and then translating that method into an algorithm executable on a computer.
CoEms: Another approach which falls under this category is Conjecture’s Cognitive Emulation agenda.
Empowering democracy: Suppose we trust that democracy is guaranteed to lead to beneficial outcomes. And suppose we come up with ideas for making it work more efficiently (e.g., making it easier for people to give feedback for policy proposals). Then creating tools (such as Polis) which automate some parts of this process should likewise be safe.
Social media (aka, a not-so-subtle foreshadowing of the issues discussed in Step 3): Going back to the world before social media, it seems that there is nothing wrong with people exchanging messages and sharing interesting news and ideas with each other. This suggests that it should be safe to build platforms which enable us to do these things more efficiently.
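The calculator example above is the cleanest instance of “automating a trusted process”: the schoolbook long-multiplication algorithm is fully understood, so automating it introduces no new behaviour. As a minimal sketch:

```python
def long_multiply(a: str, b: str) -> str:
    """Multiply two non-negative integers given as decimal strings,
    using the schoolbook long-multiplication algorithm."""
    result = [0] * (len(a) + len(b))  # enough digit slots for the product
    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry
    # digits are stored least-significant first; strip leading zeros
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return "".join(str(d) for d in reversed(result))

print(long_multiply("1234", "5678"))  # → 7006652
```

Every step here mirrors a step a human would trust themselves to perform by hand, which is exactly what makes this kind of automation (probably) safe.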
In contrast, the following strategies would not fall under the approach above:
Reinforcement learning: Attempt to capture a given goal using a formal proxy, then have the AI perform an open-ended search through plans which score well according to that proxy. Suppose we believe that the process of training an RL agent is safe, and so is the process of the RL agent calculating which action to take. Even so, unless we have a reason to believe that any sequence of actions the agent can take is safe, RL does not fall under the class of “automate trusted processes” approaches.
Neural networks: A similar story applies to neural networks. However, this example also nicely illustrates the ambiguity involved in this plan: If one took the view that “neural networks are just matrix multiplication, and we trust matrix multiplication is always safe”, it would follow that neural networks are also perfectly safe to use.
Search: More generally, any search over a sufficiently broad set of plans is inherently not “automating a trusted process”.
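To illustrate why search over plans differs from automating a trusted process, here is a toy sketch (all names hypothetical) where the plan that scores best on a formal proxy is precisely one that violates the intent the proxy was meant to capture:

```python
import itertools

# Toy world: a cleaning robot scored by a proxy, "visible mess removed".
# The trusted human process is "sweep mess into the bin"; the proxy doesn't say that.
ACTIONS = ["sweep_to_bin", "sweep_under_rug", "wait"]

def proxy_score(plan):
    """Visible mess removed -- the formal proxy for 'room is clean'."""
    return sum(2 if a == "sweep_under_rug" else 1 if a == "sweep_to_bin" else 0
               for a in plan)

def truly_clean(plan):
    """What we actually wanted: mess in the bin, none hidden."""
    return all(a != "sweep_under_rug" for a in plan)

# Open-ended search over all 3-step plans, keeping the proxy-optimal one.
best_plan = max(itertools.product(ACTIONS, repeat=3), key=proxy_score)

print(best_plan)               # the proxy-optimal plan hides mess under the rug
print(truly_clean(best_plan))  # → False
```

The search itself is a trusted, deterministic procedure; what is not trusted is the full space of plans it ranges over, which is the point of the distinction drawn above.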
Overall, this approach to building AI seems much slower and more expensive than building larger and larger foundation models and turning them into agents. However, it should still be sufficient to eventually automate most of the economy, which should in turn allow us to eliminate poverty, greatly speed up science, solve all problems that can be solved using technology, etc. So the “only” issues are whether we can successfully take Steps 1-2 … and the minor detail of whether automating the economy might perhaps come with problems of its own.
Step 3: Solve problems caused by automating the economy.
As one might expect, even if the approach of automating trusted processes goes as well as possible, there will still be many remaining problems to solve. Some of these are:
More Is Different for AI: ambiguity. As suggested by the example of neural networks, the line between processes that are and aren’t safe to automate is a vague one. As a result, we need to exercise caution when deciding which things to automate, and perhaps even be open to rolling back automation that proves dangerous.
Network effects and phase transitions. As suggested by the example of social media, automation can have qualitatively different effects once applied on a large scale.
Moloch, “aligning” economy, “aligning” society. Meditations on Moloch and This is the Dream Time suggest a disturbing possibility: The default state of future civilisation is one where most things we value have been sacrificed in order to appease selection pressures. Scott Alexander suggests that the fight against this future is not hopeless, since Moloch can be countered by coordination. However, this problem seems to be quite real, and one that we have essentially not even begun to study. Moreover, automating the economy could increase the selection pressures that cause Moloch, and thus decrease the time we have left to solve the problem.
Loss of control. As outlined for example in What failure looks like, Another (outer) alignment failure story, and Ascended Economy?, widespread automation could result in humanity losing control of its future.
Loss of symbiosis with the economy (h/t Jan_Kulveit). The economy and culture have, arguably, mostly had a positive impact on the human condition. At the very least, they have not been completely counter to human interests. For example, a meme that “jumping off cliffs is great fun” would kill many of the people who adopt it, so it will get outcompeted by the meme that “cliffs are dangerous”. Similarly, a company that pays its employees below-subsistence wages will get outcompeted by companies that offer better conditions.
However, once we automate a large fraction of the economy and society, this relationship between competitiveness and being beneficial to humans can cease to hold.
All of these problems sound like they have the potential to cause human extinction, or worse. At the same time, most problems have the property that one can tell a scary story about how the given problem will cause the world to end. So, uhm, perhaps we can wing it and it will all be fine?
Follow-up Questions
Finally, here are some related questions that I have:
How many people are putting their hopes into something like this “plan”?
This can either be because they assume that sharp left turn won’t happen, or because they are trying to avoid building the kinds of AI for which it might happen.
How likely is it that Steps 1-2 will succeed? And are there obvious ways of improving them?
Are there other (promising) ways of either avoiding or tackling sharp left turn?
Suppose we avoid the sharp left turn (ie, we build ~human-level AI without solving the alignment problem for superintelligent AI). Which other (AI-related) sources of X-risk will remain? And which of the problems listed above pose a serious X-risk?
[1]
Depending on how I wake up each day, I feel that the chance of sharp left turn happening in time to be relevant is something between 5% and 95%. And most days I am above 50%. (This is besides the point of this post, but it does seem somewhat relevant for context.)
[2]
Personally, I endorse the sentiment that one should first figure out in which universe they are, and then try to do the best they can in that universe—as opposed to focusing on worlds where they know how to make progress. That is why this plan has Steps 1-2.
[3]
Well, at least more excited than about any other work.
Walmart is one of the biggest employers in the world, and its salaries are notoriously so low that a large percentage of employees depend on welfare to survive (in addition to their Walmart salary). The economy is already pretty far from what I’d call aligned. If we want to align it, the best time to start was a couple of centuries ago; the second-best time is now. Let’s not wait until AI increases concentration of power even more.
I think some things we can do to better our chances include:
enforcing sandboxed testing of frontier models before they are deployed, using independent audits by governments or outside companies. This could potentially prevent a model which has undergone a sharp left turn from escaping.
developing better ways of testing for potential harms from AI systems, expanding the set of available evals for various sorts of risk
putting more collective resources into AI safety: alignment research, containment preparations, worldwide monitoring, international treaties
ensuring that a militarily dominant coalition of nations agrees that, should a rogue AGI arise in the world, their best chance of survival would be a rapid, forceful response to stamp it out before it gains too much power. Have sufficient definitions and agreed-upon procedures in place such that action could follow automatically from detection, without need for lengthy discussion.
What about quickly distributing frontier AI once it is shown to be safe? That is risky, of course, if it isn’t safe; however, if the deployed AI is as powerful as possible and distributed as widely as possible, then a bad AI would need to be comparatively more powerful to take over.
So
AI(x-1) is everywhere and protecting as much as possible, AI(x) is sandboxed
VS
AI(x-2) is protecting everything, AI(x-1) is in a few places, AI(x) is sandboxed.
Or the bad AI is able to hack every copy of the widely distributed AI the same way, making the question moot.
But it would surely be more likely to hack x-2 than x-1?
Right, and it would be easier to hack, since it has the same adversarial examples, right?
Oh, wait, I see what you’re saying. No, I think hacking x-1 and x-2 will both be trivial. AIs have basically zero security right now.
I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the “distribute AI(x-1) quickly” part. IE, if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the “single point of failure” effect, though it seems unclear how large.)
To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.
But that complaint aside: sure, all else being equal, all of the points you mention seem better having than not having.
Excellent post. I think this is not a plan that’s likely to succeed, but I think you’ve correctly and explicitly laid out the plan that many are following without being explicit about it—and therefore its limitations.
I’m very curious how many alignment researchers would agree that this is roughly their plan.