The default outcome for aligned AGI still looks pretty bad
Most of the AI-related discussions I’ve read from the LW/EA community have rightly focused on the challenges and strategies for aligning AI with the intentions of its creator. After all, if that problem remains unsolved, humanity will almost certainly go extinct.
However, something I rarely see discussed is what happens if we manage to solve the narrower problem of aligning an AI with the desires of its creator, but fail to solve the wider problem of aligning an AI with the desires of humans as a whole. I think the default outcome if we solve the narrow alignment problem still looks pretty terrible for humans.
Much of the cutting-edge work on LLMs and other powerful models is being conducted in for-profit corporations. As we see clearly in existing companies, maximizing profits within such an entity often produces behavior that is highly misaligned with the interests of the general public. Social media companies, for example, monetize a large fraction of human attention for stunningly low value, often at the expense of users’ social relationships, mental health, and economic productivity.
OpenAI at least operates under a capped-profit model, where investors can earn no more than 100x their initial investment. This is good, since it at least leaves open the possibility that, if they create AGI, the profit model will not automatically result in a gigantic concentration of power. They have also hinted in their posts that future fundraising rounds will be capped at some multiple lower than 100x.
However, they still haven’t publicly specified how the benefits of AGI beyond this capped-profit threshold will be distributed. If they actually succeed in their mission, this distribution method will become a huge, huge deal: I would fully expect nations to consider war (insofar as war could accomplish their goals) if they felt they were given an unfair deal in the distribution of AGI-created profits.
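To make the arithmetic of the cap concrete, here is a toy sketch in Python (my own illustration with made-up numbers; the real agreement’s terms, and the hinted-at lower caps, are more complicated and not fully public):

```python
# Toy illustration of a capped-profit split. This is my own sketch, not OpenAI's
# actual mechanism, whose terms are not fully public.

def split_returns(investment: float, gross_return: float, cap_multiple: float = 100.0):
    """Split gross returns into the investor's capped share and the residual
    that would flow to whatever distribution scheme sits above the cap."""
    cap = investment * cap_multiple
    investor_share = min(gross_return, cap)
    residual = max(gross_return - cap, 0.0)
    return investor_share, residual

# Example: a hypothetical $10M investment that somehow returns $5B.
investor_share, residual = split_returns(10e6, 5e9)
print(f"investor keeps: ${investor_share:,.0f}")  # $1,000,000,000 (the 100x cap)
print(f"above the cap:  ${residual:,.0f}")        # $4,000,000,000 -- the part whose
                                                  # distribution is unspecified
```

The point of the example is just that, at AGI-level returns, nearly all of the value ends up in the “above the cap” bucket, which is exactly the part with no public distribution plan.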
Anthropic is registered as a public benefit corporation, which according to this explainer means:
Unlike standard corporations, where the Board generally must consider maximizing shareholder value as its prime directive, members of the Board of a PBC must also consider both the best interests of those materially affected by the company’s conduct, and the specific public benefit outlined in the company’s charter.
Unfortunately, Anthropic doesn’t seem to have made its charter public, so there’s no easy way for me to even guess how well its non-profit-maximizing mission aligns with the general interests of humanity, other than to hope it is somewhat similar to the language in the four paragraphs on their company values page.
Besides these two, every other entity that is a serious contender to create AGI is either a pure profit-maximizing enterprise or controlled by the CCP.
Even without the profit incentive, things still look kind of bleak
Imagine for a moment that somehow we make it to a world in which narrowly aligned AI becomes commonplace, and either non-corporate actors can exert enough influence to prevent AGI from purely benefitting stockholders, or the first group to create AGI is able to stick to the tenets of its public benefit charter. Now imagine that during the rollout of AGI access to the general public, you are one of the few lucky enough to have early access to the most powerful models. What’s the first thing you would use it for?
Different people will have different answers for this depending on their circumstances, but there are a few self-interested actions that seem quite obvious:
Make oneself smarter (possibly through brain uploading, which seems within reach for an AGI)
Make oneself immortal (hopefully with some caveats to prevent infinite torture scenarios)
Such actions seem very likely to lead to enormous power concentration in the hands of the few with privileged access to powerful early models. It seems entirely plausible that such a process could lead to a runaway feedback loop where those who engaged in such a strategy first could use that increased intelligence to accrue more and more resources (most importantly compute), until that person becomes more or less equivalent to a superintelligence.
The best case outcome in such a scenario would be a benevolent dictatorship of some sort, in which someone who broadly wants the best for all non-amplified humans ensures that no other entity can challenge their power, and takes steps to safeguard the welfare of other humans.
But I think a more likely outcome is one in which multiple people with early access to the models end up fighting over resources. Maybe by some miracle the winner of that process will be a reasonably benevolent dictator, but such competitive dynamics optimize for willingness to do whatever it takes to win. History suggests such traits are not well correlated (or perhaps even negatively correlated) with a general concern for human well-being.
I’ve read some good ideas about what to do with an aligned AGI, most notably Coherent Extrapolated Volition by MIRI, but there still remains the question of “can we actually get the interested parties involved to do something as sensible as CEV instead of just directly pursuing their own self-interest?”
Please let me know if you see anything obvious I’ve overlooked in this post. I am not a professional AI researcher, so it’s possible people have already figured out the answer to these challenges and I am simply unaware of their writings.
It seems very likely that there’s no moat between getting an AI to do what’s good for one person and getting an AI to do what’s good for humans. There is no extra technical difficulty in getting one once you know how to get the other, because both involve the same amount of “interpreting humans the way they want to be interpreted.”
Which one we get isn’t about technical problems, but about the structures surrounding AGI projects.
Right, I guess that’s the main problem I’m gesturing at here. It seems pretty likely that if we create aligned AGI, there will be more than one of them (unless whoever creates the first makes a dedicated effort to prevent the creation of others).
In that circumstance, the concentration of power dynamics I described seems to still be concerning.
I suggest that we need an intelligent ecological value pluralism, some international laws that protect value plurality and possibly also a Guardian AI which protects value plurality.
https://www.lesswrong.com/posts/Kaz9miAuxSAAuGr9z/value-pluralism-and-ai
That’s suicide for humans. See above.
Why would an AI which has been given the role of defending plurality start to think that paper clips are much more fun?
Because we have given up all autonomy and have no purpose in existing.
I may be wrong here but it sounds like Goran is proposing some limits on the power humans can exert on one another, not an all-encompassing nanny-bot.
I suppose that if one were to push the boundaries far enough in any direction, they would inevitably end up limited by the consequences of their actions for others, but it seems likely that whatever boundaries such an AI set would be far beyond the boundaries already imposed on virtually all of us by limited knowledge, technology, morality, etc.
I guess what I’m trying to say is plurality limits don’t seem all that concerning.
Which devolves to “I don’t want other human nations doing things I don’t like. I need an absolutely reliable tool I can use, something that won’t turn against me, that I can use against them”.
And so you end up with an arms race, tool AGIs, and the reality that if one faction gains a sufficiently large advantage, they will use their tools against others.
All these “we need to slow down AGI” calls are actually saying “losing is better than winning...”
The default outcome is using the “aligned” AGI to create a misaligned uncontrollable AGI, with enormous power concentrated in its hands. Continued alignment is a costly constraint that forfeits the race to the bottom.
Under my model, the modal outcome for “we have single-single aligned TAI” is something like:
Most humans live in universe-poverty: they don’t get to control any of the stars and galaxies, but they also don’t have to die or suffer, and they live in (what we would perceive as) material abundance due to some UBI-ish scheme. (The cost to whoever controls TAI of doing this is so negligible that they probably will, unless they are actively sadistic.) I am unsure what will happen with Malthusian drives: will the people who control TAI put a cap on human reproduction, or just not care enough, until humanity has grown to the point where the allotment of labor from TAI systems given by the TAI-controllers is “spread very thin”? My intuition says the former, but not very strongly.
The people who control TAI might colonise the universe, but on my best model they use the resources mainly to signal status to other people who control TAI systems. (Alternatively, they will use the resources to procreate a lot until the Malthusian limit is reached as well). This scenario means that the cosmic potential is not reached.
I don’t see why the first people to control TAI wouldn’t just upload themselves into a computer and amplify their own intelligence to the limits of physics.
Are you imagining that aligned AGI would prevent this?
I don’t see anything in my comment that conflicts with them uploading themselves. Or are you implying that uploaded superintelligent humans won’t signal or procreate anymore?
I kind of expect they’d still signal? At least to other equivalently powerful entities. I don’t really see why they would procreate other than through cloning themselves for strategic purposes.
But my point is simply that uploads of brains may not be constrained by alignment in the same way that de-novo AGIs would be. And to the extent that uploaded minds are misaligned with what we want AGI to do, that itself seems like a problem.
I’ve had similar thoughts. Two counterpoints:
This is basically misuse risk, which is not a weird problem that people need to be convinced even needs solving. To the extent AI appears likely to be powerful, society at large is already working on this. Of course, its efforts may be ineffective or even counterproductive.
They say power corrupts, but I’d say power opens up space to do what you were already inclined to do without constraints. Some billionaires, e.g. Bill Gates, seem to be sincerely trying to use their resources to help people. It isn’t hard for me to imagine that many people, if given power beyond what they can imagine, would attempt to use it to do helpful / altruistic things (at least, things they themselves considered helpful / altruistic).
I don’t in any sense think either of these are knockdowns, and I’m still pretty concerned about how controllable AI systems (whether that’s because they’re aligned, or just too weak and/or insufficiently agentic) may be used.
To me, aligning AI with humanity seems to be much EASIER than aligning it with a specific person, because common human values are much better “documented” and much more stable than the wishes of any one person.
Also, a single person in control of a powerful AI is an obvious weak point, who could be controlled by a third party or by the AI itself, thereby gaining control of the AI through them.
Given actually tractable AI architectures, this is it. This is the best we can do. We build tools that are heavily restricted in what they can remember, with every important output checked by other tools constructed a different way, and we don’t give any system a “global” RL counter, but instead have it run in short episodes where it does what it can toward a goal before the time limit expires.
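As a rough sketch of the kind of restricted, episodic tool setup described above (my own illustration; all names and structure are hypothetical, not any existing system):

```python
# Hypothetical sketch of an episodic, memory-limited tool AI whose important outputs
# are vetted by an independently constructed checker before anyone acts on them.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Episode:
    goal: str
    time_limit_s: float  # episode budget; enforcement is out of scope for this sketch

def run_tool(episode: Episode) -> str:
    # The tool sees only this episode's goal: no persistent memory across episodes
    # and no long-horizon ("global") reward counter to optimize.
    return f"proposed action for: {episode.goal}"

def independent_check(output: str) -> bool:
    # A checker built a different way vets every important output.
    # Here it is just a placeholder keyword filter.
    return "forbidden" not in output

def run_episode(goal: str, time_limit_s: float = 60.0) -> Optional[str]:
    episode = Episode(goal=goal, time_limit_s=time_limit_s)
    output = run_tool(episode)
    return output if independent_check(output) else None

print(run_episode("draft a logistics plan"))
```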
This is the reality. Humans will be able to do anything they want. That’s the best alignment can do.
Asking for a “guardian” AGI that rules the planet instead of humans is just choosing suicide. (And so human groups armed with these tool AIs will resort to as much violence as is necessary, with no limits, if someone attempts to build such an AGI and allow it to rule.)
There are, unfortunately, a large number of routes that lead to a new round of global wars once humans have these tool AI systems.
Say more about why it would be suicide? It seems to me that it would only be able to succeed through diplomacy, and would take decades before enough trust had been established. But I see no fundamental reason that war and conflict couldn’t be nearly completely ended forever, once the diffusion of coprotective strategies is thorough enough.
Certainly pivotal acts are intense acts of war and beget retaliation, as you say.
My definition of a guardian AI: a machine allowed to do anything it wants, but which we have tried to make “be good”. It has military capabilities, can self-improve, and we have no edit access.
That’s suicide. Murder pills etc.