I agree with your points that intent-aligned AGI is a dividing force if different humans/groups have control of multiple AGIs, and for the reasons you mention. I wrote about it in If we solve alignment, do we die anyway?.
I read all of this because it sets out to address the curious divide between classical worriers and ML-oriented alignment optimists. It’s critical to understand how hard alignment will be, and the discussion is strangely lacking.
I agree with your main point that LLMs are safe and a positive update, but I think you’re dramatically overstating the conclusions we can draw from that, and how much it invalidates EY-style OG concerns.
Current LLMs are safe, yes. And this direction in AI is an update against classical concerns.
But do you really think we’re going to stop with tool AI, and not turn them into agents? With a good enough LLM, it just takes one prompt called repeatedly:
Act as an agent pursuing goal (x). Use these tools (y) to gather information and take actions as appropriate.
We will do this and whatever other scaffolding is useful because we want agents that get stuff done, not just oracles that tell us how to do stuff. And because it will be easy, interesting, and fun.
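To be concrete, the scaffolding can be as simple as the sketch below (hypothetical and deliberately simplified: call_llm stands in for whatever LLM API you have access to, and the tools and reply parsing are stubs; real agent frameworks are fancier, but they follow the same prompt-in-a-loop pattern).

```python
# A minimal sketch of the loop described above, not a real implementation.
# `call_llm` is a hypothetical placeholder for an actual LLM API call,
# and the tools and reply parsing are deliberately naive stubs.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for a real LLM API call")

TOOLS = {
    "search": lambda arg: f"(pretend search results for {arg!r})",
    "note": lambda arg: "(noted)",
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = []  # tool calls and results fed back into every prompt
    for _ in range(max_steps):
        prompt = (
            f"Act as an agent pursuing goal ({goal}). "
            f"Use these tools ({', '.join(TOOLS)}) to gather information "
            "and take actions as appropriate.\n"
            f"History: {history}\n"
            "Answer with 'TOOL <name> <input>' to act, or 'DONE <result>' when finished."
        )
        reply = call_llm(prompt)
        if reply.startswith("DONE"):
            return reply[4:].strip()
        _, name, arg = reply.split(" ", 2)             # naive parse of 'TOOL <name> <input>'
        history.append((name, arg, TOOLS[name](arg)))  # run the tool, feed the result back in
    return "step budget exhausted"
```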
When you do this, particularly when the agent thinks to itself and learns continuously, you re-introduce most of the classical concerns about goal misspecification and optimization. And you have new ones: your AGI’s core thought generator is a writhing mess of semantics and pseudo-goals, copied from human cognition.
Nobody was ever concerned about AI at our current level of capability and optimization.
This is basically identical to the error made by Pope and Belrose in “AI is easy to control”. They jump from noting that things are going well now to assuming that this will all generalize to full agentic AGI. That generalization is promising, but it is not remotely a done deal or highly predictable.
For a more thorough (and tactful) dismantling of this claim, see Byrnes’ Thoughts on “AI is easy to control” by Pope & Belrose.
The other comment addresses just how wrong it is to equate understanding human ethics with following human ethics (presumably you didn’t mean Genghis Khan’s ethics, or those of worse humans). But that understanding could be leveraged to help with the alignment problem; see The (partial) fallacy of dumb superintelligence.
It seems you’ve simplified for the masses, and written not just to persuade but to excite and incite. This style of argumentation is a great way to get blog subscribers. It will also cause arguments and divide previously rational people into polarized pro- and anti-x-risk camps. It is not a good way to advance our understanding of AI risks or our odds of survival.
But if it is the case that agentic AI is an existential risk, then actors could choose not to develop it, which makes this a coordination problem, not an alignment problem.
We already have aligned AGI, we can coordinate to not build misaligned AGI.
How can we solve that coordination problem? I have yet to hear a workable idea.
We agree that far, then! I just don’t think that’s a workable strategy. (You also didn’t state that big assumption in your post: that AGI is still dangerous as hell, and we merely have a route to really useful AI that isn’t.)
The problem is that we don’t know whether agents based on LLMs are alignable. We don’t have enough people working on the conjunction of LLMs/deep nets and real AGI, so everyone building it is going to optimistically assume it’s alignable. The Yudkowsky et al. arguments for alignment being very difficult are highly incomplete; they aren’t convincing, because they shouldn’t be. But they make good points.
If we refuse to think about aligning LLM-based AGI architectures because it sounds risky, it seems pretty certain that people will try to build them without our help. Even convincing them not to would require grappling in depth with why alignment would or wouldn’t work for that type of AGI.
This is my next project!
We don’t have “aligned AGI”. We have neither “AGI” nor an “aligned” system. We have sophisticated human-output simulators that don’t have the generality to produce effective agentic behavior when looped, but that also don’t follow human intentions with the reliability you’d want from a super-powerful system (which, fortunately, they aren’t).