Some thoughts on this:
One response I notice having to your points is: why the focus on value alignment?
“We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition.”
And conversely: “if we have value alignment, I don’t think there’s a guarantee that we wind up in a basin of convergent human values, so you still have the problem of—whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetos on that?”
(Using quotes bc these feel more like ‘text completions from system 1’ than all-things-considered takes from system 2.)
You’ve correctly noted why lots of people may not be safe, even in a physical sense, even assuming value alignment/corrigibility/intent alignment/instruction following is solved. I do think you are correct that there is no guarantee we wind up in a basin of convergence; I’d even argue that values are unlikely to converge and will instead diverge, because there is no single moral reality and there are infinitely many correct moralities/moral realities. So yeah, the oversight problem is pretty severe.
Maybe there’s a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
- It’s pretty likely we end up with AIs that care about at least some of human value, e.g. valuing conscious experience (at least if AGIs resemble current LLMs, which seem to imprint on humans quite a lot).
- AI experiences could themselves be deeply morally valuable, even if the AIs aren’t very human-aligned (though you might need them to at minimum care about consciousness, so they don’t optimize it away).
- A some-humans-led world could be at least as bad as an AI-led world, and very plausibly could have negative rather than zero value.

I think this is partly down to founder effects, where Eliezer either didn’t buy these ideas or didn’t want to emphasize them (bc they cut against the framing of “alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana”).
So I’ll state a couple of things here.
On your first point, I think AGIs will probably be quite different from current LLMs, mostly because future AIs will have continuous learning, long-term memory, and better data/sample efficiency, and because the most accessible way to make AIs more capable will route through using more RL.
On your second point, this, as always, depends on your point of view, because once again there’s no consistent answer that holds across all valid moralities.
On your third point, again this depends on your point of view, but if I use my inferred model of human values, where most humans strongly disvalue dying/being tortured, I agree that a some-humans-led world is at least as bad as an AI-led world. That’s because I think most of what makes humans willing to be prosocial when it’s low-cost to do so is, unfortunately, held up by incentives that get absolutely shredded once some humans can simply stop depending on other human beings for a rich life, rather than by what the human values internally.
I also notice some internal tension where part of me is like “the AIs don’t seem that scary in Noosphere’s world”. But another part is like “dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society—why isn’t that scary?”
I think where this is coming from is that I tend to focus on “transition dynamics” on the way to the AGI future rather than “equilibrium dynamics” of the AGI future. In particular, I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup; but I don’t see that last one, since the returns to scale seem so good).
Yes, this explains why I was more negative than you in your post. The point was to argue against people like @Matthew Barnett, and a lot of other people’s arguments, that AI alignment doesn’t need to be solved because AIs will follow human-made laws and there will be enough positive-sum trades that the AIs, even if selfish, will decide not to kill humans.
And my point is that, unfortunately, in a post-AI-takeover world any trade between most humans and AIs would be closer to the AI giving away stuff in return for nothing given up by the human, because the human as a living entity has zero, or even negative, value from an economics perspective, and while their land and property/capital are valuable, those are very easily stolen.
So if an AI didn’t terminally value the survival/thriving of people who have zero or negative value in an economic sense, then it’s quite likely that outright killing the human, or warping them severely, would unfortunately be in the AI’s interest.
In essence, I was trying to say that, conditional on you not controlling the AI (which I think happens in the long run), you need much stronger assumptions about the AI’s values in order to survive than current humans need within current human institutions.
So maybe I’d say “if you’re not talking about a classic AI takeover scenario, and you’re imagining a somewhat gradual takeoff,
my attention gets drawn to the ways humans and fundamental competitive dynamics screw things up
the iterative aspect of gradual takeoff means I’m less worried about alignment on its own. (still needs to get solved, but more likely to get solved.)”
I do agree that in more gradual takeoffs, humans/competitive dynamics matter more and alignment is more likely to be solved, which defuses the implications I drew (with the caveat that the standard for what counts as an aligned AI will have to rise to extreme levels over time, in a way people are not prepared for). So I agree the alignment problem is less urgent. That said, at least in the long run, and arguably even in the medium term, a lot of the problems of competitive dynamics and human flaws screwing things up will ultimately require, as a baseline, leaders who actually value the survival and thriving of people/beings with zero power, because if you do not have this, none of the other proposed solutions work. And I think it’s really important to say that, compared to the 19th-21st century era in democracies, values are going to matter a lot more to whether humans thrive or die.
Not the original commenter, but I’d argue that gradient descent on something that can also make architectural updates to itself may be possible, though I don’t know much about how gradient descent works, so this might not actually be possible.
But I do think the ability for the AI to make small, continual architectural updates to itself is actually pretty important. I’d argue that a lot of the reason AI is used very little so far is that if it cannot zero- or one-shot a problem, it basically has no ability to learn from its failures, because it has zero neuroplasticity after training. And if we assume that some level of learning from failure in real life is very important (which I agree with), then methods to make continuous learning practical will be incentivized, because all of the leading labs’ valuations and profits depend on the assumption that they will soon be able to automate away human workers, and continuous learning is a major, major blocker to that goal.
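To make the “zero neuroplasticity after training” point concrete, here’s a minimal, purely illustrative sketch (in PyTorch, with invented names like `attempt` and `learn_from_failure`; this is not anyone’s actual setup) of the weaker, weight-level version of the idea: a deployed model that takes a small gradient step whenever it fails a task, instead of staying frozen. Architectural self-modification would be a further step beyond this.

```python
# Hypothetical sketch of continual learning at deployment time, not a real system.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy "deployed" model: maps a 4-dim input to a 2-class prediction.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def attempt(x, y):
    """One deployment attempt: predict and report whether we got it right."""
    with torch.no_grad():
        pred = model(x).argmax(dim=-1)
    return bool((pred == y).all())

def learn_from_failure(x, y):
    """A small weight update on the failed example: the 'neuroplasticity'
    that a frozen, train-once model lacks."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# Simulated stream of tasks arriving after training is "over".
stream = [(torch.randn(1, 4), torch.randint(0, 2, (1,))) for _ in range(20)]

for i, (x, y) in enumerate(stream):
    if attempt(x, y):
        continue                      # solved zero-shot, nothing to update
    learn_from_failure(x, y)          # continual learner: update, then retry
    print(f"task {i}: failed first try, took a gradient step, "
          f"retry {'succeeded' if attempt(x, y) else 'still failed'}")
```

A frozen model would just fail the same tasks forever; the point of the loop above is that each failure leaves a small trace in the weights.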
More from Dwarkesh below, including a relevant quote:
https://www.dwarkesh.com/p/timelines-june-2025