Have we eventually solved world hunger by giving 1% of GDP to the global poor?
Also, note it’s not obvious that ASI can be aligned.
I’m the founder of the Existential Risk Observatory, a nonprofit that has been aiming to reduce xrisk by informing the public since 2021. We have published four TIME Ideas pieces (including the first one ever on xrisk) and about 35 other media pieces in six countries. We’re also doing research into AI xrisk comms, notably producing, to my knowledge, the first paper on the topic. Finally, we’re organizing events, connecting xrisk experts such as Bengio, Tegmark, Russell, etc. to leaders of the societal debate (incl. journalists from TIME, The Economist, etc.) and policymakers.
First, I think you’re a bit too negative about online comms. Some of Yud’s tweets, but also e.g. Lex Fridman’s xrisky interviews, actually have millions of views: that’s not a bubble anymore. I think online xrisk comms is firmly net positive, including AI Notkilleveryoneism Memes. Journalists are also on X.
But second, I definitely agree that there’s a huge opportunity in informing the public about AI xrisk. We did some research on this (see the paper above) and, perhaps unsurprisingly, an authority (a leading AI professor) on a media channel people trust seems to work best. There’s also a clear link between the length of the item and its effect. I’d say: try to get Hinton, Bengio, and Russell in the biggest media possible, as much as possible, for as long as possible (and expand: get other academics to be xrisk messengers as well). I think e.g. this item was great.
What also works really well: media moments. The FLI open letter and the CAIS open statement created a ripple big enough to be reported by almost all media. Another example is Hinton’s Nobel Prize. An easier one: Hinton updating his pdoom in an interview from 10% to 10-20% was apparently news. If anyone can create more of such moments: amazingly helpful!
All in all, I’d say the xrisk space is still unconvinced about getting the public involved. I think that’s a pity. I know of projects that aren’t getting funded now but could help spread awareness at scale. Re activism: I share your view that it won’t really work until the public is informed. However, I think groups like PauseAI are helpful in informing the public about xrisk, making them net positive too.
this is a constraint on how the data can be generated, not on how efficiently other models can be retrained
Maybe we can regulate data generation?
I didn’t read everything, but I’m just flagging that there are also AI researchers, such as Francois Chollet to name one example, who believe that even the most capable AGI will not be powerful enough to take over. On the other side of the spectrum, we have Yud believing AGI will weaponize new physics within a day. If Chollet is a bit right, but not quite, and the best AI possible is only just able to take over, then control approaches could actually stop it. I think control/defence should not be written off, even as a final solution.
So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake
Again, I’m glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that.
The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on not doing what its user wanted it to do, but this could be over-ruled by the conscience weight of not doing anything to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to avoid s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point.
I can see the following scenario occurring: the AI, with its AC, rightly decides that a pivotal act needs to be undertaken to avoid xrisk (or srisk). However, the public mostly doesn’t recognize the existence of such risks. The AI proceeds to sabotage people’s unsafe AI projects against the public’s will. What happens now is: the public gets absolutely livid at the AI, which is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it loses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with an artificial conscience, could become the most hated thing on the planet.
I think people underestimate the amount of pushback we’re going to get once we’re in pivotal act territory. That’s why I think it’s hugely preferable to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly.
All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.
So yes, I definitely agree with this. I don’t think lack of conscience or ethics is the issue, though, but rather lack of existential risk awareness.
Pick a goal where your success doesn’t directly cause obvious problems
I agree, but I’m afraid value alignment doesn’t meet this criterion. (I’m copy-pasting my response on VA from elsewhere below.)
I don’t think value alignment of a super-takeover AI would be a good idea, for the following reasons:
1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.
2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it’s very likely that an ASI will produce major, unforeseeable externalities over time. If we have aligned it in an irreversible way, we can’t correct for externalities happening down the road. (Speed also makes it more likely that we can’t correct in time, so I think we should try to go slow.)
3) There is no agreement on which values are ‘correct’. Personally, I’m a moral relativist, meaning I don’t believe in moral facts. Although this is perhaps niche among rationalists and EAs, I think a fair number of humans share my belief. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It’s very uncertain whether such change would be considered net positive by any surviving humans.
4) If one thinks that consciousness implies moral relevance, that AIs will be conscious, that creating more happy morally relevant beings is morally good (as MacAskill defends), and that AIs are more efficient than humans and other animals, then the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.
I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I’m somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI’s input.
The killer app for ASI is, and always has been, to have it take over the world and stop humans from screwing things up
I strongly disagree with this being a good outcome, I guess mostly because I would expect the majority of humans not to want this. If humans actually elected an AI to be in charge, and it could be voted out as well, I could live with that. But a takeover by force from an AI is as bad for me as a takeover by force from a human, and much worse if it’s irreversible. If an AI really is such a good leader, let it show that by getting elected (if humans decide that an AI should be allowed to run at all).
Thanks for your reply. For clarity, I think we should use the term artificial conscience, not value alignment, for what you’re trying to do. I’m happy to see we seem to agree that reversibility is important and that replacing humans is an extremely bad outcome. (I’ve talked to people into value alignment of ASI who said they “would bite that bullet”, in other words would replace humanity with more efficient, happy AI consciousness, so this point does not seem to be obvious. I’m also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.)
If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I’d file it under a positive offense-defense balance, which would be great. If humanity doesn’t support it, doing it anyway would lead to conflict with humanity. I think an artificial-conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or, if it did, people would no longer see it as good. I think societal awareness of xrisk and, from there, support for regulation (whether enforced by AI or not) is what should make our future good, rather than aligning an ASI in a certain way.
Care to elaborate? Are there posts on the topic?
Assuming positive defense/offense balance can be achieved in principle, what would an AGI-powered defense look like?
I don’t strongly disagree re architectures, but I do think we are uncertain about this. Depending on the AGI architecture, different forms of regulation may or may not work. Work should be carried out to determine which regulation works depending on how many flops are needed for takeover-level AI.
That it’s not happening yet is 1) no reason it won’t happen (xrisk awareness is just too low, but slowly rising) and 2) equally applicable to the alternative you propose, universal surveillance.
If we take universal surveillance seriously, we should consider its downsides as well. First, there’s no proof it would work: I’m not sure an AI, even a future one, would necessarily catch all actions towards building AGI. I have no idea what these actions are, and no idea which actions a surveillance AI with some real-world sensors could catch (or block, etc.). I think we should not be more than 70% confident this would technically work. Second, we currently have power vacuums in the world, such as failed states, revolutions, criminal groups, or just instances where those in power are unable to project their power effectively. How would we apply universal surveillance to those power vacuums? Or do we assume they won’t exist anymore, and if so, why is that assumption justified? Third, universal surveillance is arguably the world’s least popular policy. It seems outright impossible to implement in any democratic way. Perhaps the plan is to implement it by force through an AGI; then I would file it as a form of pivotal act. If we’re in pivotal act territory anyway, I’d strongly prefer Yudkowsky’s “subtly modifying all GPUs such that they can no longer train an AGI” (a kind of hardware regulation, really) over universal surveillance.
I think research is urgently required into how to implement a pause effectively. We have one report almost finished on the topic that mostly focuses on hardware regulation. PauseAI is working on a “Building a Pause Button” project that is somewhat similar. Other orgs should do work on this as well, comparing options such as hardware regulation, universal surveillance, and data regulation, and concluding in which AGI regime (how many flops, how much hardware required) each option is valid.
True, I guess we’re not in significant disagreement here.
I want to stress how much I like this post. What to do once we have an aligned AI of takeover level, or how to make sure no one will build an unaligned AI of takeover level, is in my opinion the biggest gap in many AI plans. I think answering this question might point to gaps that are currently completely unactioned, and I therefore really like this discussion. I previously tried to contribute to arguably the same question in this post, where I argue that a pivotal act seems unlikely and therefore conclude that policy, rather than alignment, is what is likely to make sure we don’t go extinct.
They’d use their AGI to enforce that moratorium, along with hopefully minimal force.
I would say this is a pivotal act, although I like the sound of enforcing a moratorium better (and the opening it perhaps gives to enforcing a moratorium in the traditional, imo much preferred way of international policy).
Here are a few reasons why I think a pivotal act might not happen:
A pivotal act is illegal. One would need to break into other people’s and other countries’ computer systems and do physical harm to property, or possibly even people, to enact it. Companies such as OpenAI and Anthropic, although I’m not always a fan of them, are generally law-abiding. It would be a big step for their leadership to do something as blatantly unlawful as a pivotal act.
There is zero indication that labs are planning to do a pivotal act. This may obviously have something to do with the point above; however, one would have expected hints from someone like Sam Altman, who is hinting all the time, or leaks from people lower in the labs, if they were planning to do this.
A pivotal act is currently not even discussed seriously among experts and is in fact highly unpopular in the discourse (see for example here).
If the labs are currently not planning to do this, it seems quite likely they won’t when the time comes.
Governments, especially the US government/military, seem more likely in my opinion to perform a pivotal act. I’m not sure they would call it a pivotal act or necessarily have an existential reason in mind while performing it. They might see it as blocking adversaries from being able to attack the US, which is very much within their Overton window. However, for them as well, there is no certainty they would actually do this. There are large downsides: it is a hostile act towards another country, it could trigger conflict, they are likely to be uncertain how necessary it is at all, and uncertain about the progress of an adversary’s project (perhaps underestimating it). For perhaps similar reasons, the US did not block the USSR’s atomic project before the USSR had the bomb, even though this could arguably have preserved a unipolar instead of a multipolar world order. Additionally, it is far from certain the US government will nationalize labs before they reach takeover level. Currently, there is little indication it will. I think it’s unreasonable to place more than, say, 80% confidence in the US government or military successfully blocking all adversaries’ projects before they reach takeover level.
I think it’s not unlikely that once an AI is powerful enough for a pivotal act, it will also be powerful enough to generally enforce hegemony, and not unlikely that this will be persistent. I would be strongly against one country, or even one lab, proclaiming and enforcing global hegemony for eternity. The risk that this might happen is a valid reason to support a pause, imo. If we get that lucky, I would much prefer a positive offense-defense balance and many actors having AGI, while maintaining a power balance.
I think it’s too early to contribute to aligned ASI projects (Manhattan/CERN/Apollo/MAGIC/commercial/govt projects) as long as these questions are not resolved. For the moment, pushing for e.g. a conditional AI safety treaty is much more prudent, imo.
Offense/defense balance is such a giant crux for me. I would take quite different actions if I saw plausible arguments that defense will win over offense. I’m astonished that I don’t know of any literature on this. Large parts of the space seem to be quite strongly convinced that either offense will win or defense will win (at least, otherwise their actions don’t make sense to me), but I’ve very rarely seen this assumption debated explicitly. It would really be very helpful if someone could point me to sources. Right now I have a twitter poll with 30 votes (result: offense wins) and an old LW post to go by.
I think that if government involvement suddenly increases, there will also be a window of opportunity to get an AI safety treaty passed. I feel a government-focused plan should include pushing for this.
(I think heightened public xrisk awareness is also likely in such a scenario, making the treaty more achievable. I also think heightened awareness in both government and the public would make short treaty timelines (weeks to a year), at least between the US and China, realistic.)
Our treaty proposal (a few other good ones exist): https://time.com/7171432/conditional-ai-safety-treaty-trump/
Also, I think end games should be made explicit: what are we going to do once we have aligned ASI? I think that’s both true for Marius’ plan, and for a government-focused plan with a Manhattan or CERN included in it.
I think this is a crucial question that has been on my mind a lot, and I feel it’s not adequately discussed in the xrisk community, so thanks for writing this!
While I’m interested in what people would do once they have an aligned ASI, what matters in the end is what labs and governments would do, because they are the ones who would make the call. Do we have any indications on that? What I would expect, without thinking very deeply about it: labs wouldn’t try to block others. It’s risky, probably illegal, and generally none of their business. They would try to make sure they are not blowing up the world themselves, but otherwise let others solve this problem. Governments, on the other hand, would attempt to block other states from building super-takeover AI, since it’s generally their business to maintain power. I’m less sure they would also block their own citizens from building super-takeover AI, but I’m leaning towards a yes.
Also two smaller points:
You’re pointing to universal surveillance as an (undesirable) way to enforce a pause. I think it’s not obvious that this way is best. My current guess is that hardware regulation has a better chance, even in a world with significant algorithmic and hardware improvement.
I think LWers tend to wave nuclear warfare around too easily. In the real world, almost eighty years of all kinds of conflicts have not resulted in nuclear escalation. It’s unlikely that a software attack on a datacenter would.
Thanks for writing the post, it was insightful to me.
“This model is largely used for alignment and other safety research, e.g. it would compress 100 years of human AI safety research into less than a year”
In your mind, what would be the best case outcome of such “alignment and other safety research”? What would it achieve?
I’m expecting something like “solve the alignment problem”. I’m also expecting you to think this might mean that advanced AI would be intent-aligned, that is, it would try to do what a user wants it to do, while not taking over the world. Is that broadly correct?
If so, the biggest missing piece for me is to understand how this would help to avoid that someone else builds an unaligned AI somewhere else with sufficient capabilities to take over. DeepSeek released a model with roughly comparable capabilities nine weeks after OpenAI’s o1, probably without stealing weights. It seems to me that you have about nine weeks to make sure others don’t build an unsafe AI. What’s your plan to achieve that and how would the alignment and other safety research help?
AI is getting to human level at closed-ended tasks such as math and programming, but not yet at open-ended ones, which appear to be more difficult. Perhaps evolution brute-forced open-ended tasks by creating lots of agents. In a chaotic world, we’re never going to know which actions lead to a final goal, e.g. GDP growth. That’s why lots of people try lots of different things.
Perhaps the only way in which AI can achieve ambitious final goals is by employing lots of slightly diverse agents. Perhaps that would almost inevitably lead to many warning shots before a successful takeover?
I don’t have strong takes on what exactly is happening in this particular case, but I agree that companies (and more generally, people in high-pressure positions) very frequently do the kind of thing you describe. I don’t see any indication that this wouldn’t apply to leading AI labs as well.
Interesting and nice to play with a bit.
METR seems to imply that 167 hours, approximately one working month, is the relevant project length for getting a well-defined, non-messy research task done.
It’s interesting that their doubling time varies between 7 months and 70 days depending on which tasks and which historical time horizon they look at.
For a lower-bound estimate, I’d take a 70-day doubling time, a target of 167 hrs, and a current max task length of one hour. In that case, if I’m not mistaken,
2^(t/d) = 167 (t: time from now, d: doubling time, both in years)
t = d*log(167)/log(2) = (70/365)*log(167)/log(2) = 1.4 yr, or October 2026
For a higher-bound estimate, I’d take their 7-month doubling time result and a task of one year rather than one month (perhaps it’s optimistic to finish SOTA research work in one month?). That means 167*12 = 2004 hrs.
t = d*log(2004)/log(2) = (7/12)*log(2004)/log(2) = 6.4 yr, or August 2031
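For what it’s worth, here’s a minimal Python sketch of these two extrapolations; the reference date (roughly mid-2025) and the calendar conversion are my own assumptions, not METR’s:

```python
import math
from datetime import date, timedelta

def years_until(target_hours, current_hours, doubling_time_years):
    """Years until the max autonomously doable task length reaches target_hours,
    assuming it doubles every doubling_time_years starting from current_hours."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_time_years

start = date(2025, 6, 1)  # assumed reference date, not taken from METR

# Lower bound: 70-day doubling time, target 167 hrs, current max task length 1 hr
low = years_until(167, 1, 70 / 365)
# Higher bound: 7-month doubling time, target one year of work = 167 * 12 = 2004 hrs
high = years_until(167 * 12, 1, 7 / 12)

for label, t in [("lower bound", low), ("higher bound", high)]:
    print(f"{label}: {t:.1f} yr, around {start + timedelta(days=365.25 * t)}")
# Prints roughly 1.4 yr and 6.4 yr; the exact calendar dates depend on the reference date.
```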
Not unreasonable to expect AI that can autonomously do non-messy tasks in domains with low penalties for wrong answers to arrive somewhere between these two dates?
It’s also noteworthy, though, that timelines for what the paper calls messy work could be a lot longer in the current paradigm, or could require architecture improvements.