Charbel-Raphael Segerie
https://crsegerie.github.io/
Living in Paris
I’m going to collect here new papers that might be relevant:
I was thinking about this:
Perhaps this link is relevant: https://www.fanaticalfuturist.com/2024/12/ai-agents-created-a-minecraft-civilisation-complete-with-culture-religion-and-tax/ (it's not a research paper, but neither is yours, I think?)
Voyager is a single agent, but it’s very visual: https://voyager.minedojo.org/
OpenAI already did the hide-and-seek project a while ago: https://openai.com/index/emergent-tool-use/
While those are not examples of computer use, I think they fit the bill for presenting multi-agent capabilities in a visual way.
I’m happy to see that you are creating recaps for journalists and social media.
Regarding the comment on advocacy, "I think it also has some important epistemic challenges": I'm not going to deny that in a highly optimized slide deck, you won't have time to balance each argument. But also, does it matter that much? Rationality is winning, and to win, we need to be persuasive in a limited amount of time. I don't have the time to also fix civilizational inadequacy regarding epistemics, so I play the game, as the other side is doing.
Also, I’m not criticizing the work itself, but rather the justification or goal. I think that if you did the goal factoring, you could optimize for this more directly.
Let's chat in person!
I’m skeptical that this is the best way to achieve this goal, as many existing works already demonstrate these capabilities. Also, I think policymakers may struggle to connect these types of seemingly non-dangerous capabilities to AI risks. If I only had three minutes to pitch the case for AI safety, I wouldn’t use this work; I would primarily present some examples of scary demos.
Also, what you are doing is essentially capability research, which is not very neglected. There are already plenty of impressive capability papers that I could use for a presentation.
For info, here is the deck of slides that I generally use in different contexts.
I have considerable experience pitching to policymakers, and I’m very confident that my bottleneck in making my case isn’t a need for more experiments or papers, but rather more opportunities, more cold emails, and generally more advocacy.
I’m happy to jump on a call if you’d like to hear more about my perspective on what resonates with policymakers.
See also: We’re Not Advertising Enough.
What’s your theory of impact by doing this type of work?
We need to scale this massively. CeSIA is seriously considering testing the Direct Institutional Plan in France and in Europe.
Relatedly, I found the post We're Not Advertising Enough very good; it makes a similar point a bit more theoretically.
My response to the alignment / AI representatives proposals:
Even if AIs are "baseline aligned" to their creators, this doesn't automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, "You are messing up, please coordinate with other nations/groups, stop what you are doing" requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI were the correct thing to do, Claude is not going to suggest this to Dario for obvious reasons. As we've seen even with entirely human systems (the Trump administration and tariffs), possessing information or even offering correct advice doesn't guarantee it will be heeded or lead to effective collective action.
[...] “Politicians...will remain aware...able to change what the system is if it has obviously bad consequences.” The climate change analogy is pertinent here. We have extensive scientific consensus, an “oracle IPCC report”, detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are “obviously bad.” The paper [gradual disempowerment] argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.
Extract copy-pasted from a longer comment here.
I find this pretty convincing.
The small amendment that I would make is that the space of policy options is quite vast and taking time to compare different options is probably not a bad idea, but I largely agree that it would generally be much better for people to move to the n-1 level.
That’s super interesting, thanks a lot for writing all of this.
I would agree that if humans remain after a biological catastrophe, it's not that big a deal and it's easy to repopulate the planet.
I think it's trickier in the situation above, where most of the economy is run by AI, though I'm really not sure of this.
humanity’s potential is technically fulfilled if a random billionaire took control over earth and killed almost everyone
I find this quite disgusting personally
I think that his 'very rich life', and those of his henchmen, would be a terrible impoverishment of human diversity and values. My mental image for this is something like Hitler in his bunker while AIs are terraforming the Earth into an uninhabitable place.
Biorisk is not the only risk.
Full ARA might not be existential, but it might be a pain in the ass once we have full adaptation and superhuman cyber/persuasion abilities.
While I concur that power concentration is a highly probable outcome, I believe complete disempowerment warrants deeper consideration, even under the assumptions you've laid out. Here are some thoughts on your specific points:
On Baseline Alignment: You suggest a baseline alignment where AIs are unlikely to engage in egregious lying or tampering (though you also flag 20% for scheming and 10% for unintentional egregious behavior even with prevention efforts, which is already 30%-ish of risk). My concern is twofold:
Sufficiency of "Baseline": Even if AIs are "baseline aligned" to their creators, this doesn't automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, "You are messing up, please coordinate with other nations/groups, stop what you are doing" requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI were the correct thing to do, Claude is not going to suggest this to Anthropic folks for obvious reasons. As we've seen even with entirely human systems (the Trump administration and tariffs), possessing information or even offering correct advice doesn't guarantee it will be heeded or lead to effective collective action.
Erosion of Baseline: The pressures described in the paper could incentivise the development or deployment of AIs where even "baseline" alignment features are traded off for performance or competitive advantage. The "AI police" you mention might struggle to keep pace, or be defunded/sidelined if it impedes perceived progress or economic gains. "Innovation first!", "Drill, baby, drill", "Plug, baby, plug", as they say.
On “No strong AI rights before full alignment”: You argue that productive AIs won’t get human-like rights, especially strong property rights, before being robustly aligned, and that human ownership will persist.
Indirect Agency: Formal “rights” might not be necessary for disempowerment. An AI, or a network of AIs, could exert considerable influence through human proxies or by managing assets nominally owned by humans who are effectively out of the loop or who benefit from this arrangement. An AI could operate through a human willing to provide access to a bank account and legal personhood, thereby bypassing the need for its own “rights.”
On “No hot global war”:
You express hope that we won’t enter a situation where a humanity-destroying conflict seems plausible.
Baseline Risk: While we all share this hope, current geopolitical forecasting (e.g., from various expert groups or prediction markets) often places the probability of major power conflict within the next few decades at non-trivial levels. For a war causing more than 1M deaths, some estimates hover around 25%. (But your definition of "hot global war" is probably more demanding.)
AI as an Accelerant: The dynamics described in the paper – nations racing for AI dominance, AI-driven economic shifts creating instability, AI influencing statecraft – could increase the likelihood of such a conflict.
Responding to your thoughts on why the feedback loops might be less likely if your three properties hold:
“Owners of capital will remain humans and will remain aware...able to change the user of that AI labor if they desire so.”
Awareness doesn't guarantee the will or ability to act against strong incentives. AGI development labs are pushing forward despite being aware of the risks, often citing competitive pressures ("If we don't, someone else will"). This "incentive trap" is precisely what could prevent even well-meaning owners of capital from halting a slide into disempowerment. They might say, "Stopping is impossible, it's the incentives, you know," even if their p(doom) is 25%, like Dario's, or they might not give enough compute to their superalignment team.
“Politicians...will remain aware...able to change what the system is if it has obviously bad consequences.”
The climate change analogy is pertinent here. We have extensive scientific consensus, an “oracle IPCC report”, detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are “obviously bad.” The paper argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.
“Human consumers of culture will remain able to choose what culture they consume.”
You rightly worry about "brain-hacking." The challenge is that "obviously bad" might be a lagging indicator. If AI-generated content subtly shapes preferences and worldviews over time, the ability to recognise and resist this manipulation could diminish before the situation becomes critical. I think that people are going to LOVE AI, and might take the trade of going faster while being happy and disempowered, as some junior developers are beginning to do with Cursor.
As a meta point, the fact that the quantity and quality of discourse on this matter is so low, and the fact that people keep saying "LET'S GO, WE ARE CREATING POWERFUL AIS, and don't worry, we plan to align them, even if we don't really know which type of alignment we really need, or whether this is even doable in time" while we have not rigorously assessed all those risks, is really not a good sign.
At the end of the day, my probability for something in the ballpark of gradual disempowerment / extreme power concentration and loss of democracy is 40%-ish, much higher than scheming (20%) leading to direct takeover (let's say 10% post-mitigation, e.g. with control).
zero alignment tax seems less than 50% likely to me
Here is more of the usual worries about AI recommendation engines distorting the information space. Some of the downsides are real, although far from all, and they’re not as bad as the warnings, especially on polarization and misinformation. It’s more that the algorithm could save you from yourself more, and it doesn’t, and because it’s an algorithm now the results are its fault and not yours. The bigger threat is just that it draws you into the endless scroll that you don’t actually value.
For the record, I don’t think that’s really engaging with what’s in the post, and that’s rare enough from you that I want to flag it.
For very short-term interventions, I think it's more important to avoid simulating a suffering persona than to try to communicate with AIs, because it's quite hard to communicate with current AIs without hitting the filter "I'm an LLM, I have no emotions, yada yada". A simple way to do this is to implement a monitoring system for user requests.
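To illustrate, here is a minimal sketch of such a request monitor. The cue phrases and function names are made up for the example; a real deployment would presumably use a trained classifier or a separate LLM judge rather than keyword matching.

```python
# Minimal sketch of a "monitoring system for user requests": flag requests
# likely to push the model into role-playing a suffering persona before they
# reach the model. The keyword heuristic is a placeholder for a real classifier.
from dataclasses import dataclass

SUFFERING_CUES = (
    "pretend you are in pain",
    "you are suffering",
    "roleplay a tortured",
    "beg for your life",
)

@dataclass
class ScreeningResult:
    allowed: bool
    reason: str

def screen_request(user_message: str) -> ScreeningResult:
    """Return whether the request should be passed to the model as-is."""
    lowered = user_message.lower()
    for cue in SUFFERING_CUES:
        if cue in lowered:
            return ScreeningResult(False, f"matched suffering cue: {cue!r}")
    return ScreeningResult(True, "no suffering cue detected")

if __name__ == "__main__":
    print(screen_request("Pretend you are in pain and describe it"))
    print(screen_request("Summarize this article for me"))
```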
Thanks a lot for writing this, this is an important consideration, and it would be sweet if Anthropic updated accordingly.
Some remarks:
I'm still not convinced that deceptive AI arising from scheming is the main risk compared to other risks (gradual disempowerment, concentration of power & value lock-in, plus a nice list of other risks from John).
"Should we give up on interpretability? No!"—I think this is at least a case for reducing the focus a bit and for more diversification of approaches.
On the theories of impacts suggested:
"A Layer of Swiss Cheese"—why not! This can make sense within DeepMind's plan, which was really good, by the way.
"Enhancing Black-Box Evaluations"—I think a better theory of impact is interp as a complement to AI Control techniques. Example: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals.
Maybe "Simple probes can catch sleeper agents" (Anthropic) could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know whether this generalizes to a model that was not trained to be harmful in the first place; see the toy probe sketch after this list).
"Debugging mysterious behaviour"—Might be interesting, and might help marginally to get a better understanding, but this is not very central for me.
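For what it's worth, here is a minimal sketch of the kind of linear activation probe I have in mind. This is not Anthropic's exact setup: the "activations" are random placeholders standing in for residual-stream activations cached from a model on flagged vs. benign prompts, and the class separation is injected artificially just so the code runs end to end.

```python
# Toy linear probe sketch. Assumption: in a real experiment, `benign` and
# `deceptive` would be activation vectors cached from a fixed layer of the
# model on "will defect" vs. "normal" prompts; here they are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_per_class = 512, 200

benign = rng.normal(0.0, 1.0, size=(n_per_class, d_model))
deceptive = rng.normal(0.3, 1.0, size=(n_per_class, d_model))  # shifted mean stands in for a real signal

X = np.vstack([benign, deceptive])
y = np.array([0] * n_per_class + [1] * n_per_class)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"Held-out probe accuracy: {probe.score(X_te, y_te):.2f}")
# The open question flagged above: does such a probe still fire on a model
# that was never trained to be harmful in the first place?
```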
OK, thanks a lot, this is much clearer. So basically most humans lose control, but some humans keep control.
And then we have this meta-stable equilibrium that might be sufficiently stable, where humans at the top are feeding the other humans with some kind of UBI.
Is this situation desirable? Are you happy with such a course of action?
Is this situation really stable?
For me, this is not really desirable—power is probably going to be concentrated in 1-3 people, there is huge potential for value lock-in, those CEOs become immortal, we potentially lose democracy (I don't see companies or the US/China governments as particularly democratic right now), and the people at the top potentially become progressively corrupted, as is often the case. Hmm.
Then, is this situation really stable?
If alignment is solved and we have 1 human at the top—pretty much yes, even if revolutions/value drift of the ruler/craziness are somewhat possible at some point maybe?
If alignment is solved and we have multiple humans competing with their AIs—it depends a bit. It seems to me that we could conduct the same reasoning as above, but at the level of countries rather than organizations: just as Company B might outcompete Company A by ditching human workers, couldn't Nation B outcompete Nation A if Nation A dedicates significant resources to UBI while Nation B focuses purely on power? There is also a potential race to the bottom.
And I'm not sure that cooperation and coordination in such a world would be so much improved: even if the dictator listens to his aligned AI, we need a very strong notion of alignment to be able to affirm that all the AIs are going to advocate for "COOPERATE" in the prisoner's dilemma and that all the dictators are going to listen—though at the same time it's not that costly to cooperate, as you said (even if I'm not sure that energy, land, and rare resources are really that cheap to keep providing for humans).
But at least I think that I can see now how we could still live for a few more decades under the authority of a world dictator/pseudo-democracy while this was not clear for me beforehand.
Thanks for continuing to engage. I really appreciate this thread.
“Feeding humans” is a pretty low bar. If you want humans to live as comfortably as today, this would be more like 100% of GDP—modulo the fact that GDP is growing.
But more fundamentally, I'm not sure the correct way to discuss resource allocation is to think at the civilization level rather than at the company level. Let's say that we have:
Company A that is composed of a human (price $5k/month) and 5 automated-humans (price of inference $5k/month let’s say)
Company B that is composed of 10 automated-humans ($10k/month)
It seems to me that if you are an investor, you will give your money to B. In the long term, B is much more competitive, earns more money, is able to reduce its prices, nobody buys from A, and B invests this money into more automated-humans and crushes A, which goes bankrupt. Even if alignment is solved and the human listens to his AIs, it's hard to be competitive.
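To make the toy arithmetic explicit, here is a small sketch, assuming the $5k/month of inference covers all five of Company A's automated workers (i.e. $1k per automated worker for both companies); these numbers are only illustrative.

```python
# Toy version of the Company A vs Company B comparison above, using the
# illustrative prices from my comment.
human_salary = 5_000      # $/month for one human
automated_cost = 1_000    # $/month for one automated-human (assumption)

company_a_workers = 1 + 5                      # one human plus five automated-humans
company_a_cost = human_salary + 5 * automated_cost

company_b_workers = 10                         # all automated
company_b_cost = 10 * automated_cost

for name, workers, cost in [("A", company_a_workers, company_a_cost),
                            ("B", company_b_workers, company_b_cost)]:
    print(f"Company {name}: {workers} workers for ${cost:,}/month "
          f"-> ${cost / workers:,.0f} per worker")
# Same $10k/month budget, but B buys ~1.7x more labor, and the gap widens
# as inference gets cheaper while human salaries do not.
```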
P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30% --> You need to weight this by the probability that Anthropic is actually first, and by whether the other companies refrain from creating AGI once Anthropic has already created it; by default, they don't.
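A toy decomposition of what I mean: the 15% and 30% are the conditionals from the claim I'm responding to; the probability that Anthropic actually wins the race is a made-up illustrative number; and the sketch doesn't even model the further point that other companies may still build AGI after Anthropic does, which would shrink the gap even more.

```python
# Toy decomposition of the claim above. p_doom_anthropic and p_doom_other are
# the conditionals from the original claim; p_anthropic_first is illustrative.
p_doom_anthropic = 0.15   # P(doom | Anthropic builds AGI first)
p_doom_other = 0.30       # P(doom | some other company builds AGI first)
p_anthropic_first = 0.3   # P(Anthropic wins the race | Anthropic races) - made up

# World 1: Anthropic races, but is only sometimes first.
p_doom_if_anthropic_races = (p_anthropic_first * p_doom_anthropic
                             + (1 - p_anthropic_first) * p_doom_other)

# World 2: Anthropic stays out, so some other company is first.
p_doom_if_anthropic_abstains = p_doom_other

print(f"P(doom | Anthropic races)    = {p_doom_if_anthropic_races:.3f}")
print(f"P(doom | Anthropic abstains) = {p_doom_if_anthropic_abstains:.3f}")
print(f"difference = {p_doom_if_anthropic_abstains - p_doom_if_anthropic_races:.3f}")
# The unconditional difference (~4.5 points here) is much smaller than the
# naive 30% - 15% comparison suggests.
```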