Charbel-Raphael Segerie
https://crsegerie.github.io/
Living in Paris
Couldn’t we privately ask Sam Altman, “I would do X if Dario and Demis also commit to the same thing”?
Seems like the obvious thing one might like to do if people are stuck in a race and cannot coordinate.
X could be implementing some mitigation measures, supporting some piece of regulation, or just coordinating to tell the president that the situation is dangerous and we really do need to do something.
What do you think?
It seems like conditional commitments have already been useful in other industries—Claude
Regarding whether similar private “if-then” conditional commitments have worked in other industries:
Yes, conditional commitments have been used successfully in various contexts:
International climate agreements often use conditional pledges—countries commit to certain emission reductions contingent on other nations making similar commitments
Industry standards adoption—companies agree to adopt new standards if their competitors do the same
Nuclear disarmament treaties—nations agree to reduce weapons stockpiles if other countries make equivalent reductions
Charitable giving—some major donors make pledges conditional on matching commitments from others
Trade agreements—countries reduce tariffs conditionally on reciprocal actions
The effectiveness depends on verification mechanisms, trust between parties, and sometimes third-party enforcement. In high-stakes competitive industries like AI, coordination challenges would be significant but not necessarily insurmountable with the right structure and incentives.
(Note: this is different from the “if-then” commitments proposed by Holden, which are more of the form “if we cross capability threshold X, then we need to implement mitigation Y”.)
Well done—this is super important. I think this angle might also be quite easily pitchable to governments.
I’m glad we agree “they’d be one of the biggest wins in AI safety to date.”
“Implement shutdown ability” would not in fact be operationalized in a way which would ensure the ability to shutdown an actually-dangerous AI, because nobody knows how to do that
How so? It’s pretty straightforward if the model is still contained in the lab.
“Implement reasonable safeguards to prevent societal-scale catastrophes” would in fact be operationalized as checking a few boxes on a form and maybe writing some docs, without changing deployment practices at all
I think ticking boxes is good. This is how we went to the Moon, and it’s much better to do this than to not do it. It’s not trivial to tick all the boxes. Look at the number of boxes you need to tick if you want to follow the Code of Practice of the AI Act or this paper from DeepMind.
we simply do not have a way to reliably tell which models are and are not dangerous
How so? I think capability evaluations are much simpler than alignment evals, and at the very least we can run those. You might say: “A model might sandbag.” Sure, but you can fine-tune it and see whether the capabilities are recovered. If even with some fine-tuning the model cannot do the tasks at all, then, modulo the problem of gradient hacking (which I think is very unlikely), we can be pretty confident the model is not capable of such a feat. At the very least, following the same methodology as Anthropic’s latest system cards is pretty good and would be very helpful.
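To make that concrete, here is a minimal sketch of the kind of elicitation check I have in mind. The `eval_fn` and `elicit_fn` callables are hypothetical placeholders for a lab’s actual eval harness and fine-tuning pipeline, and the threshold is purely illustrative:

```python
from typing import Any, Callable

# Illustrative threshold only; real thresholds would be eval-specific.
DANGER_THRESHOLD = 0.5


def capability_is_present(
    model: Any,
    eval_fn: Callable[[Any], float],   # runs the capability evals, returns a score in [0, 1]
    elicit_fn: Callable[[Any], Any],   # fine-tunes the model on task-relevant data
    threshold: float = DANGER_THRESHOLD,
) -> bool:
    """Return True if the model shows the capability, either directly
    or after fine-tuning-based elicitation (to counter sandbagging)."""
    # Step 1: run the capability evals on the unmodified model.
    if eval_fn(model) >= threshold:
        return True

    # Step 2: the model might be sandbagging, so fine-tune it on
    # task-relevant data and re-run the same evals.
    elicited_model = elicit_fn(model)

    # If even the fine-tuned model cannot do the tasks (and absent
    # gradient hacking), treat the capability as not present.
    return eval_fn(elicited_model) >= threshold
```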
You really think those elements are not helpful? I’m really curious.
Then there’s the AI regulation activists and lobbyists. [...] Even if they do manage to pass any regulations on AI, those will also be mostly fake
SB1047 came pretty close to being something really helpful. The AI Act and its code of practice might be insufficient, but there are good elements in it that, if applied, would reduce the risks. The problem is that it won’t be applied in cases of internal deployment.
But I sympathise somewhat with stuff like this:
They lobby and protest and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People.
I disagree: short timelines do devalue long-horizon research, at least a bit, and I think in practice this reduces its usefulness by probably a factor of 10.
Yes, having some thought put into a problem is likely better than zero thought. Giving a future AI researcher a half-finished paper on decision theory is probably better than giving it nothing. The question is how much better, and at what cost?
Opportunity Cost is Paramount: If timelines are actually short (months/few years), then every hour spent on deep theory with no immediate application is an hour not spent on:
Empirical safety work on existing/imminent systems (LLM alignment, interpretability, monitoring).
Governance, policy, coordination efforts.
The argument implicitly assumes that agent foundations and other moonshots offer the highest marginal value as a seed compared to progress in these other areas. I am highly skeptical of this.
Confidence in Current Directions: How sure are we that current agent foundations research is even pointing the right way? If it’s fundamentally flawed or incomplete, seeding future AIs with it might be actively harmful, entrenching bad ideas. We might be better off giving them less, but higher-quality, guidance, perhaps focused on methodology or verification.
Cognitive Biases: Could this argument be motivated reasoning? Researchers invested in long-horizon theoretical work naturally seek justifications for its continued relevance under scenarios (short timelines) that might otherwise devalue it. This “seeding” argument provides such a justification, but its strength needs objective assessment, not just assertion.
Congrats Lucie! I wish more people were as determined to contribute to saving the world as you. Kudos for the Responsible Adoption of General Purpose AI Seminar and for AI Safety Connect, which were really well organized and quite impactful. Few people can say they’ve raised awareness among around a hundred policymakers. We need to help change the minds of many more policymakers, and organizing events like this seems like one of the most cost-effective ways to do this, so I think we should really implement this at scale.
Strong upvoted. I consider this topic really important.
My guess is that most of the reasons are historical ones that shouldn’t hold today. In the past, politics was the mind-killer on this platform, and it might still be, but progress can be made, and I think this progress is almost necessary for us to be saved:
The AI Act and its code of practice
SB1047 was very close to being a major success
The Seoul Summit during which major labs committed to publishing their safety and security frameworks
What’s the plan otherwise? Have a pivotal act from OpenAI, Anthropic, or Google? I don’t want this approach; it seems completely undemocratic honestly, and I don’t think it’s technically feasible.
I think the right Schelling point is a treaty on the non-development of superintelligence (as advocated at aitreaty.org, or this one). That’s the only reasonable option.
I think the real argument is that there are very few technical people willing to reconsider their careers, or they don’t know how to do it, or there isn’t enough training available. Beyond entry-level courses like BlueDot or AI Safety Colab, good advanced training is limited. Only Horizon Institute, Talos Network, and MATS (which accepts approximately 10 people per cohort), plus ML4Good (which is soon transitioning from a technical bootcamp to a governance one), offer resources to become proficient in AI Governance.
Here’s more detail on my position: https://x.com/CRSegerie/status/1907433122624622824
Happy to have a dialogue on the subject with someone who disagrees.
Although many AI alignment projects seem to rely on offense/defense balance favoring defense
Why do you think this is the case?
I really liked the format and content of this post. This is very very central, and I would be happy to see much more discussion about the strengths and weaknesses of all those arguments.
Coming back to the position paper, I think this was a really solid contribution. Thanks a lot.
I think, in the policy world, perplexity will never be fashionable.
Training compute maybe, but if so, how do we ban Llama 3? It’s already too late.
If so, the only policy I see is a red line at full ARA (autonomous replication and adaptation).
And we need to pray that this is sufficient, and that the buffer between ARA and takeover is large enough. I think it is.
Indeed. We are in trouble, and there is no plan as of today. We are soon going to blow past autonomous replication, and then adaptation and R&D. There are almost no remaining clear red lines.
In light of this, we appeal to the AI research and policy communities to quickly increase research into and funding around this difficult topic.
Hmm, unsure. Honestly, I don’t think we need much more research on this. What kind of research are you proposing? I think the only sensible policy I see for open-source AI is that we should avoid models that are able to do AI R&D in the wild, and a clear Schelling point for this is stopping before full ARA. But we definitely need more advocacy.
I found this story beautifully written.
I’m questioning the plausibility of this trajectory. My intuition is that merged humans might not be competitive or flexible enough in the long run.
For a line of humans to successfully evolve all the way to hiveminds as described, AI development would need to be significantly slowed or constrained. My default expectation is that artificial intelligence would likely bootstrap its own civilization and technology independently, leaving humanity behind rather than bringing us along on this gradual transformation journey.
I’m the founder of CeSIA, the French Center for AI Safety.
We collaborated with, advised, or gave interviews to 9 French YouTubers, with one video reaching more than 3.5 million views in a month. Given that about half of French people watch YouTube (roughly 34 million of France’s ~68 million inhabitants), that one video reached almost 10% of the French YouTube audience, which might be more than any AI safety video in any other language.
We think this is a very cost-effective strategy, and we encourage other organisations and experts in other countries to do the same.
or obvious-to-us ways to turn chatbots into agents, are very much not obvious to them
I think that’s also, surprisingly, not obvious to many policymakers and many people in industry. I have given introductory presentations on AI risks at various institutions, and the audiences were not at all familiar with the idea of scaffolding.
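For readers who haven’t seen the term, here is a minimal sketch of what “scaffolding” means in this context: a small outer loop of ordinary code that repeatedly calls a chat model, executes the tool commands it proposes, and feeds the results back. The `call_model` and `run_tool` callables below are hypothetical placeholders rather than any particular API; the point is just how little code separates a chatbot from a simple agent.

```python
from typing import Callable


def minimal_agent_loop(
    call_model: Callable[[list[dict]], str],  # wrapper around any chat-completion API
    run_tool: Callable[[str], str],           # executes a tool command (shell, browser, ...) and returns its output
    goal: str,
    max_steps: int = 10,
) -> str:
    """A bare-bones scaffold: ask the model what to do next, execute its
    tool calls, and feed the observations back into the conversation."""
    messages = [{
        "role": "user",
        "content": f"Goal: {goal}. Reply with TOOL:<command> or DONE:<answer>.",
    }]
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        if reply.startswith("TOOL:"):
            observation = run_tool(reply[len("TOOL:"):].strip())
            messages.append({"role": "user", "content": f"Tool output: {observation}"})
    return "Step limit reached without finishing."
```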
Agreed, this could be much more convincing. We still have a few shots, but I still think nobody will care even with a much stronger version of this particular warning shot.
Coming back to this comment: we got a few clear examples, and nobody seems to care:
“In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning.”—Anthropic, in the Alignment Faking paper.
This time we caught it. Next time, maybe we won’t be able to.
X could also be agreeing to sign a public statement about the need to do something or whatever.