Charbel-Raphael Segerie
https://crsegerie.github.io/
Living in Paris
Yup, we should create an equivalent of the Nutri-Score for different recommendation AIs.
“I really don’t know how tractable it would be to pressure companies” seems weirdly familiar. We already used the same argument for AGI safety, and we now know that governance work is much more tractable than expected.
I’m a bit surprised this post has so little karma and engagement. I would be really interested to hear from people who think this is a complete distraction.
Fair enough.
I think my main problem with this proposal is that, under the current paradigm of AIs (GPTs, foundation models), I don’t see how you would implement ATA, so this doesn’t really seem like a priority?
I believe we should not create a Sovereign AI. Developing a goal-directed agent of this kind will always be too dangerous. Instead, we should aim for a scenario similar to CERN, where powerful AI systems are used for research in secure labs, but not deployed in the economy.
I don’t want AIs to take over.
Thank you for this post and study. It’s indeed very interesting.
I have two questions:
In what ways is this threat model similar to or different from learned steganography? It seems quite similar to me, but I’m not entirely sure.
If it can be related to steganography, couldn’t we apply the same defenses as for steganography, such as paraphrasing, as suggested in this paper? If paraphrasing is a successful defense, we could use it in the control setting in the lab, although it might be cumbersome to apply paraphrasing for all users in the API.
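As a toy illustration of why paraphrasing is a plausible defense, here is a minimal self-contained sketch. It is entirely hypothetical (a covert channel hidden in synonym choice, not any real model or the paper's actual setup): the sender encodes bits in which synonym it picks, and a trusted paraphraser that re-samples synonyms preserves the surface meaning while destroying the hidden bits.

```python
import random

# Entirely hypothetical toy: an untrusted sender hides bits in synonym
# choice; a trusted paraphraser re-samples synonyms, preserving surface
# meaning but destroying the covert channel.
SYNONYMS = [("big", "large"), ("fast", "quick"),
            ("smart", "clever"), ("happy", "glad")]

def encode(bits):
    # bit 0 -> first synonym of each pair, bit 1 -> second
    return [pair[b] for pair, b in zip(SYNONYMS, bits)]

def decode(words):
    return [pair.index(w) for pair, w in zip(SYNONYMS, words)]

def paraphrase(words, rng):
    # Look up each word's synonym pair and re-sample uniformly:
    # meaning is preserved, the encoded bit is randomized.
    return [rng.choice(next(p for p in SYNONYMS if w in p)) for w in words]

rng = random.Random(0)
bits = [1, 0, 1, 1]
msg = encode(bits)
assert decode(msg) == bits            # covert channel works without the defense
noisy = decode(paraphrase(msg, rng))  # after paraphrasing, recovered bits are ~random
```

Each paraphrased word carries no information about the original bit, so the receiver's per-bit recovery rate drops to chance; the same intuition is why paraphrasing is proposed as a defense in the steganography setting.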
Interesting! Is it fair to say that this is another attempt at solving a subproblem of misgeneralization?
Here is one suggestion for clustering your SAE features between gender and profession more automatically.
In the past, Stuart Armstrong with alignedAI also attempted work aimed at identifying different features within a neural network in such a way that the network would generalize better. Here is a summary of a related paper, the DivDis paper, which is very similar to what alignedAI did:
The DivDis paper presents a simple algorithm to solve these ambiguity problems in the training set. DivDis uses multi-head neural networks, and a loss that encourages the heads to use independent information. Once training is complete, the best head can be selected by testing all different heads on the validation data.
DivDis achieves 64% accuracy on the unlabeled set when training on a subset of human_age, and 97% accuracy on the unlabeled set of human_hair. GitHub: https://github.com/yoonholee/DivDis
I have the impression that you could also use DivDis by training probes on the latent activations of the SAEs and then applying Stuart Armstrong’s technique to decorrelate the different spurious correlations. Either of these two algorithms would significantly reduce the manual work required to partition the features in your SAEs, yielding two clusters of features, obtained in an unsupervised way, that would here correspond to gender and profession.
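To make this concrete, here is a minimal NumPy sketch of a DivDis-style objective for multiple linear probes over SAE latents: average cross-entropy on the labeled data, plus a pairwise mutual-information penalty that pushes the heads to rely on independent information. The setup (linear heads over latent activations, the function names, the dimensions) is my own illustrative assumption, not code from the DivDis paper or repo.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def divdis_loss(heads, x_lab, y_lab, x_unlab, alpha=1.0):
    """DivDis-style objective (illustrative sketch): average cross-entropy
    of each linear head on labeled data, plus a pairwise mutual-information
    penalty on unlabeled data that encourages independent head predictions."""
    K = len(heads)
    # labeled cross-entropy, averaged over heads
    ce = 0.0
    for W in heads:
        p = softmax(x_lab @ W)
        ce += -np.log(p[np.arange(len(y_lab)), y_lab] + 1e-12).mean()
    ce /= K
    # diversity term: mutual information between each pair of heads,
    # estimated from the soft joint distribution over predicted classes
    preds = [softmax(x_unlab @ W) for W in heads]
    mi, pairs = 0.0, 0
    for i in range(K):
        for j in range(i + 1, K):
            joint = preds[i].T @ preds[j] / len(x_unlab)  # (C, C) soft joint
            pi = joint.sum(axis=1, keepdims=True)          # marginal of head i
            pj = joint.sum(axis=0, keepdims=True)          # marginal of head j
            mi += (joint * np.log(joint / (pi @ pj) + 1e-12)).sum()
            pairs += 1
    return ce + alpha * mi / max(pairs, 1)

# Tiny demo on random data: two 2-class heads over 8-dim "SAE latents".
rng = np.random.default_rng(0)
x_lab = rng.normal(size=(50, 8))
y_lab = rng.integers(0, 2, size=50)
x_unlab = rng.normal(size=(200, 8))
heads = [rng.normal(size=(8, 2)) * 0.1 for _ in range(2)]
loss = divdis_loss(heads, x_lab, y_lab, x_unlab)
```

Minimizing this loss with any gradient-based optimizer would, in the spirit of DivDis, produce one head per disentangled factor (here, hopefully, gender vs. profession), after which the best head can be selected on validation data.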
Here is the YouTube video from the Guaranteed Safe AI Seminars:
It might not be that impossible to use LLMs to automatically train wisdom:
Look at this: “Researchers have utilized Nvidia’s Eureka platform, a human-level reward design algorithm, to train a quadruped robot to balance and walk on top of a yoga ball.”
Strongly agree.
Related: It’s disheartening to recognize, but it seems the ML community might not even get past the first crucial step in reducing risks, which is understanding them. We appear to live in a world where most people, including key decision-makers, still don’t grasp the gravity of the situation. For instance, in France, we still hear influential figures like Arthur Mensch, CEO of Mistral, saying things like, “When you write this kind of software, you always control what’s going to happen, all the outputs the software can have.” As long as such individuals are leading AGI labs, the situation will remain quite dire.
+1 for the conflationary alliances point. It is especially frustrating when I hear junior people interchange “AI Safety” and “AI Alignment.” These are two completely different concepts, and one can exist without the other. (The fact that the main forum for AI Safety is the “Alignment Forum” does not help with this confusion). I’m not convinced the goal of the AI Safety community should be to align AIs at this point.
However, I want to make a small amendment to Myth 1: I believe that technical work which enhances safety culture is generally very positive. Examples of such work include scary demos like “BadLlama,” which I cite at least once a week, or benchmarks such as Evaluating Frontier Models for Dangerous Capabilities, which tries to monitor particularly concerning capabilities. More “technical” works like these seem overwhelmingly positive, and I think that we need more competent people doing this.
Strong agree. I think Twitter, and reposting content on other platforms, is still neglected, and this is important for increasing safety culture.
doesn’t justify the strength of the claims you’re making in this post, like “we are approaching a point of no return” and “without a treaty, we are screwed”.
I agree that’s a bit too much, but it seems to me that we’re not at all on the way to stopping open source development, and that we need to stop it at some point; maybe you think ARA is a bit early, but I think we need a red line before AI becomes human-level, and ARA is one of the last arbitrary red lines before everything accelerates.
But I still think that “a point of no return toward loss of control, because it might be very hard to stop an ARA agent” seems pretty fair to me.
Link here, and there are other comments in the same thread. I was on my laptop, which has Twitter blocked, so I couldn’t link it myself before.
I agree with your comment on Twitter that evolutionary forces are very slow compared to deliberate design, but that is not what I wanted to convey (that’s my fault). I think an ARA agent would not only depend on evolutionary forces, but also on the whole open-source community finding practical new ways to quantize, prune, distill, and run the model in a distributed way. I think the main driver of this “evolution” would be the open-source community and libraries that will want to create good “ARA”, and huge economic incentives will make agentic AIs more and more common and easy to run in the future.
Thanks for this comment, but I think this might be a bit overconfident.
constantly fighting off the mitigations that humans are using to try to detect them and shut them down.
Yes, I have no doubt that if humans implement some kind of defense, this will slow down ARA a lot. But:
1) It’s not even clear that people are going to try to react in the first place. As I said, most AI development is positive. If you implement regulations to fight bad ARA, you are also hindering the whole ecosystem, so it’s not clear to me that we are going to do something about open source. You would need a big warning shot beforehand, and it’s not clear to me that this happens before a catastrophic level. It’s clear people will react to some kinds of ARA (like ChaosGPT), but there might be some ARAs they won’t react to at all.
2) It’s not clear that this defense (say, for example, Know Your Customer requirements for providers) is going to be effective enough to completely clean up the whole mess. If the AI is able to hide successfully on laptops and cooperate with some humans, it is going to be really hard to shut down. We would have to live with this endemic virus. The only way around this is cleaning out the virus with some sort of pivotal act, but I really don’t like that.
While doing all that, in order to stay relevant, they’ll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources.
“at the same rate” not necessarily. If we don’t solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop. The real crux is how much time the ARA AI needs to evolve into something scary.
Superintelligences could do all of this, and ARA of superintelligences would be pretty terrible. But for models in the broad human or slightly-superhuman ballpark, ARA seems overrated, compared with threat models that involve subverting key human institutions.
We don’t learn much here. From my side, I think that superintelligence is not going to be neglected, and big labs are taking this seriously already. I’m still not clear on ARA.
Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.
This is not the central point. The central point is:
At some point, ARA is unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die.
The ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it doesn’t. I don’t know.
This may take an indefinite number of years, but it can become a problem.
the “natural selection favors AIs over humans” argument is a fairly weak one; you can find some comments I’ve made about this by searching my twitter.
I’m pretty surprised by this. I’ve tried to Google it and found nothing.
Overall, I think this still deserves more research.
Why not! There are many, many questions that were not discussed here because I just wanted to focus on the core part of the argument. But I agree details and scenarios are important, even if I think they shouldn’t change the basic picture depicted in the OP too much.
Here are some important questions that were voluntarily omitted from the Q&A for the sake of not including stuff that fluctuates too much in my head:
Would we react before the point of no return?
Where should we place the red line? Should this red line apply to labs?
Is this going to be exponential? Do we care?
What would it look like if we used a counter-agent that was human-aligned?
What can we do about it now concretely? Is KYC something we should advocate for?
Don’t you think an AI capable of ARA would be superintelligent and take-over anyway?
What are the short-term bad consequences of early ARA? What does the transition scenario look like?
Is it even possible to coordinate worldwide if we agree that we should?
How much human involvement will be needed in bootstrapping the first ARAs?
We plan to write more about these with @Épiphanie Gédéon in the future, but first it’s necessary to discuss the basic picture a bit more.
Thanks for writing this.
I like your writing style; it inspired me to read a few more things.
Seems like we are here today
I don’t think Tournesol is really mature currently, especially for non-French content, and I’m not sure they try to do governance work; it’s mainly a technical project, which is already cool.