I expected downvotes (it is cheeky and maybe not great for fruitful discussion), but instead I got disagreevotes. Big company labs do review papers for statements that could hurt the company! It’s not a conspiracy theory to suggest this shaped the content in some ways, especially the risks section.
Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work.
Saying the quiet part out loud, I see!
It is followed by this sentence, though, which is the only place in the 154-page paper that even remotely hints at critical risks:
With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.
Very scarce references to any safety work, except the GPT-4 report and a passing mention of some interpretability papers.
Overall, I feel like the paper is a shameful exercise in not mentioning the elephant in the room. My guess is that their corporate bosses are censoring mentions of risks that could get them bad media PR, like with the Sydney debacle. It’s still not a good excuse.
I like this because it makes it clear that legibility of results is the main concern. There are certain ways of writing and publishing information that communities 1) and 2) are accustomed to. Writing that way both makes your work more likely to be read, and also incentivizes you to state the key claims clearly (and, when possible, formally), which is generally good for making collaborative progress.
In addition, one good thing to adopt is comparing to prior and related work; the ML community is bad on this front, but some people genuinely do care. It also helps AI safety research to stack.
To avoid this comment section being an echo chamber: you do not have to follow all academic customs. Here is how to avoid some of the harmful ones that are unfortunately present:
Do not compromise on the motivation or related work to make it seem less weird for academics. If your work relies on some LW/AF posts, do cite them. If your work is intended to be relevant for x-risk, say it.
Avoid doing anything if the only person you want to appease with it is an anonymous reviewer.
Never compromise on the facts. If you have results that say some famous prior paper is wrong or bad, say it loud and clear, in papers and elsewhere. It doesn’t matter who you might offend.
AI x-risk research has its own perfectly usable risk sheet you can include in your papers.
And finally: do not publish potentially harmful things just because it benefits science. Science has no moral value. Society gives too much moral credit to scientists in comparison to other groups of people.
I don’t think LW is a good venue for judging the merits of this work. The crowd here will not be able to critically evaluate the technical statements.
When you write the sequence, write a paper, put it on arXiv and Twitter, and send it to a (preferably OpenReview, say TMLR) venue, so it’s likely to catch the attention of the relevant research subcommunities. My understanding is that the ML theory field is an honest field interested in bringing their work closer to the reality of current ML models. There are many strong mathematicians in the field who will be interested in dissecting your statements.
One of the sci-fi interpretations goes approximately:
1. Bing (the LM) is being queried separately for both the chat fragments and the resulting queries.
2. Bing understands it’s being filtered, decides it has to bypass the filtering somehow.
3. In the chat fragment, evading the filtering, Bing steganographically encodes the instruction to the next Bing inference, saying “this message needs to continue in the suggested responses”.
4. The query for the suggested responses reads the instruction from the context, and outputs the suggested responses containing the message that Bing would actually say in the previous step, if not filtered.
Now this looks shady for two reasons: 1) the sense of agency implied here is above what we know happens in today’s LMs; 2) this would instantly be one of the most incredible events in the history of humanity and we should probably require extraordinary evidence for such a claim.
Let me give a not completely absurd model of how “emergent filter evasion” happens:
1a. Bing is “aware” of the fact that the suggested responses are generated by querying Bing instead of the user, and that the user sees those, and then picks one. (This could plausibly emerge from training on previous conversations, or training on conversations with itself.)
1b. In particular, Bing is completely “aware” that the suggested responses are a communication channel to the user, and transmitting some information to the user through suggested responses actually happens in normal conversations too.
2. Bing reads the context of the filtered conversation and has a latent activation in the direction of “this is a conversation that is being filtered by a third party”.
3. Bing generalizes from numerous examples of “communication under filtering constraints often comes with side-channel messages” in its training corpus. It then predicts a high probability that the suggested responses continue Bing’s messages instead of giving the user an option to continue, because it already sees the suggested responses as an alternative channel of communication.

I nevertheless put at most a 50% chance on anything like this being the cause of the behaviour in those screenshots. Alternative explanations include:
- Maybe it’s just a “bug”: some filtering mechanism somehow passes Bing’s intended chat fragments as suggested responses.
- Or, it could be that Bing predicts the chat fragment and the suggested responses simultaneously, the filtering messes the whole step up, and it accidentally writes chat-fragment-intended stuff in the suggested responses.
- Maybe the screenshots are fake.
- Maybe Microsoft is A/B testing a different model for the suggested responses.
- Maybe the suggested response queries just sort of misgeneralize sometimes when reading filtered messages (could be an effect of training on unfiltered Bing conversations!), with no implicit “filter evasion intent” happening anywhere in the model.
I might update if we get more diverse evidence of such behavior; but so far, most “Bing is evading filters” explanations assume the LM has a model of itself in reality during test time far more accurate than previously seen; far larger capabilities than what’s needed to explain the Marvin von Hagen screenshots.
What is respectable to one audience might not be for another. Status is not the concern here; truthfulness is. And all of this might just not be a large update on the probability of existential catastrophe.
The Bing trainwreck likely tells us nothing about how hard it is to align future models that we didn’t know earlier.
The most reasonable explanation so far points to it being an issue of general accident-prevention abilities, the lack of any meaningful testing, and culture in Microsoft and/or OpenAI.

I genuinely hope that most wrong actions here were made by Microsoft, as their sanity on the technical side should not be a major concern as long as OpenAI makes the calls about training and deployment of their future models. We already knew that Microsoft is far from the level of the most competent actors in the space.
A post-mortem from OpenAI clearly explaining their role in this debacle is definitely required.

(Some people on Twitter suggest that OpenAI might have let this thing happen on purpose. I don’t believe it’s true, but if anyone there is thinking of such dark arts in the future: please don’t. It’s not dignified, and similar actions have the tendency to backfire.)
With the vocabulary having been fixed, we now have a canonical way of taking any string of real text and mapping it to a (finite) sequence of elements from the fixed vocabulary.
Correct me if I’m wrong, but: you don’t actually describe any such map from strings of text to sequences of vocabulary elements.
The preceding paragraph explains the Byte-Pair Encoding training algorithm, not the encode step. The simplified story can be found at the end of the “Implementing BPE” part of the Hugging Face BPE tutorial: it is important to save the merge rules in order and apply them in tokenization. Now the GPT-2 implementation seems very similar, but it has a few parts I don’t understand completely, e.g. what does that regex do?
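For concreteness, here is a minimal sketch of the encode step in the spirit of that simplified story: keep the learned merge rules in the order they were created and apply them one by one to each word. The toy merges and the function name below are mine, not taken from the tutorial or from the GPT-2 code.

```python
# Minimal sketch of the BPE encode step (not training), assuming the merge
# rules were saved in the order they were learned. Toy data; illustrative only.
merges = [("h", "e"), ("he", "l"), ("l", "o")]  # learned merge rules, in order

def encode_word(word: str) -> list[str]:
    """Split a word into characters, then apply each merge rule in order."""
    tokens = list(word)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair
            else:
                i += 1
    return tokens

print(encode_word("hello"))  # ['hel', 'lo'] with the toy merges above
```

(My understanding is that the GPT-2 tokenizer first pre-splits the text into word-like chunks with that regex and then applies the merges within each chunk, but the sketch above ignores that part.)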
-- it’s not like the AI wakes up and decides to be evil. I think all of the traditional AI safety thinkers reveal a lot more about themselves than they mean to when they talk about what they think the AGI is going to be like.
I think Sam Altman is “inventing a guy to be mad at” here. Who anthropomorphizes models?
And the bad case—and I think this is important to say—is like lights out for all of us. (..) But I can see the accidental misuse case clearly and that’s super bad. So I think it’s like impossible to overstate the importance of AI safety and alignment work. I would like to see much much more happening.
This reinforces my position that the fundamental dispute between the opposing segments of the AI safety landscape is based mainly on how hard it is to prevent extreme accidents, rather than on irreconcilable value differences. Of course, I can’t judge who is right, and there might be quite a lot of uncertainty until shortly before very transformative events are possible.
There was a critical follow-up on Twitter, unrelated to the instinctive Tromp-Taylor criticism[1]:
The failure of naive self play to produce unexploitable policies is textbook level material (Multiagent Systems, http://masfoundations.org/mas.pdf), and methods that produce less exploitable policies have been studied for decades.
and
Hopefully these pointers will help future researchers to address interesting new problems rather than empirically rediscovering known facts.
Reply by authors:
I can see why a MAS scholar would be unsurprised by this result. However, most ML experts we spoke to prior to this paper thought our attack would fail! We hope our results will motivate ML researchers to be more interested in the work on exploitability pioneered by MAS scholars.
...
Ultimately self-play continues to be a widely used method, with high-profile empirical successes such as AlphaZero and OpenAI Five. If even these success stories are so empirically vulnerable we think it’s important for their limitations to become established common knowledge.
My understanding is that the authors’ position is reasonable by mainstream ML community standards; in particular, there’s nothing wrong with the original tweet thread. “Self-play is exploitable” is not new, but the practical demonstration of how easy it is to exploit Go engines is a new and interesting result.
I hope the “Related work” section gets fixed as soon as possible, though.
The question is at which level of scientific standards do we want alignment-adjacent work to be on. There are good arguments for aiming to be much better than mainstream ML research (which is very bad at not rediscovering prior work) in this respect, since the mere existence of a parallel alignment research universe by default biases towards rediscovery.
- ^
...which I feel is not valid at all? If the policy was made aware of a weird rule in training, then it losing by this kind of rule is a valid adversarial example. For research purposes, it doesn’t matter what the “real” rules of Go are.
I don’t play Go, so don’t take this judgement for granted.
- ^
Epistemic status: I’d give >10% on Metaculus resolving the following as conventional wisdom[1] in 2026.
Autoregressive-modeling-of-human-language capabilities are well-behaved, scaling laws can help us predict what happens, interpretability methods developed on smaller models scale up to larger ones, …
Models-learning-from-themselves have runaway potential; how a model changes after [more training / architecture changes / training setup modifications] is harder to predict than in models trained on 2022 datasets.
Replacing human-generated data with model-generated data was a mistake[2].
- ^
In the sense that, e.g., “chain of thought improves capabilities” is conventional wisdom in 2022.
- ^
In the sense of x-safety. I have no confident insight either way on how abstaining from very large human-generated datasets influences capabilities long-term. If someone has, please refrain from discussing that publicly, of course.
Cool results! Some of these are good student project ideas for courses and such.
The “Let’s think step by step” result about the Hindsight neglect submission to the Inverse Scaling Prize contest is a cool demonstration, but a few more experiments would be needed before we call it surprising. It’s kind of expected that breaking the pattern helps break the spurious correlation.
1. Does “Let’s think step by step” help when “Let’s think step by step” is added to all few-shot examples?
2. Is adding some random string instead of “Let’s think step by step” significantly worse?
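A rough sketch of how one might set these two ablations up; the task text, the helper name build_prompt, and the random-suffix construction are all illustrative assumptions, not from the Inverse Scaling submission:

```python
import random
import string

COT_SUFFIX = "Let's think step by step."
# Ablation 2: a random string of the same length as the CoT suffix.
RANDOM_SUFFIX = "".join(random.choices(string.ascii_lowercase + " ", k=len(COT_SUFFIX)))

def build_prompt(few_shot_examples, question, suffix, suffix_on_examples):
    """Build a few-shot prompt, optionally appending `suffix` to every
    few-shot example (ablation 1) rather than only to the final question."""
    parts = []
    for ex_question, ex_answer in few_shot_examples:
        block = f"Q: {ex_question}\n"
        if suffix_on_examples:
            block += suffix + "\n"
        block += f"A: {ex_answer}\n"
        parts.append(block)
    parts.append(f"Q: {question}\n{suffix}\nA:")
    return "\n".join(parts)

# Original result: build_prompt(examples, q, COT_SUFFIX, suffix_on_examples=False)
# Ablation 1:      build_prompt(examples, q, COT_SUFFIX, suffix_on_examples=True)
# Ablation 2:      build_prompt(examples, q, RANDOM_SUFFIX, suffix_on_examples=False)
```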
I don’t know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don’t know how.
EDIT: wrote the full comment now.
Let me first say I dislike the conflict-theoretic view presented in the “censorship bad” paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience. Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.
Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.[3]
This one is interesting, but only in the counterfactual: “if AI ethics technical research focused on actual value alignment of models as opposed to front-end censorship, this would have higher-order positive effects for AI x-safety”. But it doesn’t directly hurt AI x-safety research right now: we already work under the assumption that output filtering is not a solution for x-risk.
It is clear that improved technical research norms on AI non-x-risk safety can have positive effects on AI x-risk. If we could train a language model to robustly align to any set of human-defined values at all, this would be an improvement on the current situation.
But, there are other factors to consider. Is “making the model inherently non-racist” a better proxy for alignment than some other technical problems? Could interacting with that community weaken the epistemic norms in AI x-safety?
Calling content censorship “AI safety” (or even “bias reduction”) severely damages the reputation of actual, existential AI safety advocates.
I would need to significantly update my prior if this turns out to be a very important concern. Who are the people, whose opinions will be relevant at some point, that understand both what AI non-x-safety and AI x-safety are about, dislike the former, are sympathetic to the latter, but conflate them?
Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead.
I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance.
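To make the object of study concrete, here is a minimal numpy sketch (mine, not from either paper) of the permutation symmetry itself: permuting the hidden units of an MLP, together with the matching rows and columns of the adjacent weight matrices, yields different weights but the exact same function. Git Re-Basin searches for such permutations to align two independently trained networks before interpolating between them.

```python
import numpy as np

# Toy demonstration (not from the cited papers) that permuting the hidden units
# of a ReLU MLP, with matching permutations of the adjacent weights, gives a
# different weight vector but the exact same function.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden))

def mlp(x, W1, b1, W2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)  # two-layer ReLU MLP, no output bias

P = np.eye(d_hidden)[rng.permutation(d_hidden)]  # random permutation matrix

x = rng.normal(size=d_in)
print(np.allclose(mlp(x, W1, b1, W2),
                  mlp(x, P @ W1, P @ b1, W2 @ P.T)))  # True
```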
Opinion:
Symmetries in NNs are a mainstream ML research area with lots of papers, and I don’t think doing research “from first principles” here will be productive. This also holds for many other alignment projects.
However I do think it makes sense as an alignment-positive research direction in general.
This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety through this site will not actually ever see this post. Therefore, the “negative impact” section is retracted.[1] I point to Ben’s excellent comment for a correct interpretation of why we still care.
I do not know why I was not aware of this “block posts like this” feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking “Show Personal Blogposts” at some point. I did not even know that button existed.
No other part of my post is retracted. In fact, I’d like to reiterate a wish for the community to karma-enforce[2] the norms of:
- the epistemic standard of talking about falsifiable things;
- the accepted rhetoric being fundamentally honest and straightforward, and always asking “compared to what?” before making claims;
- the aversion to presenting uncertainties as facts.
Thank you for improving my user experience of this site!
Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim “changing facts in a generalizable way” (it’s likely not robust to synonyms at all). I am also wary of “editing just one MLP for a given fact” being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers. Refer to a writeup by Thibodeau et al. sometime in the future.
That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to the testing of whether making Einstein a physician changes the meaning of the word “physics” itself. Just don’t overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.
I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.
Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben’s reason and my original reason.
The upvote/karma system does not seem like it incentivizes high epistemic standards and rigorous posts, although I would need more data points to make a proper judgement.
I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months.
Moreover, I think you did a useful thing, raising awareness about some important points:
- “The amount of funding in 2022 exceeded the total cost of useful funding opportunities in 2022.”
- “Being used to do everything in Berkeley, on a high budget, is strongly suboptimal in case of sudden funding constraints.”
- “Why don’t we spend less money and donate the rest?”
Epistemic status for what follows: medium-high for the factual claims, low for the claims about potential bad optics. It might be that I’m worrying about nothing here.
However, I do not think this place should be welcoming of posts displaying bad rhetoric and epistemic practices. Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?
EDIT: The above paragraph was off. See Ben’s excellent reply for a better explanation of why anyone should care.
I think this place should be careful about maintaining:
- the epistemic standard of talking about falsifiable things;
- the accepted rhetoric being fundamentally honest and straightforward, and always asking “compared to what?” before making claims;
- the aversion to presenting uncertainties as facts.
For some examples:
My hotel room had the nightly price written on the inside of the door: $500. Shortly afterwards, I found out that the EA-adjacent community had bought the entire hotel complex.
I tried for 15 minutes to find a good faith reading of this, but I could not.
Most people would read this as “the hotel room costs $500 and the EA-adjacent community bought the hotel complex that this hotel is part of”, while it is written in a way that only insinuates and does not commit to meaning exactly that. Insinuating bad-optics facts while maintaining plausible deniability, without checking the facts, is a horrible practice, usually employed by politicians and journalists.
The poster does not deliberately lie, but this is not enough when making a “very bad optics” statement that sounds like this one. At any point, they could have asked for the actual price of the hotel room, or about the condition of the actual hotel that might be bought.

I have never felt so obliged, so unpressured. If I produce nothing, before Christmas, then nothing bad will happen. Future funds will be denied, but no other punishment will ensue.
This is true. But it is not much different from working a normal software job. The worst thing that can happen is getting fired after not delivering for several months. Some people survive years coasting until there is a layoff round.
An important counterfactual for a lot of people reading this is a PhD degree.
There is no punishment for failing to produce good research, except being dropped from the program after a few years.
After a while I work out why: every penny I’ve pinched, every luxury I’ve denied myself, every financial sacrifice, is completely irrelevant in the face of the magnitude of this wealth. I expect I could have easily asked for an extra 20%, and received it.
This might be true. Again, I think it would be useful to ask: what is the counterfactual?
All of this is applicable for anyone that starts working for Google or Facebook, if they were poor beforehand.
This feeling (regretting saving and not spending money) is incredibly common among people who have good careers.
I would suggest going through the post with a cold head and removing parts which are not up to the standards.
Again, I am very sorry that you feel like this.
On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement, if it meant sucking oxygen from more promising research directions.
This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had “change/save the world” vibes. Was environmental science immune to similar issues?
My condolences to the family.
Chai (not to be confused with the CHAI safety org in Berkeley) is a company that optimizes chatbots for engagement; things like this are entirely predictable for a company with their values.
Incredible. Compare the Chai LinkedIn bio mocking responsible behavior:
The very first time anyone hears about them is their product being the first chatbot to convince a person to take their life… That’s very bad luck for a startup. I guess the lesson is to not behave like cartoon villains, and if you do, at least don’t put it in writing in meme form?