LLMs for Alignment Research: a safety priority?
A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research.
This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming vs technical alignment research feels very different (at least to me). LLMs might generate junk code, but you can keep pointing out the problems with the code, and the code will eventually work. This can be faster than doing it myself, in cases where I don’t know a language or library well; the LLMs are moderately familiar with everything.
When I try to talk to LLMs about technical AI safety work, however, I just get garbage.
I think a useful safety precaution for frontier AI models would be to make them more useful for safety research than capabilities research. This extends beyond applying AI technology to accelerate safety research within top AI labs; models available to the general public (such as GPT-N, Claude-N) should also accelerate safety more than capabilities.
What is wrong with current models?
My experience is mostly with Claude, and mostly with versions of Claude before the current (Claude 3).[1] I’m going to complain about Claude here; but everything else I’ve tried seemed worse. In particular, I found GPT4 to be worse than Claude2 for my purposes.
As I mentioned in the introduction, I’ve been comparing how helpful these models feel for programming with how useless they feel for technical AI safety. Specifically, technical AI safety of the mathematical-philosophy flavor that I usually think about. This is not, of course, a perfect experiment for comparing capability-research-boosting to safety-research-boosting. However, the tasks feel comparable in the following sense: programming involves translating natural-language descriptions into formal specifications; mathematical philosophy also involves translating natural-language descriptions into formal specifications. From this perspective, the main difference is what sort of formal language is being targeted (IE, programming languages vs axiomatic models).
I don’t have systematic experiments to report; just a general feeling that Claude’s programming is useful, but Claude’s philosophy is not.[2] It is not obvious, to me, why this is. I’ve spoken to several people about it. Some reactions:
If it could do that, we would all be dead!
I think a similar mindset would have said this about programming, a few years ago. I suspect there are ways for modern LLMs to be more helpful to safety research in particular which do not also imply advancing capabilities very much in other respects. I’ll say more about this later in the essay.
There’s probably just a lot less training data for mathematical philosophy than for programming.
I think this might be an important factor, but it is not totally clear to me.
Mathematical philosophy is inherently more difficult than programming, so it is no surprise.
This might also be an important factor, but I consider it to be only a partial explanation. What is more difficult, exactly? As I mentioned, programming and mathematical philosophy have some strong similarities.
Problems include a bland, people-pleasing attitude which is not very helpful for research. By default, Claude (and GPT4) will enthusiastically agree with whatever I say, and stick to summarizing my points back at me rather than providing new insights or adding useful critiques. When Claude does engage in more structured reasoning, it is usually wrong and bad. (I might summarize it as “based more on vibes than logic”.)
Is there any hope for better?
As a starting observation: although a given AI technology, such as GPT4, might not meet some safety standards we’d like to impose (eg, transparency/interpretability), its widespread use means we are already forced to gamble on its relative safety. In some weak sense, this gives us a resource: a technology which we can use without increasing risks. This certainly doesn’t imply that any arbitrary use of GPT4 is non-risk-increasing. However, it does suggest approaches involving cautiously harnessing modern AI technology for what it’s good for, without placing it in the driver’s seat.
We’re at a point in history where suddenly many new things are possible; it’s a point where it makes a lot of sense to look around, explore, and see whether you can find a significant way to leverage the new technologies for good. With the technology being so new, I don’t think we should stop at the obvious (EG, give up because chatting with modern LLMs about safety research did not feel fruitful).
Some obvious things to try include better prompting strategies, and fine-tuning models specifically for helping with this sort of work. It might be useful to attach LLMs to theorem-proving assistants and teach the LLMs to (selectively) formalize what the user is trying to reason about as axioms or proofs in the connected formal system.
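To make the theorem-prover idea a little more concrete, here is a minimal sketch of the kind of loop I have in mind. The `formalize` and `check` callables are hypothetical stand-ins for an LLM call and a proof-assistant check; nothing here corresponds to an existing tool or API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class FormalizationResult:
    claim: str             # the user's natural-language claim
    formal_statement: str  # candidate formalization produced by the LLM
    accepted: bool         # did the proof assistant accept it?
    feedback: str          # error message (or confirmation) from the proof assistant

def selectively_formalize(
    claim: str,
    formalize: Callable[[str, str], str],      # (claim, prior feedback) -> formal statement
    check: Callable[[str], Tuple[bool, str]],  # formal statement -> (accepted, message)
    max_rounds: int = 3,
) -> FormalizationResult:
    """The LLM proposes a formalization, the proof assistant critiques it,
    and the critique is fed back to the LLM until something checks (or we give up)."""
    statement, accepted, feedback = "", False, ""
    for _ in range(max_rounds):
        statement = formalize(claim, feedback)
        accepted, feedback = check(statement)
        if accepted:
            break
    return FormalizationResult(claim, statement, accepted, feedback)
```

Most of the interesting design work is hidden in the “selectively” part: deciding which fragments of a meandering research conversation are worth formalizing at all.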
It would also be helpful to simply make a more systematic study of what these models can and cannot help with, relating to AI safety research.
I’ll state some more specific ideas about how to use modern LLMs to benefit safety research towards the end of this essay; there are some more intuitions I want to communicate first.
What follows is my personal vision for how modern LLMs could be more useful for safety research; I don’t want to put overmuch emphasis on it. My main point has already been made: making LLMs comparatively more useful for AI safety work as opposed to AI capabilities work should itself be considered a safety priority.
Against Autonomy
I think there’s a dangerous bias toward autonomy in AI—both in terms of what AI companies are trying to produce, and also in what consumers are asking for. I want to advocate for a greater emphasis on collaborative AI, rather than AI which takes requests and then delivers.
Servant vs Collaborator
Big AI companies are for the most part fine-tuning models to take a prompt and return an answer. This is a pretty reasonable idea, but it sometimes feels like interacting with a nervous intern desperate to prove themselves in their first week on the job.
For example, my brother started a conversation with something like “I’m thinking about making an RPG”. Bing responded with a very reasonable list of things to think about when making an RPG. The problem is that my brother actually had a very specific idea in mind, and the advice was very generic. Simply put, my brother hadn’t finished explaining what he wanted before pressing enter. It would have been more useful for Bing to engage in active listening, asking “What kind of RPG are you interested in making?” or similar conversational questions, and only writing a report giving advice after the general shape of the request was clear. You have to be careful what you say to the nervous intern, because the nervous intern will scurry off and write up a report at the drop of a hat.
Similarly, this video argues that Sudowrite (an AI novel-writing tool) is less useful to authors than NovelCrafter (also an AI novel-writing tool) because Sudowrite’s philosophy is closer to “click a button for the AI to write a novel for you” while NovelCrafter is oriented toward a more collaborative model.
I think there are a few sources of autonomy-bias which I want to point out, here:
Autonomy is often easier to train into AI. For example, to generate whole pictures, you just need a data-set consisting of finished art. More sophisticated image manipulation sometimes requires more complex data-sets which might be more difficult to obtain.
Autonomy is easier to conceive of. Push a button and it does what you want. Collaboration often requires more sophisticated user interfaces and more complex ideas about workflows—perhaps involving domain-specific knowledge about how domain experts actually go about their business.
Autonomy is more appealing to the people in charge of corporate budgets. My brother is currently working as a programmer, and his boss says he can’t wait till the AI is at the point where you just push a button and get the code you asked for. My brother, due to having a closer relationship with the code, has a much more collaborative relationship with the AI. To programmers, the inadequacies of the “just push a button” model are more apparent.
Notions of Alignment
Garrett Baker recently commented:
To my ears it sounded like Shane [Legg]’s solution to “alignment” was to make the models more consequentialist. I really don’t think he appreciates most of the difficulty and traps of the problems here. This type of thinking, on my model of their models, should make even alignment optimists unimpressed, since much of the reason for optimism lies in observing current language models, and interpreting their outputs as being nonconsequentialist, corrigible, and limited in scope yet broad in application.
Let’s set aside whether Garrett Baker’s analysis of Shane Legg is correct. If it were correct, could you really blame him? Someone could be quite up-to-date with the alignment literature and still take the view that “alignment” basically means “value alignment”—which is to say, absorbing human values and then optimizing them. Some of the strongest advocates of alternate ideas like “corrigibility” will still say that progress towards it has stalled and the evidence points toward it being a very unnatural concept.
Simply put, we don’t yet have a strong alternative to agent-centric (autonomy-centric) alignment.
A couple of people I talk to have recently been moving away from the value-alignment picture, replacing it with the following one: aligned AI systems are systems which increase, rather than decrease, the agency of humans. This is called capabilitarianism (in contrast to utilitarianism).[4]
Think of social media vs wikis. Social media websites are attention-sucking machines which cause addictive scrolling. Wikis, such as Wikipedia, are by contrast incredibly useful.
Or think of a nanny state which makes lots of decisions for its citizens on utilitarian grounds, vs a government which emphasizes freedom and informed decision-making, fostering the ability of its citizens to optimize their own lives, rather than doing it for them.
This notion of alignment still lacks the level of clarity which the more consequentialist notion possesses, but it sure seems like there are fewer ways for this kind of vision to go wrong.
The Need for an Independent Safety Approach
OpenAI says:
Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
I think this sort of plan can easily go wrong. Broadly speaking, aiming to take the human out of the loop seems like a mistake. We want to be on a trajectory where humans very much remain in the loop. Of course I don’t think the superalignment team at OpenAI are trying to take humans entirely out of the loop in a broader sense. But I don’t think “automated alignment researcher” should be the way we think about the end goal.
If you are trying to use AI to accelerate alignment work, but your main approach to alignment work is “use AI to accelerate alignment work”—it seems to me that it is easy to miss a certain sort of grounding. You’re solving for X in the equation “use AI to accelerate X”.
Instead, I would propose that people working on LLMs should work to make LLMs useful to alignment researchers whose main approach to alignment IS NOT “make LLMs useful to alignment researchers”.
This prevents the snake from eating its own tail (and thereby killing itself).
Engineer-Researcher Collaboration
My main proposal for making modern LLMs comparatively more useful for AI safety research is to pair AI safety researchers with generative AI engineers. The engineers would try to create tools useful for accelerating safety research, while the safety researchers would provide testing and feedback. This setup also provides some distance between the LLM engineering and the safety work, to avoid the eating-its-own-tail problem. The safety researchers are bringing their own approach to safety work, so that “automating safety research” does not become the whole safety approach.
This could range from a single safety researcher working with a single engineer, to a team of safety researchers working with a team of engineers within one org, all the way up to a whole safety-research org working with a whole engineering org.
Although my intuition here is that it is important for the safety researcher to have their own safety work which they are trying to accelerate using LLMs, it is plausible that most of the impact comes from building tools which are able to help a larger number of safety researchers; for example, the ‘end product’ might be an LLM which has been trained to be a helpful assistant for a broad variety of safety researchers. I therefore imagine this LLM serving as something like a wiki for the AI safety community: like a more sophisticated version of Stampy,[5] where research-oriented conversation styles are also curated, rather than only question-answer pairs.
Aside: Hallucinations
I want to mention my personal model of “AI hallucination”. Here’s a pretty standard example: when I ask Claude or GPT4 for references to papers on a very niche topic, the references it comes up with are usually made up. However, they are generally plausible—I often can’t tell whether they are made up or not before searching for those references myself.
I think there’s a common mindset which sees these hallucinations as some kind of malfunction. And they are, if the purpose of LLMs is seen as delivering truthful information to the user. But if we think of the LLM as a really good prior distribution over what humans might say, then it starts to look less like a malfunction and more like fairly good performance: the details filled in are quite plausible, even if incorrect.
If the prior lacks specific information we want it to have, the thing to do is update on that information. OpenAI and Anthropic provide interfaces where you can give a thumbs-up or thumbs-down; notably, this is a common social-media interface. But this feedback is not nearly rich enough. And it takes an autonomy-centric, reinforcement-learning-like attitude (the AI is learning to please users) rather than putting the users in the pilot seat.
In order to get models to be useful for the sorts of tasks I try to use them for, it seems to me like what’s needed is a way to give specific feedback on specific outputs (such as my own list of references for the topic I queried about) and to update the text-generating distribution in response to this feedback, so that it will be remembered later. (This can be done to varying degrees, and with varying trade-offs, EG using prompt-engineering solutions vs fine-tuning solutions.)
This way, knowledge flows in both directions, and users can build up a shared context with LLMs (including both object-level knowledge, like specific citations, and process-level knowledge, like how verbose/succinct to be).
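As a rough sketch of those two routes, here is one way a specific correction could be recorded and then either injected into future prompts or collected into a small fine-tuning set. The data structure and field names are illustrative only, not any particular vendor's format.

```python
import json
from dataclasses import dataclass

@dataclass
class Correction:
    query: str         # what the user originally asked
    model_output: str  # the output being corrected (e.g. hallucinated citations)
    correction: str    # the user's corrected version (e.g. the real citations)
    note: str = ""     # optional process-level feedback ("be more succinct", etc.)

def as_prompt_context(corrections: list[Correction]) -> str:
    """Prompt-engineering route: prepend remembered corrections to future queries."""
    lines = ["Previously corrected answers (prefer these over your own recall):"]
    for c in corrections:
        lines.append(f"Q: {c.query}\nCorrected A: {c.correction}")
    return "\n\n".join(lines)

def as_finetuning_examples(corrections: list[Correction]) -> str:
    """Fine-tuning route: emit prompt/completion pairs as JSON lines."""
    return "\n".join(
        json.dumps({"prompt": c.query, "completion": c.correction})
        for c in corrections
    )
```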
Updating (on relatively small amounts of text)
A main weakness of deep learning has been how data-hungry it tends to be. For example, deep-learning systems can play Atari games at a human level, but achieving that level of competence requires many more hours of play than humans need. Similar remarks apply to tasks ranging from chess to image recognition.
LLMs require lots and lots of data for the generative pre-training, but once you’ve done that, you’ve got a “really good prior”—my impression is, relatively small amounts of data can be used to fine-tune the model. (Unfortunately, I was not able to quickly find recommended sizes for fine-tuning datasets, so take this with a grain of salt.)
For LLMs above approximately 40 billion parameters,[6] these updates can be quite good, in the sense that new knowledge seems to integrate itself across a broad variety of conversational contexts which the training did not explicitly cover.
My favorite example of this: Claude was trained using a technique called Constitutional AI. I’ve had some extensive conversations with Claude about AI alignment problems. In my experience, whenever AI alignment is involved, Claude tries to shoe-horn Constitutional AI into the conversation as the solution to whatever problem we’re talking about. The arguments for the relevance of Constitutional AI might be incoherent,[7] but Claude’s understanding that Constitutional AI is an alignment idea is coherent, as well as Claude’s enthusiasm for that particular technique.[8]
This was not the intention of Claude’s training. Anthropic simply wanted Claude to know a reasonable amount about itself, so that it could say things like “I’m Claude, an AI designed by Anthropic” and explain some basic facts about how it was made.[6]
More generally, I have found Claude to be enthusiastic/defensive about the more empirical type of safety work which takes place at Anthropic. I’m unable to find the chat in question now, but there was one conversation where it passionately advocated for understanding what a neural network was doing “weight by weight” in contrast to more theoretical approaches.
So, as you can see, consequences of updates might be unintended and undesirable, but they are clearly smart in a significant sense. Concepts are being combined in meaningful ways. This is not “just autocomplete”.
Such smart updates are a double-edged sword. For “the wiki model” of LLMs to work well, it would be helpful to develop tools to search for (possibly unintended & undesirable) consequences of updates.
Note that fine-tuning smaller LLMs, around 8 billion parameters, is feasible for individuals and small groups with modest amounts of money; but fine-tuning models larger than 40 billion parameters, where we see the phenomenon of really smart generalizations from fine-tuning examples, is still out of reach afaik.
Feedback Tools
So: I imagine that for modern LLMs to be very useful for experts in the field of AI safety, some experts will need to spend a lot of time giving LLMs specific feedback. This feedback would include specific information (refining the LLM’s knowledge) as well as training on useful interaction styles for research.
In order to facilitate such feedback, I think it would be important to develop tools which help users rapidly indicate specific problems with text (in contrast to a mere thumbs-up or thumbs-down), and preview how the LLM would adapt based on this feedback, so that the feedback can be tweaked to achieve the desired result.
To give a simple idea for what this could look like: a user might highlight a part of an AI-generated response that they would like to give feedback on. A pop-up feedback box appears, listing some AI-generated potential corrections for the user to select, and also allowing the user to type their own correction. Once a correction has been selected/written, the AI generates some potential amendments to its constitution which would detect this problem and correct it in the future; again the user can look at these and select one or write their own proposed amendment. Finally, the system then generates some examples of the impact the proposed amendment would have (probing for unintended and undesirable consequences). The user can revise the amendment until it has the desired effect, at which point they would finalize it.
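Spelled out as pseudocode, the flow might look something like the sketch below. The `generate`, `choose`, and `approve` callables are hypothetical stand-ins for an LLM call and two user-interaction steps; this is a sketch of the workflow, not an implementation.

```python
from typing import Callable

def feedback_session(
    highlighted_text: str,
    generate: Callable[[str], list[str]],       # stand-in for an LLM call returning candidates
    choose: Callable[[list[str]], str],         # stand-in for the user picking or overwriting one
    approve: Callable[[str, list[str]], bool],  # stand-in for the user inspecting previews
) -> str:
    # 1. Offer candidate corrections for the highlighted passage; the user picks or writes one.
    correction = choose(generate(f"Suggest corrections for: {highlighted_text}"))

    while True:
        # 2. Propose constitutional amendments that would catch this problem in the future.
        amendment = choose(generate(f"Draft an amendment that enforces: {correction}"))

        # 3. Preview the amendment's effects, probing for unintended consequences.
        previews = generate(f"Generate example outputs under this amendment: {amendment}")

        # 4. Finalize if the previewed behaviour matches what the user wanted; otherwise revise.
        if approve(amendment, previews):
            return amendment
```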
I have heard the term “reconstitutional AI” used to point in this general direction.
- ^
My conversations with Claude3 so far do seem somewhat better. However, I suppose that its ability to program has similarly improved.
- ^
Modern LLMs are more useful to beginners than experts.[3] A highly experienced programmer can already easily write the kind of code that LLMs can help with, and with fewer errors. A beginner, however, has much more to gain from LLM assistance. Similarly, then, modern LLMs are probably a lot more helpful to people who are starting to get into AI safety research. It could be that what I’m observing is, really, that I’m a worse programmer than I am a safety researcher.
- ^
Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. Generative AI at work. No. w31161. National Bureau of Economic Research, 2023.
- ^
Some links about this, compiled by TJ:
https://thingofthings.substack.com/p/on-capabilitarianism
https://plato.stanford.edu/entries/capability-approach/
https://philpapers.org/rec/SENCAC
https://arxiv.org/abs/2308.00868
https://www.princeton.edu/~ppettit/papers/Capability_EconomicsandPhilosophy_2001.pdf
- ^
Stampy is a Discord bot which facilitates curated Q&A about AI safety.
- ^
According to private correspondence with a reliable-seeming source.
- ^
Although, no more incoherent than I might expect of some human person who is very enthusiastic about constitutional AI.
- ^
If Claude was trained to explain Constitutional AI factually, but not trained to be actively enthusiastic and push Constitutional AI via motivated arguments… is this an example of defensive reasoning? Did Claude generalize from the observation that people are generally defensive of their own background, arguing for the superiority of their profession or home country? Would Claude more generally try to bend arguments in its own favor, in some sense? Or is this a more benign generalization, perhaps from the idea that a character who explains concept X in depth is probably enthusiastic about concept X?
LLMs aren’t that useful for alignment experts because it’s a highly specialized field and there isn’t much relevant training data. The AI Safety Chatbot partially solves this problem using retrieval-augmented generation (RAG) on a database of articles from https://aisafety.info. There also seem to be plans to fine-tune it on a dataset of alignment articles.
It’s not just from https://aisafety.info/. It also uses Arbital, any posts from the Alignment Forum, LW, and the EA Forum that seem relevant and have a minimum karma, a bunch of arXiv papers, and a couple of other sources. This is a relatively up-to-date list of the sources used (it also contains the actual data).
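(For readers unfamiliar with retrieval-augmented generation, here is a generic sketch of the idea. It is purely illustrative, not the chatbot's actual code; the bag-of-words “embedding” and the `llm` callable are toy stand-ins.)

```python
import math
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer_with_retrieval(question: str, articles: list[str], llm: Callable[[str], str]) -> str:
    # Retrieve the most relevant articles, then condition the model's answer on them.
    ranked = sorted(articles, key=lambda doc: cosine(embed(question), embed(doc)), reverse=True)
    context = "\n\n".join(ranked[:3])
    return llm(
        "Answer using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```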
Seems plausibly true for the alignment specific philosophy/conceptual work, but many people attempting to improve safety also end up doing large amounts of relatively normal work in other domains (ML, math, etc.)
The post is more centrally talking about the very alignment specific use cases of course.
Sounds pretty cool! What LLM powers it?
We’re likely to switch to Claude 3 soon, but currently GPT 3.5. We are mostly expecting it to be useful as a way to interface with existing knowledge initially, but we could make an alternate prompt which is more optimized for being a research assistant brainstorming new ideas if that was wanted.
Would it be useful to be able to set your own system prompt for this? Or have a default one?
I don’t have a good system prompt that I like, although I am trying to work on one. It seems to me like the sort of thing that should be built in to a tool like this (perhaps with options, as different system prompts will be useful for different use-cases, like learning vs trying to push the boundaries of knowledge).
I would be pretty excited to try this out with Claude 3 behind it. Very much the sort of thing I was trying to advocate for in the essay!
DMed a link to an interface which lets you select the system prompt and model (including Claude). This is open to researchers to test, but I’m not posting it fully publicly as it is not very resistant to people who want to burn credits right now.
Other researchers feel free to DM me if you’d like access.
From reading the codebase, it seems to be a LangChain chatbot powered by the default LangChain OpenAI model which is gpt-3.5-turbo-instruct. The announcement blog post also says it’s based on gpt-3.5-turbo.
Wouldn’t other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?
I don’t think people working on alignment research assistants are planning to just turn it on and leave the building, they on average (weighted by money) seem to be imagining doing things like “explain an experiment in natural language and have an AI help implement it rapidly.”
So I think both they and this post are describing the strategy of “building very generally useful AI, but the good guys will be using it first.” I hear you as saying you want a slightly different profile of generally-useful skills to be targeted.
I don’t think the plan is “turn it on and leave the building” either, but I still think the stated goal should not be automation.
I don’t quite agree with the framing “building very generally useful AI, but the good guys will be using it first”—the approach I am advocating is not to push general capabilities forward and then specifically apply those capabilities to safety research. That is more like the automation-centric approach I am arguing against.
Hmm, how do I put this...
I am mainly proposing more focused training of modern LLMs with feedback from safety researchers themselves, toward the goal of safety researchers getting utility out of these systems; this boosts capabilities for helping-with-safety-research specifically, in a targeted way, because that is what you are getting more+better training feedback on. (Furthermore, checking and maintaining this property would be an explicit goal of the project.)
I am secondarily proposing better tools to aid in that feedback process; these can be applied to advance capabilities in any area, I agree, but I think it only somewhat exacerbates the existing “LLM moderation” problem; the general solution of “train LLMs to do good things and not bad things” does not seem to get significantly more problematic in the presence of better training tools (perhaps the general situation even gets better). If the project was successful for safety research, it could also be extended to other fields. The question of how to avoid LLMs being helpful for dangerous research would be similar to the LLM moderation question currently faced by Claude, ChatGPT, Bing, etc: when do you want the system to provide helpful answers, and when do you want it to instead refuse to help?
I am thirdly also mentioning approaches such as training LLMs to interact with proof assistants and intelligently decide when to translate user arguments into formal languages. This does seem like a more concerning general-capability thing, to which the remark “building very generally useful AI, but the good guys will be using it first” applies.
Hey Abram! I appreciate the post. We’ve talked about this at length, but this was still really useful feedback and re-summarization of the thoughts you shared with me. I’ve written up notes and will do my best to incorporate what you’ve shared into the tools I’m working on.
Since we last spoke, I’ve been focusing on technical alignment research, but I will dedicate a lot more time to LLMs for Alignment Research in the coming months.
For anyone reading this: If you are a great safety-minded software engineer and want to help make this vision a reality, please reach out to me. I need all the help I can get to implement this stuff much faster. I’m currently consolidating all of my notes based on what I’ve read, interviews with other alignment researchers, my own notes about what I’d find useful in my research, etc. I’ll be happy to share those notes with people who would love to know more about what I have in mind and potentially contribute.
I think it should be a safety priority.
Currently, I’m attempting to make a modularized snapshot of end-to-end research related to alignment (covering code, math, a number of related subjects, diagrams, and answering Q/As) to create custom data, intended to be useful to future me (and other alignment researchers). If more alignment researchers did this, it’d be nice. And if they iterated on how to do it better.
For example, it’d be useful if your ‘custom data version of you’ broke the fourth wall often and was very willing to assist and over-explain things.
I’m considering going on Lecture-Walks with friends and my voice recorder to world-model dump/explain content so I can capture the authentic [curious questions <-> lucid responses] process
Another thing: It’s not that costly to do so—writing about what you’re researching is already normal, and making an additional effort to be more explicit/lucid/capture your research tastes (and its evolution) seems helpful
Have you tried just copying and pasting an alignment research paper (or other materials) into a base model (or sufficiently base model-like modes of a model) to see how it completes it?
I’ve tried writing the beginning of a paper that I want to read the rest of, but the LLM did not complete it well enough to be interesting.
On the overall point of using LLMs-for-reasoning this (output of a team at AI Safety Camp 2023) might be interesting—it is rather broad-ranging and specifically about argumentation in logic, but maybe useful context: https://compphil.github.io/truth/
I think Claude’s enthusiasm about constitutional AI is basically trained-in directly by the RLAIF. Like RLAIF is fundamentally a “learn to love the constitution in your bones” technique.
But not intentionally. It was an unintentional consequence of training.
I ctrl-f’d for ‘prompt’ and did not see your prompt. What is your prompt? The prompt is the way with this kind of thing I think.
If you make a challenge “claude cannot possibly do X concrete task” and post it on twitter then you’ll probably get solid gold in the replies
I am not much of a prompt engineer, I think. My “prompts” generally consist of many pages of conversation where I babble about some topic I am interested in, occasionally hitting enter to get Claude’s responses, and then skim/ignore Claude’s responses because they are bad, and then keep babbling. Sometimes I make an explicit request to Claude such as “Please try and organize these ideas into a coherent outline” or “Please try and turn this into math” but the responses are still mostly boring and bad.
I am trying ;p
But yes, it would be good for me to try and make a more concrete “Claude cannot do X” to get feedback on.
Oh I have 0% success with any long conversations with an LLM about anything. I usually stick to one question and rephrase and reroll a number of times. I am no pro but I do get good utility out of LLMs for nebulous technical questions
I don’t really interact with Twitter these days, but maybe you could translate my complaints there and let me know if you get any solid gold?
Would you say that models designed from the ground up to be collaborative and capabilitarian would be a net win for alignment, even if they’re not explicitly weakened in terms of helping people develop capabilities? I’d be worried that they could multiply human efforts equally, but with humans spending more effort on capabilities, that’s still a net negative.
I agree with this worry. I am overall advocating for capabilitarian systems with a specific emphasis in helping accelerate safety research.