Thanks!
JanB
What else should people be thinking about? You’d want to be sure that you’ll, in fact, be allowed to work on alignment. But what other hidden downsides are there?
This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.
I just want to add that “whether you should consider applying” probably depends massively on what role you’re applying for. E.g. even if you believed that pushing AI capabilities was net negative right now, you might still want to apply for an alignment role.
Here’s how I’d quickly summarize my problems with this scheme:
Oversight problems:
Overseer doesn’t know: In cases where your unaided humans don’t know whether the AI action is good or bad, they won’t be able to produce feedback which selects for AIs that do good things. This is unfortunate, because we wanted to be able to make AIs that do complicated things that have good outcomes.
Overseer is wrong: In cases where your unaided humans are actively wrong about whether the AI action is good or bad, their feedback will actively select for the AI to deceive the humans.
Catastrophe problems:
Even if the overseer’s feedback was perfect, a model whose strategy is to lie in wait until it has an opportunity to grab power will probably be able to successfully grab power.
This is the common presentation of issues with RLHF, and I find it so confusing.
The “oversight problems” are specific to RLHF; scalable oversight schemes attempt to address these problems.
The “catastrophe problem” is not specific to RLHF (you even say that). This problem may appear with any form of RL (that only rewards/penalises model outputs). In particular, scalable oversight schemes (if they only supervise model outputs) do not address this problem.
So why is RLHF more likely to lead to X-risk than, say, recursive reward modelling?
The case for why the “oversight problems” should lead to X-risk is very rarely made. Maybe RLHF is particularly likely to lead to the “catastrophe problem” (compared to, say, RRM), but this case is also very rarely made.
Should we do more research on improving RLHF (e.g. increasing its sample efficiency, or understanding its empirical properties) now?
I think this research, though it’s not my favorite kind of alignment research, probably contributes positively to technical alignment. But it may also boost capabilities, so I think it’s better to do this research privately (or at least not to promote your results extensively in the hope of raising more funding). I normally don’t recommend that people research this, and I normally don’t recommend that projects of this type be funded.
Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?
IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.
Where does this difference come from?
From an alignment perspective, increasing the sample efficiency of RLHF looks really good, as this would allow us to use human experts with plenty of deliberation time to provide the supervision. I’d expect this to be at least as good as several rounds of debate, IDA, or RRM in which the human supervision is provided by human lay people under time pressure.
From a capabilities perspective, recursive oversight schemes and adversarial training also increase capabilities. E.g. a reason that Google currently doesn’t deploy their LLMs is probably that they sometimes produce offensive output, which is probably better addressed by increasing the data diversity during finetuning (e.g. adversarial training), rather than improving specification (although that’s not clear).
The key problem here is that we don’t know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.
We often do know what rewards we “would have” provided. You can query the reward function, reward model, or human labellers. IMO, the key issue with the objective-based categorisation is a bit different: it’s nonsensical to classify an alignment failure as inner/outer based on some value of the reward function in some situation that didn’t appear during training, as that value has no influence on the final model.
In other words: Maybe we know what reward we “would have” provided in a situation that did not occur during training, or maybe we don’t. Either way, this hypothetical reward has no causal influence on the final model, so it’s silly to use this reward to categorise any alignment failures that show up in the final model.
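The causal point can be shown with a deliberately simple toy (hypothetical code, not a real training setup): a bandit trainer whose behaviour policy never samples one arm. Whatever reward we "would have" given for that arm, the learned values come out identical.

```python
import numpy as np

# Toy 3-armed bandit. Arm 2 has zero sampling probability during
# training, so the reward we "would have" provided for it is a pure
# counterfactual with no causal path into the learned values.
def train(reward_for_unvisited):
    rewards = np.array([1.0, 0.5, reward_for_unvisited])
    q = np.zeros(3)
    rng = np.random.default_rng(0)  # same seed -> same sampled arms
    for _ in range(1000):
        arm = rng.choice([0, 1])    # arm 2 is never visited
        q[arm] += 0.1 * (rewards[arm] - q[arm])
    return q

q_low = train(reward_for_unvisited=-100.0)
q_high = train(reward_for_unvisited=+100.0)
assert np.allclose(q_low, q_high)  # the counterfactual reward changed nothing
```

Flipping the unvisited arm's reward from -100 to +100 leaves the trained values bit-for-bit identical, which is why classifying a failure by the reward in situations that never occurred during training seems nonsensical.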
My model is that if there are alignment failures that leave us neither dead nor disempowered, we’ll just solve them eventually, in similar ways as we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we’ve found a reward signal that leaves us alive and in charge, we’ve solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).
If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned—the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.
OK, so this is wire-heading, right? Then you agree that it’s the wire-heading behaviours that kill us? But wire-heading (taking control of the channel of reward provision) is not, in any way, specific to RLHF. Any way of providing reward in RL leads to wire-heading. In particular, you’ve said that the problem with RLHF is that “RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what’s true.” How does incentivising palatable claims lead to the RL agent taking control over the reward provision channels? These issues seem largely orthogonal. You could have a perfect reward signal that only incentivizes “true” claims, and you’d still get wire-heading.
So in which way is RLHF particularly bad? If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?
I’m not trying to be antagonistic! I do think I probably just don’t get it, and this seems a great opportunity for me to finally understand this point :-)
I just don’t know any plausible story for how outer alignment failures kill everyone. Even in Another (outer) alignment failure story, what ultimately does us in is wire-heading (which I don’t consider an outer alignment problem, because it happens with any possible reward).
But if the reward model doesn’t learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won’t kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this “accidentally” involves killing humans.
I agree with this. I agree that this failure mode could lead to extinction, but I’d say it’s pretty unlikely. IMO, it’s much more likely that we’ll eventually spot any such outer alignment issue and fix it (as in the early parts of Another (outer) alignment failure story).
How does an AI trained with RLHF end up killing everyone, if you assume that wire-heading and inner alignment are solved? Any halfway reasonable method of supervision will discourage “killing everyone”.
There is now also this write-up by Jan Leike: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach
This response does not convince me.
Concretely, I think that if I showed the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), I’d expect >60% to have some reactions along the lines of what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or … ).
Your point 3.) about the feedback from ML researchers could convince me that I’m wrong, depending on whom exactly you got feedback from and what that feedback looked like.
By the way, I’m highlighting this point in particular not because it’s highly critical (I haven’t thought much about how critical it is), but because it seems relatively easy to fix.
I think the contest idea is great and aimed at two absolute core alignment problems. I’d be surprised if much comes out of it, as these are really hard problems and I’m not sure contests are a good way to solve really hard problems. But it’s worth trying!
Now, a bit of a rant:
Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth.
I think this panel looks very weird to ML people. Very quickly skimming the Scholar profiles, it looks like the sum of first-author papers in top ML conferences published by these four people is one (Goal Misgeneralisation by Lauro et al.). The person with the most legible ML credentials is Lauro, who’s an early-year PhD student with 10 citations.
Look, I know Richard and he’s brilliant. I love many of his papers. I bet that these people are great researchers and can judge this contest well. But if I put myself into the shoes of an ML researcher who’s not part of the alignment community, this panel sends a message: “wow, the alignment community has hundreds of thousands of dollars, but can’t even find a single senior ML researcher crazy enough to entertain their ideas”. There are plenty of people who understand the alignment problem very well and who also have more ML credentials. I can suggest some, if you want.
(Probably disregard this comment if ML researchers are not the target audience for the contests.)
Currently, I’d estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I’d think wasn’t clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded.)
I wonder if this is because people haven’t optimised for being able to make the case. You don’t really need to be able to make a comprehensive case for AI risk to do productive research on AI risk. For example, I can chip away at the technical issues without fully understanding the governance issues, as long as I roughly understand something like “coordination is hard, and thus finding technical solutions seems good”.
Put differently: The fact that there are (in your estimation) few people who can make the case well doesn’t mean that it’s very hard to make the case well. E.g., for me personally, I think I could not make a case for AI risk right now that would convince you. But I think I could relatively easily learn to do so (in maybe one to three months???)
I had independently thought that this is one of the main parts where I disagree with the post, and wanted to write up a very similar comment to yours. Highly relevant link: https://www.fhi.ox.ac.uk/wp-content/uploads/Allocating-risk-mitigation.pdf My best guess would have been maybe 3-5x per decade, but 10x doesn’t seem crazy.
Relevant: Scaling Laws for Transfer: https://arxiv.org/abs/2102.01293
Anthropic is also working on inner alignment, it’s just not published yet.
Regarding what “the point” of RL from human preferences with language models is: I think it’s not only to make progress on outer alignment (I would agree that this is probably not the core issue, although I still think it’s a relevant alignment issue).
See e.g. Ajeya’s comment here:
According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):
Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
You need to have human feedback working pretty well to start testing many other strategies for alignment like debate and recursive reward modeling and training-for-interpretability, which tend to build on a foundation of human feedback.
Human feedback provides a more realistic baseline to compare other strategies to—you want to be able to tell clearly if your alignment scheme actually works better than human feedback.
With that said, my guess is that on the current margin people focused on safety shouldn’t be spending too much more time refining pure human feedback (and ML alignment practitioners I’ve talked to largely agree, e.g. the OpenAI safety team recently released this critiques work—one step in the direction of debate).
Furthermore, conceptual/philosophical pieces probably should be primarily posted on arXiv’s cs.CY section.
As an explanation, because this just took me 5 minutes of search: This is the section “Computers and Society (cs.CY)”
I agree that formatting is the most likely issue. The content of Neel’s grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. it has an Introduction section, lists contributions in bullet points, …).
So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).
Ah, sorry to hear. I wouldn’t have predicted this from reading arXiv’s content moderation guidelines.
I wrote this post. I don’t understand where your claim (“Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv”) is coming from.