If this were a podcast, I’d totally listen to it!
JanB
This feels like a really adversarial quote. Concretely, the post says:
Sometimes, I think getting your forum post ready for submission can be as easy as creating a pdf of your post (although if your post was written in LaTeX, they’ll want the tex file). If everything goes well, the submission takes less than an hour.
However, if your post doesn’t look like a research article, you might have to format it more like one (and even then it’s not guaranteed to get in, see this comment thread).

This looks correct to me; there are LW posts that already basically look like papers. And within the class of LW posts that should be on arXiv at all, which is the target audience of my post, posts that basically look like papers aren’t vanishingly rare.
I wrote this post. I don’t understand where your claim (“Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv”) is coming from.
Thanks!
What else should people be thinking about? You’d want to be sure that you’ll, in fact, be allowed to work on alignment. But what other hidden downsides are there?
This post is my attempt to understand Anthropic’s current strategy, and lay out the facts to consider in terms of whether Anthropic’s work is likely to be net positive and whether, as a given individual, you should consider applying.
I just want to add that “whether you should consider applying” probably depends massively on what role you’re applying for. E.g. even if you believed that pushing AI capabilities was net negative right now, you might still want to apply for an alignment role.
Here’s how I’d quickly summarize my problems with this scheme:
Oversight problems:
Overseer doesn’t know: In cases where your unaided humans don’t know whether the AI action is good or bad, they won’t be able to produce feedback which selects for AIs that do good things. This is unfortunate, because we wanted to be able to make AIs that do complicated things that have good outcomes.
Overseer is wrong: In cases where your unaided humans are actively wrong about whether the AI action is good or bad, their feedback will actively select for the AI to deceive the humans.
Catastrophe problems:
Even if the overseer’s feedback was perfect, a model whose strategy is to lie in wait until it has an opportunity to grab power will probably be able to successfully grab power.
This is the common presentation of issues with RLHF, and I find it so confusing.
The “oversight problems” are specific to RLHF; scalable oversight schemes attempt to address these problems.
The “catastrophe problem” is not specific to RLHF (you even say that). This problem may appear with any form of RL (that only rewards/penalises model outputs). In particular, scalable oversight schemes (if they only supervise model outputs) do not address this problem.
So why is RLHF more likely to lead to X-risk than, say, recursive reward modelling?
The case for why the “oversight problems” should lead to X-risk is very rarely made. Maybe RLHF is particularly likely to lead to the “catastrophe problem” (compared to, say, RRM), but this case is also very rarely made.
Should we do more research on improving RLHF (e.g. increasing its sample efficiency, or understanding its empirical properties) now?
I think this research, though it’s not my favorite kind of alignment research, probably contributes positively to technical alignment. However, it may also boost capabilities, so I think it’s better to do this research privately (or at least not to promote the results extensively in the hope of raising more funding). On balance, I normally don’t recommend that people work on this, and I normally don’t recommend that projects of this type be funded.
Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?
IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.
Where does this difference come from?
From an alignment perspective, increasing the sample efficiency of RLHF looks really good, as this would allow us to use human experts with plenty of deliberation time to provide the supervision. I’d expect this to be at least as good as several rounds of debate, IDA, or RRM in which the supervision is provided by lay people under time pressure.
From a capabilities perspective, recursive oversight schemes and adversarial training also increase capabilities. E.g. a reason that Google currently doesn’t deploy its LLMs is probably that they sometimes produce offensive output, which is probably better addressed by increasing the data diversity during finetuning (e.g. adversarial training) than by improving the specification (although that’s not clear).
The key problem here is that we don’t know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.
We often do know what rewards we “would have” provided. You can query the reward function, reward model, or human labellers. IMO, the key issue with the objective-based categorisation is a bit different: it’s nonsensical to classify an alignment failure as inner/outer based on some value of the reward function in some situation that didn’t appear during training, as that value has no influence on the final model.
In other words: Maybe we know what reward we “would have” provided in a situation that did not occur during training, or maybe we don’t. Either way, this hypothetical reward has no causal influence on the final model, so it’s silly to use this reward to categorise any alignment failures that show up in the final model.
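To make this concrete, here is a minimal sketch (in Python, with hypothetical names): a learned reward model is just a function you can evaluate, so you can always compute the reward a never-seen situation "would have" received; but unless that value actually enters a training update, it cannot shape the final policy.

```python
# Minimal sketch (hypothetical names): a learned reward model is just a
# scoring function, so it can be queried on situations that never occurred
# during training.
def reward_model(prompt: str, response: str) -> float:
    # Stand-in for a trained preference model; in practice this would be a
    # neural network fitted to human comparison data.
    return 1.0 if "thanks" in response.lower() else 0.0

# A (prompt, response) pair that never appeared during RL fine-tuning:
counterfactual_reward = reward_model(
    "A situation the policy never encountered in training.",
    "A response it never produced. Thanks for asking!",
)
print(counterfactual_reward)

# We *can* compute this counterfactual reward, but because it never enters a
# policy-gradient update, it has no causal influence on the trained policy,
# so it can't explain (or be blamed for) the policy's failures.
```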
Looking for an alignment tutor
My model is that if there are alignment failures that leave us neither dead nor disempowered, we’ll just solve them eventually, in the same way we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we’ve found a reward signal that leaves us alive and in charge, we’ve solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).
If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned—the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.
OK, so this is wire-heading, right? Then you agree that it’s the wire-heading behaviours that kill us? But wire-heading (taking control of the channel of reward provision) is not, in any way, specific to RLHF. Any way of providing reward in RL leads to wire-heading. In particular, you’ve said that the problem with RLHF is that “RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what’s true.” How does incentivising palatable claims lead to the RL agent taking control of the reward provision channels? These issues seem largely orthogonal. You could have a perfect reward signal that only incentivises “true” claims, and you’d still get wire-heading.
So in what way is RLHF particularly bad? If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?
I’m not trying to be antagonistic! I do think I probably just don’t get it, and this seems like a great opportunity for me to finally understand this point :-)
I just don’t know of any plausible story for how outer alignment failures kill everyone. Even in Another (outer) alignment failure story, what ultimately does us in is wire-heading (which I don’t consider an outer alignment problem, because it happens with any possible reward).
But if the reward model doesn’t learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won’t kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this “accidentally” involves killing humans.
I agree with this. This failure mode could lead to extinction, but I’d say it’s pretty unlikely. IMO, it’s much more likely that we’ll eventually spot any such outer alignment issue and fix it (as in the early parts of Another (outer) alignment failure story).
How does an AI trained with RLHF end up killing everyone, if you assume that wire-heading and inner alignment are solved? Any half-way reasonable method of supervision will discourage “killing everyone”.
There is now also this write-up by Jan Leike: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach
[LINK] - ChatGPT discussion
Research request (alignment strategy): Deep dive on “making AI solve alignment for us”
This response does not convince me.
Concretely, I think that if I showed the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), >60% of them would have reactions along the lines of what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or …).
Your point 3) about the feedback from ML researchers could convince me that I’m wrong, depending on whom exactly you got feedback from and what that feedback looked like.
By the way, I’m highlighting this point in particular not because it’s highly critical (I haven’t thought much about how critical it is), but because it seems relatively easy to fix.
I think the contest idea is great and aimed at two absolute core alignment problems. I’d be surprised if much comes out of it, as these are really hard problems and I’m not sure contests are a good way to solve really hard problems. But it’s worth trying!
Now, a bit of a rant:
Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth.
I think this panel looks very weird to ML people. From quickly skimming the Google Scholar profiles, it looks like the total number of first-author papers in top ML conferences published by these four people is one (Goal Misgeneralisation by Lauro et al.). The person with the most legible ML credentials is Lauro, who’s an early-year PhD student with 10 citations.
Look, I know Richard and he’s brilliant. I love many of his papers. I bet that these people are great researchers and can judge this contest well. But if I put myself in the shoes of an ML researcher who’s not part of the alignment community, this panel sends a message: “wow, the alignment community has hundreds of thousands of dollars, but can’t even find a single senior ML researcher crazy enough to entertain their ideas”.

There are plenty of people who understand the alignment problem very well and who also have more ML credentials. I can suggest some, if you want.
(Probably disregard this comment if ML researchers are not the target audience for the contests.)
Currently, I’d estimate there are ~50 people in the world who could make a case for working on AI alignment to me that I’d think wasn’t clearly flawed. (I actually ran this experiment with ~20 people recently, 1 person succeeded.)
I wonder if this is because people haven’t optimised for being able to make the case. You don’t really need to be able to make a comprehensive case for AI risk to do productive research on AI risk. For example, I can chip away at the technical issues without fully understanding the governance issues, as long as I roughly understand something like “coordination is hard, and thus finding technical solutions seems good”.
Put differently: the fact that there are (in your estimation) few people who can make the case well doesn’t mean that it’s very hard to make the case well. E.g., for me personally, I don’t think I could make a case for AI risk right now that would convince you. But I think I could learn to do so relatively easily (in maybe one to three months?).
I have been thinking roughly similar things about adept.ai, in particular because they take a relatively different approach that still relies on scale.