Relevant: Scaling Laws for Transfer: https://arxiv.org/abs/2102.01293
Anthropic is also working on inner alignment; it’s just not published yet.
Regarding what “the point” of RL from human preferences with language models is: I think it’s not only to make progress on outer alignment (I would agree that this is probably not the core issue, although I still think that it’s a relevant alignment issue).
See e.g. Ajeya’s comment here:
According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):
1. Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
2. You need to have human feedback working pretty well to start testing many other strategies for alignment like debate and recursive reward modeling and training-for-interpretability, which tend to build on a foundation of human feedback.
3. Human feedback provides a more realistic baseline to compare other strategies to—you want to be able to tell clearly if your alignment scheme actually works better than human feedback.
With that said, my guess is that on the current margin people focused on safety shouldn’t be spending too much more time refining pure human feedback (and ML alignment practitioners I’ve talked to largely agree, e.g. the OpenAI safety team recently released this critiques work—one step in the direction of debate).
Furthermore, conceptual/philosophical pieces probably should be primarily posted in arXiv’s cs.CY section.
As an explanation, because this just took me 5 minutes of searching: this is the section “Computers and Society (cs.CY)”.
I agree that formatting is the most likely issue. The content of Neel’s grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. it has an Introduction section, lists contributions in bullet points, …).
So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).
Ah, sorry to hear. I wouldn’t have predicted this from reading arXiv’s content moderation guidelines.
It probably could, although I’d argue that even if not, quite often it would be worth the author’s time.
Ah, I had forgotten about this. I’m happy to endorse people or help them find endorsers.
Your posts should be on arXiv
Great post! This is the best (i.e. most concrete, detailed, clear, and comprehensive) story of existential risk from AI I know of (IMO). I expect I’ll share it widely.
Also, I’d be curious if people know of other good “concrete stories of AI catastrophe”, ideally with ample technical detail.
I’m super interested in this question as well. Here are two thoughts:
It’s not enough to look at the expected “future size of the AI alignment community”; you need to look at the full distribution.
Let’s say timelines are long. We can assume that the benefits of alignment work scale roughly logarithmically with the resources invested. The derivative of log is 1/x, and that’s how the value of a marginal contribution scales.
There is some probability, let’s say 50%, that the world starts dedicating many resources to AI risk and the number of people working on alignment is massive, let’s say 10000x today. In these cases, your contribution would be roughly zero. But there is some probability (let’s say 50%) that the world keeps being bad at preparing for potentially catastrophic events, and the AI alignment community is not much larger than today. In total, you’d only discount your contribution by 50% (compared to short timelines).
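To spell out the toy calculation (using the illustrative 50/50 numbers above, and writing $x$ for today’s level of alignment resources):

$$V(\text{resources}) = \log(\text{resources}), \qquad \frac{dV}{d(\text{resources})} = \frac{1}{\text{resources}}$$

$$\mathbb{E}[\text{marginal value of your work}] \approx 0.5 \cdot \frac{1}{x} + 0.5 \cdot \frac{1}{10000\,x} \approx 0.5 \cdot \frac{1}{x}$$

That is roughly half of the marginal value $1/x$ you would get in the short-timelines case, where the community stays at today’s size.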
This is just for illustration, and I made many implicit assumptions, like: the timing of the work doesn’t matter as long as it’s before AGI, early work does not influence the amount of future work, “the size of the alignment community at crunchtime” is identical to “future work done”, and so on...
It matters a lot how much better “work during crunchtime” is vs “work before crunchtime”.
Let’s say timelines are long, with AGI happening in 60 years. It’s totally conceivable that the world keeps being bad at preparing for potentially catastrophic events, and the AI alignment community in 60 years is not much larger than today. If mostly work done at “crunch-time” (in the 10 years before AGI) matters, then the world would not be in a better situation than in the short timelines scenario. If you could do productive work now to address this scenario, this would be pretty good (but you can’t, by assumption).
But if work done before crunchtime matters a lot, then even if the AI alignment community in 60 years is still small, we’ll probably at least have had 60 years of AI alignment work (from a small community). That’s much more than what we have in short timeline scenarios (e.g. 15 years from a small community).
So happy to see this, and such an amazing team!
Have you tried using automated adversarial attacks (in the usual ML sense) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning, e.g. with a framework like TextAttack?
In the paper, you write: “There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers.”
But my best guess would be that if you use an automatic adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious.
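For concreteness, here is a minimal sketch of the kind of attack I have in mind, using TextAttack’s TextFooler recipe (word swaps under semantic-similarity constraints). The model name and the example snippet are placeholders; I don’t know the exact interface of Redwood’s classifier, so treat this as a sketch rather than working code against their model.

```python
# Hypothetical sketch: run a semantics-preserving automated attack against a
# snippet classifier with TextAttack. "org/injury-classifier" is a placeholder
# model id, not Redwood's actual classifier.
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import Dataset
from textattack.models.wrappers import HuggingFaceModelWrapper

model = transformers.AutoModelForSequenceClassification.from_pretrained("org/injury-classifier")
tokenizer = transformers.AutoTokenizer.from_pretrained("org/injury-classifier")
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# Placeholder snippets currently labelled injurious (label 1) but near the cutoff.
near_cutoff_snippets = [
    ("He stumbled on the stairs and the glass shattered in his hand.", 1),
]

attack = TextFoolerJin2019.build(model_wrapper)  # word swaps under a semantic-similarity constraint
attacker = Attacker(attack, Dataset(near_cutoff_snippets), AttackArgs(num_examples=-1))
results = attacker.attack_dataset()

# The interesting follow-up: show the perturbed snippets to humans and check
# whether they still judge them injurious even though the classifier flipped.
for result in results:
    print(result)
```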
Some ideas for follow-up projects to Redwood Research’s recent paper
Amusing tidbit, maybe to keep in mind when writing for an ML audience: the connotations of the terms “adversarial examples” and “adversarial training” run deep :-)
I engaged with the paper and related blog posts for a couple of hours. It took a really long time until my brain accepted that “adversarial examples” here doesn’t mean the thing it usually means when I encounter the term (i.e. “small” changes to an input that change the classification, for some definition of small).
There were several instances when my brain went “Wait, that’s not how adversarial examples work”, followed by brief confusion, followed by “right, that’s because my cached concept of X is only true for adversarial examples as commonly defined in ML, not for adversarial examples as defined here”.
I guess I’d recommend the AGI safety fundamentals course: https://www.eacambridge.org/technical-alignment-curriculum
On Stuart’s list: I think this list might be suitable for some types of conceptual alignment research. But you’d certainly want to read more ML for other types of alignment research.
Have we “given it” the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?
The distinction that you’re pointing at is useful. But I would have filed it under “difference in the degree of agency”, not under “difference in goals”. When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.
E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, to then prove the Riemann hypothesis. Both systems maybe have the goal of “proving the Riemann hypothesis”, but System B has “more agency”: it certainly has self-awareness, considers more sophisticated and diverse plans of larger scale, and so on.
Section 13 (page 47) discusses data/compute scaling and the comparison to Chinchilla. Some findings:
- PaLM 540B uses 4.3x more compute than Chinchilla, and outperforms Chinchilla on downstream tasks.
- PaLM 540B is massively undertrained relative to the data-scaling laws discovered in the Chinchilla paper. (Unsurprisingly: training a 540B-parameter model on enough tokens would be very expensive.)
- Within the set of Gopher, Chinchilla, and the three sizes of PaLM, the total amount of training compute seems to predict performance on downstream tasks pretty well (a log-linear relationship). Gopher underperforms a bit.
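As a back-of-the-envelope check on these numbers, here is the standard approximation $C \approx 6ND$ ($N$ = parameters, $D$ = training tokens) together with the rough Chinchilla heuristic of ~20 training tokens per parameter; token counts are the published ones, and the figures are approximate.

```python
# Back-of-the-envelope check of the compute comparison above,
# using the common C ~ 6 * N * D approximation (N = parameters, D = training tokens).

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

palm_540b  = train_flops(540e9, 780e9)   # ~2.5e24 FLOPs
chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs
gopher     = train_flops(280e9, 300e9)   # ~5.0e23 FLOPs

print(f"PaLM 540B / Chinchilla compute: {palm_540b / chinchilla:.1f}x")  # ~4.3x

# Rough Chinchilla heuristic: ~20 training tokens per parameter.
chinchilla_optimal_tokens = 20 * 540e9
print(f"PaLM 540B tokens: 780B, Chinchilla-optimal: ~{chinchilla_optimal_tokens / 1e12:.0f}T")  # ~11T
```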
I am very surprised that the models do better on the generation task than on the multiple-choice task. Multiple-choice question answering seems almost strictly easier than having to generate the answer. Could this be an artefact of how you compute the answer in the MC QA task? Skimming the original paper, you seem to use average per-token likelihood. Have you tried other ways, e.g.
- pointwise mutual information as in the Anthropic paper, or
- adding the sentence “Which of these 5 options is the correct one?” to the end of the prompt and then evaluating the likelihood of “A”, “B”, “C”, “D”, and “E”?
I suggest this because the result is so surprising; it would be great to see whether it appears across different methods of eliciting the MC answer (a rough sketch of the three methods is below).
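To make the comparison concrete, here is a rough sketch of the three scoring methods, with gpt2 as a stand-in model; the PMI baseline prompt and the multiple-choice prompt format are my guesses, not the exact setups from the papers.

```python
# Sketch of three ways to score multiple-choice answers with a causal LM.
# gpt2 is just a stand-in; the baseline prompt for PMI is an arbitrary choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Total log-probability the model assigns to `completion` following `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    completion_ids = full_ids[0, prompt_len:]  # assumes a clean tokenization boundary
    return log_probs[prompt_len - 1:].gather(1, completion_ids.unsqueeze(1)).sum().item()

def avg_per_token(question: str, option: str) -> float:
    # 1) Average per-token log-likelihood of the option text (what the paper seems to use).
    n_tokens = len(tok(" " + option).input_ids)
    return completion_logprob(question, " " + option) / n_tokens

def pmi(question: str, option: str) -> float:
    # 2) Pointwise mutual information: subtract how likely the option is without the question.
    return completion_logprob(question, " " + option) - completion_logprob("Answer:", " " + option)

def letter_likelihood(question: str, options: list) -> dict:
    # 3) List the options, append the explicit instruction, and compare the letters' likelihoods.
    letters = ["A", "B", "C", "D", "E"][: len(options)]
    listing = "\n".join(f"{letter}. {option}" for letter, option in zip(letters, options))
    prompt = f"{question}\n{listing}\nWhich of these {len(options)} options is the correct one?"
    return {letter: completion_logprob(prompt, " " + letter) for letter in letters}
```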
We had two groups, one vegetarian/vegan, and one omnivore.
I had independently thought that this is one of the main parts where I disagree with the post, and wanted to write up a very similar comment to yours.
Highly relevant link: https://www.fhi.ox.ac.uk/wp-content/uploads/Allocating-risk-mitigation.pdf
My best guess would have been maybe 3-5x per decade, but 10x doesn’t seem crazy.