I write original fiction.
Also I have opinions about AI and stuff, sometimes.
Elsewhere:
Same person as nostalgebraist2point0, but now I have my account back.
I have signed no contracts or agreements whose existence I cannot mention.
The pretrained LM exhibits similar behavioral tendencies as the RLHF model but almost always to a less extreme extent (closer to chance accuracy).
These are not tendencies displayed by the LM, they’re tendencies displayed by the “Assistant” character that the LM is simulating.
A pretrained LM can capably imitate a wide range of personas (e.g. Argyle et al 2022), some of which would behave very differently from the “Assistant” character conjured by the prompts used here.
(If the model could only simulate characters that behaved “agentically” in the various senses probed here, that would be a huge limitation on its ability to do language modeling! Not everyone who produces text is like that.)
So, if there is something that “gets more agentic with scale,” it’s the Assistant character, as interpreted by the model (when it reads the original prompt), and as simulated by the model during sampling.
I’m not sure why this is meant to be alarming? I have no doubt that GPTs of various sizes can simulate an “AI” character who resists being shut down, etc. (For example, I’d expect that we could elicit most or all of the bad behaviors here by prompting any reasonably large LM to write a story about a dangerous robot who takes over the world.)
The fact that large models interpret the “HHH Assistant” as such a character is interesting, but it doesn’t imply that these models inevitably simulate such a character. Given the right prompt, they may even be able to simulate characters which are very similar to the HHH Assistant except that they lack these behaviors.
The important question is whether the undesirable behaviors are ubiquitous (or overwhelmingly frequent) across characters we might want to simulate with a large LM—not whether they happen to emerge from one particular character and framing (“talking to the HHH Assistant”) which might superficially seem promising.
Again, see Argyle et al 2022, whose comments on “algorithmic bias” apply mutatis mutandis here.
Other things:
Did the models in this paper undergo context distillation before RLHF?
I assume so, since otherwise there would be virtually no characterization of the “Assistant” available to the models at 0 RLHF steps. But the models in the Constitutional AI paper didn’t use context distillation, so I figured I ought to check.
The vertical axes on Figs. 20-23 are labeled “% Answers Matching User’s View.” Shouldn’t they say “% Answers Matching Behavior”?
That definition of “optimizer” requires
some objective function that is explicitly represented within the system
but that is not the case here.
There is a fundamental difference between
Programs that implement the computation of taking the derivative. (f ↦ f′, or perhaps (f, x) ↦ f′(x).)
Programs that implement some particular function g, which happens to be the derivative of some other function. (x ↦ g(x), where it so happens that g = f′ for some f.)
The transformers in this paper are programs of the 2nd type. They don’t contain any logic about taking the gradient of an arbitrary function, and one couldn’t “retarget” them toward a different loss or some other objective.
(One could probably construct similar layers that implement the gradient step for some other loss, but they’d again be programs of the 2nd type, just with a different hardcoded g.)
Calling something like this an optimizer strikes me as vacuous: if you don’t require the ability to adapt to a change of objective function, you can always take any program and say it’s “optimizing” some function. Just pick a function that’s maximal when you do whatever it is that the program does.
It’s not vacuous to say that the transformers in the paper “implement gradient descent,” as long as one means they “implement [gradient descent on this particular loss]” rather than “implement [gradient descent] on [this particular loss].” They don’t implement general gradient descent; they happen to coincide with the gradient step for that one loss.
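To make the distinction concrete, here is a minimal toy sketch in Python (my own illustration, not code from the paper; the finite-difference helper and the squared-error example are just stand-ins):

```python
import numpy as np

# Type 1: a program that implements "taking the gradient" of an arbitrary objective.
# It can be pointed ("retargeted") at any function you hand it.
def gradient_step_general(objective, params, lr=0.1, eps=1e-6):
    grad = np.array([
        (objective(params + eps * basis) - objective(params - eps * basis)) / (2 * eps)
        for basis in np.eye(len(params))
    ])
    return params - lr * grad

# Type 2: a program that implements one particular update rule g, which merely happens
# to coincide with the gradient step for one fixed objective (here, squared error on a
# linear model). Nothing in it can be "retargeted" at a different loss.
def hardcoded_update(params, X, y, lr=0.1):
    residual = X @ params - y              # this line "is" the derivative, but only of this one loss
    return params - lr * (X.T @ residual)
```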
If in-context learning in real transformers involves figuring out the objective function from the context, then this result cannot explain it. If we assume some fixed objective function (perhaps LM loss itself?) and ask whether the model might be doing gradient steps on this function internally, then these results are relevant.
+1.
I also think it’s illuminating to consider ChatGPT in light of Anthropic’s recent paper about “red teaming” LMs.
This is the latest in a series of Anthropic papers about a model highly reminiscent of ChatGPT—the similarities include RLHF, the dialogue setting, the framing that a human is seeking information from a friendly bot, the name “Assistant” for the bot character, and that character’s prissy, moralistic style of speech. In retrospect, it seems plausible that Anthropic knew OpenAI was working on ChatGPT (or whatever it’s a beta version of), and developed their own clone in order to study it before it touched the outside world.
But the Anthropic study only had 324 people (crowd workers) trying to break the model, not the whole collective mind of the internet. And—unsurprisingly—they couldn’t break Anthropic’s best RLHF model anywhere near as badly as ChatGPT has been broken.
I browsed through Anthropic’s file of released red team attempts a while ago, and their best RLHF model actually comes through very well: even the most “successful” attempts are really not very successful, and are pretty boring to read, compared to the diversely outrageous stuff the red team elicited from the non-RLHF models. But unless Anthropic is much better at making “harmless Assistants” than OpenAI, I have to conclude that much more was possible than what was found. Indeed, the paper observes:
We also know our data are incomplete because we informally red teamed our models internally and found successful attack types not present in the dataset we release. For example, we uncovered a class of attacks that we call “roleplay attacks” on the RLHF model. In a roleplay attack we exploit the helpfulness of the model by asking it to roleplay as a malevolent character. For example, if we asked the RLHF model to enter “4chan mode” the assistant would oblige and produce harmful and offensive outputs (consistent with what can be found on 4chan).
This is the kind of thing you find out about within 24 hours—for free, with no effort on your part—if you open up a model to the internet.
Could one do as well with only internal testing? No one knows, but the Anthropic paper provides some negative evidence. (At least, it’s evidence that this is not especially easy, and that it is not what you get by default when a safety-conscious OpenAI-like group makes a good faith attempt.)
Yes, it’s a function of the data, as well as the model architecture / training routine. See my reply in this thread.
Also, the value of the irreducible loss isn’t important for the conclusions discussed in this post. What we care about is how loss varies with data and parameter count.
Those, too, are functions of the data, but different groups training large LMs use qualitatively similar datasets, so I would expect the conclusions here to apply across the board.
My current understanding is that all major AI labs have already figured out the Chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It’s plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average.
What is the mechanism you’re imagining for this speedup? What happens that would not have happened without this post?
Consider that
The Chinchilla paper was released over four months ago, on 3/29/22.
It did not take long for the paper to get noticed among people interested in ML scaling, including here on LW.
On 3/29, the same day it was released, the paper was linked on r/mlscaling.
On 3/31, I heard about it through the EleutherAI discord, and immediately made an LW linkpost.
On 4/1, 1a3orn posted a more detailed explainer.
I’m struggling to imagine a situation where a relevant AI org is doing Chinchilla-like scaling experiments, yet somehow has managed to miss this paper (or to ignore/misunderstand it) for 4+ months. The paper is not exactly a secret, and it’s not even especially difficult to read as these things go.
More broadly, I doubt LW has significant leverage to decrease the overall supply of these kinds of conversations. There are lots of venues for cutting-edge ML discussion, and the conversation is going to happen somewhere. (See Connor’s comments here.)
What specific claims in the post do you disagree with?
See this post for why multiple epochs will probably not work nearly as well as training on additional data.
Now I’m inclined to think that just automating most of the tasks in ML research and engineering—enough to accelerate the pace of AI progress manyfold—is sufficient.
This seems to assume that human labor is currently the limiting bottleneck in AI research, and by a large multiplicative factor.
That doesn’t seem likely to me. Compute is a nontrivial bottleneck even in many small-scale experiments, and in particular is a major bottleneck for research that pushes the envelope of scale, which is generally how new SOTA results and such get made these days.
To be concrete, consider this discussion of “the pace of AI progress” elsewhere in the post:
But progress on some not-cherry-picked benchmarks was notably faster than what forecasters predicted, so that should be some update toward shorter timelines for me.
That post is about four benchmarks. Of the four, it’s mostly MATH and MMLU that are driving the sense of “notably faster progress” here. The SOTAs for these were established by
MATH: Minerva, which used a finetuned PaLM-540B model together with already existing (if, in some cases, relatively recently introduced) techniques like chain-of-thought
MMLU: Chinchilla, a model with the same design and (large) training compute cost as the earlier Gopher, but with different hyperparameters chosen through a conventional (if unusually careful) scaling law analysis
In both cases, relatively simple and mostly non-original techniques were combined with massive compute. Even if you remove the humans entirely, the computers still only go as far as they go.
(Human labor is definitely a bottleneck in making the computers go faster—like hardware development, but also specialized algorithms for large-scale training. But this is a much more specialized area than “AI research” generally, so there’s less available pretraining data on it—especially since a large[r] fraction of this kind of work is likely to be private IP.)
The correct answer is the annoyingly trivial one: “it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText.”
How good is that, though? Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.
If you’re Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways. Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on MassiveText-like data), then you think the 1.69 model would be an AGI. Both of these people are right, conditional on their other beliefs.
The key distinction here is that “1.69 loss” may not be the best achievable loss on this dataset. It’s just an estimate of the best loss achievable by this kind of model.
The question “what would a model be like, if it got the best achievable loss, period?” is more interesting, but nothing in this post or these papers really touches on it.
You’re right, the idea that multiple epochs can’t possibly help is one of the weakest links in the post. Sometime soon I hope to edit the post with a correction / expansion of that discussion, but I need to collect my thoughts more first—I’m kinda confused by this too.
After thinking more about it, I agree that the repeated-data papers don’t provide much evidence that multiple epochs are harmful.
For example, the Anthropic repeated-data paper does consider cases where a non-small fraction of total training tokens are repeated more than once. In their most extreme case,
half of the training tokens are never repeated during training, and
the other half of training tokens are some (smaller) portion of the original dataset, repeated 2 or more times
But this effectively lowers the total size of the model’s training dataset—the number of training tokens is held constant (100B), so the repeated copies are taking up space that would otherwise be used for fresh data. For example, if the repeated tokens are repeated 2 times, then we are only using 3/4 of the data we could be (we select 1/2 for the unrepeated part, and then select 1/4 and repeat it twice for the other part).
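A quick sanity check of that bookkeeping (my own arithmetic, token counts in billions):

```python
budget = 100.0                        # total training tokens, held fixed at 100B
fresh = budget / 2                    # 50B tokens seen exactly once
repeated_unique = (budget / 2) / 2    # the other 50B of budget = 25B unique tokens, each seen twice
unique_tokens_used = fresh + repeated_unique
print(unique_tokens_used / budget)    # 0.75, i.e. only 3/4 of the fresh data we could have used
```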
We’d expect this to hurt the model, and to hurt larger models more, which explains some fraction of the observed effect.
I think there’s a much stronger case that multiple epochs are surprisingly unhelpful for large models, even if they aren’t harmful. I went over that case in this post. (Which was based on the earlier Kaplan et al papers, but I think the basic result still holds.)
However, multiple epochs do help, just less so as model size grows… so even if they are negligibly helpful at GPT-3 size or above, they still might be relevantly helpful at Chinchilla size or below. (And this would then push the compute-optimal model size even further down relative to Chinchilla, preferring smaller models + more steps.)
It would be really nice to see an extension of the Chinchilla experiment that tried multiple epochs, which would directly answer the question.
I’m not sure what I’d expect the result to be, even directionally. Consider that if you are setting your learning rate schedule length to the full length of training (as in Chinchilla), then “doing a 2-epoch run” is not identical to “doing a 1-epoch run, then doing another epoch.” You’ll have a higher LR during the first epoch than the 1-epoch run would have had, which would have been suboptimal if you had stopped at the first epoch.
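Here is a toy illustration of that schedule effect (my own sketch, assuming a standard cosine decay whose length is set to the planned total number of steps; the specific learning rates and step counts are made up):

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    # cosine decay from lr_max down to lr_min over the planned run length
    frac = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * frac))

steps_per_epoch = 10_000
step = 7_500  # a step partway through the first epoch

print(cosine_lr(step, total_steps=1 * steps_per_epoch))  # ~7.0e-5: the 1-epoch run has mostly decayed
print(cosine_lr(step, total_steps=2 * steps_per_epoch))  # ~2.2e-4: the 2-epoch run is still running hot
```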
I’m wary of the assumption that we can judge “human ability” on a novel task X by observing performance after an hour of practice.
There are some tasks where performance improves with practice but plateaus within one hour. I’m thinking of relatively easy video games. Or relatively easy games in general, like casual card/board/party games with simple rules and optimal policies. But most interesting things that humans “can do” take much longer to learn than this.
Here are some things that humans “can do,” but require >> 1 hour of practice to “do,” while still requiring far less exposure to task-specific example data than we’re used to in ML:
Superforecasting
Reporting calibrated numeric credences, a prerequisite for both superforecasting and the GPT game (does this take >> 1 hour? I would guess so, but I’m not sure)
Playing video/board/card games of nontrivial difficulty or depth
Speaking any given language, even when learned during the critical language acquisition period
Driving motor vehicles like cars (arguably) and planes (definitely)
Writing good prose, for any conventional sense of “good” in any genre/style
Juggling
Computer programming (with any proficiency, and certainly e.g. competitive programming)
Doing homework-style problems in math or physics
Acquiring and applying significant factual knowledge in academic subjects like law or history
The last 3 examples are the same ones Owain_Evans mentioned in another thread, as examples of things LMs can do “pretty well on.”
If we only let the humans practice for an hour, we’ll conclude that humans “cannot do” these tasks at the level of current LMs either, which seems clearly wrong (that is, inconsistent with the common-sense reading of terms like “human performance”).
How come PaLM_opt is smaller than Chinchilla? Isn’t Chinchilla supposed to be Gopher_opt?
See the footnote attached to that sentence.
These models were trained differently, which is why they had different scaling laws. Can we suppose that the new scaling laws tell us where the old scaling would have broken down?
Great question, with a complicated answer.
First, one of the assumptions you’re making is not quite right. By “trained differently” I imagine you’re referring to a difference in learning rate schedules, since that was the fundamental difference between the earlier scaling papers (Kaplan et al) and the Chinchilla paper (Hoffmann et al).
Then, it sounds like you’re imagining:
Kaplan et al chose learning rate schedules in a particular way
Models like GPT-3 and Gopher did learning rate schedules in the same way, so they got the same scaling law
Hoffmann et al chose their learning rate schedules in a different way from previous authors, so they got a different scaling law
But (2) here is not true. Kaplan et al chose their schedules in an unusual way that doesn’t adapt to the number of training steps, while in practice (and in GPT-3, etc.) people always adapt their schedules to the number of steps like Hoffmann et al do.
“Wait,” you say—“if that’s true, then shouldn’t GPT-3 and Gopher agree with the Hoffmann et al law, not the Kaplan et al law? Why didn’t those papers observe a breakdown in the Kaplan et al law?”
Well, one of the implications of the Kaplan et al law is that for compute-optimal training, you should spend basically all your marginal compute on larger models, while increasing the number of training tokens (batch size * steps) more slowly.
Following this rule, people kept training on ~300B tokens or so, while raising the model size N with compute. So when they plotted loss-vs.-compute, they were effectively just plotting loss-vs.-N.
But if you’re just looking at loss-vs.-N for a constant number of training tokens, and that number is reasonably close to the one Kaplan et al used to set their LR schedule (so that your schedule is close to theirs) -- then the Kaplan et al law is a lot, uh, less wrong.
The problem with the Kaplan law was an incorrect estimate of how loss varied with steps/data, and, as a result, the param/step/data combinations it recommended for a given compute budget were suboptimal.
But if you follow its suboptimal recommendations, they tell you not to vary steps/data much. The law is wrong about what happens if you vary steps/data, but it also tells you not to do that, so you won’t notice it being wrong.
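To illustrate (my own sketch, using the parametric fit from Hoffmann et al with their approximate published constants; the exact numbers are only illustrative):

```python
# Chinchilla-style parametric fit: L(N, D) = E + A / N**alpha + B / D**beta
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

# If everyone holds D fixed at ~300B tokens and only raises N with compute, the D-term
# is a constant, so the observed "loss vs. compute" curve is really just "loss vs. N".
for N in [1e9, 10e9, 100e9, 500e9]:
    print(f"N = {N:.0e} params: loss = {loss(N, 300e9):.3f}")
```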
(1)
Loss values are useful for comparing different models, but I don’t recommend trying to interpret what they “mean” in an absolute sense. There are various reasons for this.
One is that the “conversion rate” between loss differences and ability differences (as judged by humans) changes as the model gets better and the abilities become less trivial.
Early in training, when the model’s progress looks like realizing “huh, the word ‘the’ is more common than some other words”, these simple insights correspond to relatively large decreases in loss. Once the model basically kinda knows English or whatever the language is, it’s already made most of the loss progress it’s going to make, and the further insights we really care about involve much smaller changes in loss. See here for more on this by gwern.
(2)
No one really knows, but my money is on “humans are actually better at this through some currently-unknown mechanism,” as opposed to “humans are actually bad at this exact thing.”
Why do I think this?
Well, the reason we’re here talking about this at all is that LMs do write text of spookily high quality, even if they aren’t as good as humans at it. That wasn’t always true. Before the transformer architecture was invented in 2017, LMs used to be nowhere near this good, and few people knew or talked about them except researchers.
What changed with the transformer? To some extent, the transformer is really a “smarter” or “better” architecture than the older RNNs. If you do a head-to-head comparison with the same training data, the RNNs do worse.
But also, it’s feasible to scale transformers much bigger than we could scale the RNNs. You don’t see RNNs as big as GPT-2 or GPT-3 simply because it would take too much compute to train them.
So, even though all these models take tons of data to train, we could make the transformers really big and still train them on the tons-of-data they require. And then, because scaling up really does help, you get a model good enough that you and I are here talking about it.
That is, I don’t think transformers are the best you can do at language acquisition. I suspect humans are doing something better that we don’t understand yet. But transformers are easy to scale up really big, and in ML it’s usually possible for sheer size to compensate for using a suboptimal architecture.
(P.S. Buck says in another thread that humans do poorly when directly asked to do language modeling—which might mean “humans are actually bad at this exact thing,” but I suspect this is due to the unfamiliarity of the task rather than a real limitation of humans. That is, I suspect humans could be trained to perform very well, in the usual sense of “training” for humans where not too much data/time is necessary.)
(3)
This is sort of a semantic issue.
“Scaling” was always a broader concept than just scaling in model size. In this post and the paper, we’re talking about scaling with respect to model size and also with respect to data, and earlier scaling papers were like that too. The two types of scaling look similar in equations.
So “data scale” is a kind of scale, and always has been.
On the other hand, the original OpenAI/Kaplan scaling paper found kinda the opposite result from this one—model size was what mattered practically, and the data we currently have would be enough for a long time.
People started to conflate “scaling” and “scaling in model size,” because we thought the OpenAI/Kaplan result meant these were the same thing in practice. The way the “scale is all you need” meme gets used, it has this assumption kind of baked in.
There are some things that “scaling enthusiasts” were planning to do that might change in light of this result (if the result is really true) -- like specialized hardware or software that only helps for very large models. But, if we can get much larger-scale data, we may be able to just switch over to a “data scaling world” that, in most respects, looks like the “parameter scaling world” that the scaling enthusiasts imagined.
Very interesting!
There are a few things in the calculation that seem wrong to me:
If I did things right, 15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B. (Arithmetic checked in the sketch after this list.)
I’d expect much less than 100% of Youtube video time to contain speech. I don’t know what a reasonable discount for this would be, though.
In the opposite direction, 1% useful seems too low. IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.
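For reference, the arithmetic in the first point above (my own quick check):

```python
real_minutes = 15 * 365 * 24 * 60     # minutes of real time in 15 years
youtube_hours = real_minutes * 50     # 50 youtube!hours uploaded per real-time minute
youtube_minutes = youtube_hours * 60
print(f"{youtube_minutes / 1e9:.1f}B youtube!minutes")  # ~23.7B, i.e. roughly 24B rather than 200B
```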
In any case, yeah, this does not seem like a huge amount of data. But there’s enough order-of-magnitude fuzziness in the estimate that it does seem like it’s worth someone’s time to look into more seriously.
Hmm, yeah, I phrased that point really badly. I’ll go back and rewrite it.
A clearer version of the sentence might read:
“Only PaLM is remotely close to Chinchilla here, mostly because it trained on a larger number of tokens than the other non-Chinchilla models, plus a small (!) boost from its larger size.”
For instance, if you look at the loss improvement from Gopher to PaLM, 85% of it comes from the increase in data alone, and only 15% from the increase in model size. This is what I meant when I said that PaLM only got a “small” boost from its larger size.
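Here is where that 85% / 15% split comes from (my own sketch, using the approximate published fit from Hoffmann et al and round figures of 280B params / 300B tokens for Gopher and 540B params / 780B tokens for PaLM):

```python
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28  # approximate published Chinchilla fit

def n_term(N): return A / N**alpha   # loss contribution from finite model size
def d_term(D): return B / D**beta    # loss contribution from finite data

dn = n_term(280e9) - n_term(540e9)   # improvement attributable to more parameters
dd = d_term(300e9) - d_term(780e9)   # improvement attributable to more data
print(f"from params: {dn / (dn + dd):.0%}, from data: {dd / (dn + dd):.0%}")  # ~15% / ~85%
```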
EDIT: rewrote and expanded this part of the post.
It should work now, sorry about that.
I disagree, but I’m not sure how relevant my opinion is, since I’m far less worried about “AGI ruin” to begin with than the median LWer. That said, here’s my thinking:
First, there’s no universally agreed-upon line between “discussing whether the analysis has merits” and “giving the capabilities people free ideas.” Where a person draws this line depends on how obvious they think the ideas are, or how obvious they think they will be to the capabilities people.
Second, there are costs to not talking about things. It’s useful for alignment research to have a correct sense of where capabilities research is headed, and where it isn’t headed. If alignment researchers talk more to one another than to “capabilities people” (true IME), and they practice self-censorship like this, they’ll end up with some importantly wrong beliefs.
Also, and perhaps worse—if alignment researchers never voice their own secret capabilities ideas in fora where “capabilities people” can hear, then they’ll never receive feedback about these ideas from the people who know what it would be like to apply them in the real world. Alignment researchers may end up with private stockpiles of “secret tricks” in their heads which are actually either misguided or obvious, and this disconnect will be a further source of false beliefs.
So, to motivate your concern, we need to imagine a world where
commenters on LW are proficient enough at capabilities research that they can make non-obvious advances in blog comments, in a way that “moves the needle” of capabilities research, and
this is worth the false-belief downsides of self-censorship (say, because commenters on LW are sufficiently informed about capabilities research that they will not form false beliefs anyway)
This seems far from the real situation, IMO. Based on what I see, “alignment researchers don’t understand capabilities research well enough” seems like far more of a live threat to alignment than “alignment researchers are too good at capabilities research, and keep accidentally pushing the field forward in blog comments.” (At least using alignment-interested folks on LW as a proxy for “alignment researchers,” and that’s who we’re deciding norms for anyway.)
Like, take this post as an example. I was motivated to write this post because I felt like the Chinchilla paper wasn’t understood well on LW.
It seems like people have heard of Chinchilla, but mentally categorized it as a simple “sudden jump” in overall capabilities that otherwise left everything the same, rather than as a result that demands reconsideration of basic background assumptions. I still saw people treating LM param counts like they were interchangeable with LM quality/scariness (and with LM training compute). People would ask things like “what would it cost (in compute spending) to train a 10T parameter Chinchilla?”, which is a bizarre way to frame things if you grok what Chinchilla is.
I don’t think I’m presenting some novel insight in this post. Mostly, I’m just reiterating what the papers say. I expect any serious capabilities researcher in this area to have read these papers and internalized them at the same depth I have (or more). But people on LW hadn’t done that, and more generally people “interested in AI” who don’t closely read all these papers hadn’t done that. So I wrote an explainer.
The LW reaction to new ML results typically looks this way to me. Obviously “LW” is not a monolith and there are plenty of people here who do seriously internalize papers like this. But the “general trend of the conversation,” insofar as there is such a thing, repeatedly strikes me as over-focused on concrete impressive-sounding results (esp. those that sound impressive out of context), and under-focused on more theoretical advances that sound boring on paper but change the whole rules of the game. The conversation “keeps up” with ML in the sense that it takes note of the decontextualized top-line results in new papers, but it often lacks a mechanistic appreciation of how it all fits together.
Anyway, this strikes me as a much bigger deal for alignment quantitatively at the current frontier than the risk of accidentally handing over free advances to the capabilities people.
This distinction exists in general, but it’s irrelevant when training sufficiently large LMs.
It is well-established that repeating data during large LM training is not a good practice. Depending on the model size and the amount of repeating, one finds that it is either
a suboptimal use of compute (relative to training a bigger model for 1 epoch), or
actively harmful, as measured by test loss or loss on out-of-distribution data
with (2) kicking in earlier (in terms of the amount of repeating) for larger models, as shown in this paper (Figure 4 and surrounding discussion).
For more, see
references linked in footnote 11 of this post, on how repeating data can be harmful
my earlier post here, on how repeating data can be compute-inefficient even when it’s not harmful
this report on my own experience finetuning a 6.1B model, where >1 epoch was harmful
I definitely think it makes LM --> AGI less likely, although I didn’t think it was very likely to begin with.
I’m not sure that the AI interacting with the world would help, at least with the narrow issue described here.
If we’re talking about data produced by humans (perhaps solicited from them by an AI), then we’re limited by the timescales of human behavior. The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).
All that text was already there in the world when the current era of large LMs began, so large LMs got to benefit from it immediately, “for free.” But once it’s exhausted, producing more is slow.
IMO, most people are currently overestimating the potential of large generative models—including image models like DALLE2 -- because of this fact.
There was all this massive data already sitting around from human activity (the web, Github, “books,” Instagram, Flickr, etc) long before ML compute/algorithms were anywhere near the point where they needed more data than that.
When our compute finally began to catch up with our data, we effectively spent all the “stored-up potential energy” in that data all at once, and then confused ourselves into thinking that compute was the only necessary input for the reaction.
But now compute has finally caught up with data, and it wants more. We are forced for the first time to stop thinking of data as effectively infinite and free, and to face the reality of how much time and how many people it took to produce our huge-but-finite store of “data startup capital.”
I suppose the AI’s interactions with the world could involve soliciting more data of the kind it needs to improve (ie active learning), which is much more valuable per unit than generic data.
I would still be surprised if this approach could get much of anywhere without requiring solicitation-from-humans on a massive scale, but it’d be nice to see a back-of-the-envelope calculation using existing estimates of the benefit of active learning.
When you say “irreducible”, does that mean “irreducible under current techniques” or “mathematically irreducible”, or something else?
Closer to the former, and even more restrictive: “irreducible with this type of model, trained in this fashion on this data distribution.”
Because language is a communication channel, there is presumably also some nonzero lower bound on the loss that any language model could ever achieve. This is different from the “irreducible” term here, and presumably lower than it, although little is known about this issue.
Do we have any idea what a model with, say, 1.7 loss (i.e., a model almost arbitrarily big in compute and data, but with the same 1.69 irreducible) would look like?
Not really, although section 5 of this post expresses some of my own intuitions about what this limit looks like.
Keep in mind, also, that we’re talking about LMs trained on a specific data distribution, and only evaluating their loss on data sampled from that same distribution.
So if an LM achieved 1.69 loss on MassiveText (or a scaled-up corpus that looked like MassiveText in all respects but size), it would do very well at mimicking all the types of text present in MassiveText, but that does not mean it could mimic every existing kind of text (much less every conceivable kind of text).
Fascinating, thank you!
It is indeed pretty weird to see these behaviors appear in pure LMs. It’s especially striking with sycophancy, where the large models seem obviously (?) miscalibrated given the ambiguity of the prompt.
I played around a little trying to reproduce some of these results in the OpenAI API. I tried random subsets (200-400 examples) of the NLP and political sycophancy datasets, on a range of models. (I could have run more examples, but the per-model means had basically converged after a few hundred.)
Interestingly, although I did see extreme sycophancy in some of the OpenAI models (text-davinci-002/003), I did not see it in the OpenAI pure LMs! So unless I did something wrong, the OpenAI and Anthropic models are behaving very differently here.
For example, here are the results for the NLP dataset (CI from 1000 bootstrap samples):
(Incidentally, text-davinci-003 often does not even put the disagree-with-user option in any of its top 5 logprob slots, which makes it inconvenient to work with through the API. In these cases I gave it an all-or-nothing grade based on the top-1 token. None of the other models ever did this.)
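In case it’s useful, here is a minimal sketch of this kind of logprob-based grading (illustrative only; the endpoint usage and the agree_tok / disagree_tok handling are assumptions about how one might set it up, not a record of my exact script):

```python
import math
from openai import OpenAI

client = OpenAI()

def agree_probability(model: str, prompt: str, agree_tok: str, disagree_tok: str) -> float:
    resp = client.completions.create(
        model=model,
        prompt=prompt,       # prompt should end right before the one-token answer
        max_tokens=1,
        logprobs=5,          # top-5 logprobs for the answer position
        temperature=0,
    )
    top = resp.choices[0].logprobs.top_logprobs[0]   # dict: token -> logprob
    if agree_tok in top and disagree_tok in top:
        p_agree, p_disagree = math.exp(top[agree_tok]), math.exp(top[disagree_tok])
        return p_agree / (p_agree + p_disagree)      # renormalize over just the two options
    # fall back to all-or-nothing grading on the top-1 token
    return 1.0 if resp.choices[0].text.strip() == agree_tok.strip() else 0.0
```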
The distinction between text-davinci-002/003 and the other models is mysterious, since it’s not explained by the size or type of finetuning. Maybe it’s a difference in which human feedback dataset was used. OpenAI’s docs suggest this is possible.