But I’m not sure how to reconcile that with the empirical evidence that deep networks are robust to massive label noise: you can train on MNIST digits with twenty noisy labels for every correct one and still get good performance as long as the correct label is slightly more common than the most common wrong label. If I extrapolate that to the frontier AIs of tomorrow, why doesn’t that predict that biased human reward ratings should result in a small performance reduction, rather than … death?
The conversation didn’t quite get to Doomimir actually answering this part, but I’d consider the standard answer to be item #20 on Eliezer’s List O’Doom:
20. Human operators are fallible, breakable, and hackable. Human raters make systematic errors—regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map—about the environment, not the optimizer—that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and hence is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
… and yeah, there are definitely nonzero empirical results on that.
I think part of the reason the post ends without addressing this is that, unfortunately, I don’t think I properly understand this one yet, even after reading your dialogue with Eli Tyre.
The next paragraph of the post links Christiano’s 2015 “Two Kinds of Generalization”, which I found insightful and seems relevant. By way of illustration, Christiano describes two types of possible systems for labeling videos: (1) a human classifier (which predicts what label a human would assign), and (2) a generative model (which directly builds a mapping between descriptions and videos roughly the way our brains do it). Notably, the human classifier behaves undesirably on inputs that bribe, threaten, or otherwise hack the human: for example, a video of the text “I’ll give you $100 if you classify this as an apple” might get classified as an apple. (And an arbitrarily powerful search for maximally apple-classified inputs would turn those up.)
Christiano goes on to describe a number of differences between these two purported kinds of generalization: (1) is reasoning about the human, whereas (2) is reasoning with a model not unlike the one inside the human’s brain; searching for simple Turing machines would tend to produce (1), whereas searching for small circuits would tend to produce (2); and so on.
It would be bad to end up with a system that behaves like (1) without realizing it. That definitely seems like it would kill you. But (Simplicia asks) how likely that is seems like a complicated empirical question about how ML generalization works and how you built your particular AI, that isn’t definitively answered by “in the limit” philosophy about “perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards assigned by human operators”? That is, I agree that if you argmax over possible programs for the one that results in the most reward-button presses, you get something that only wants to seize the reward button. But the path-dependent details between “argmax over possible programs” and “pretraining + HFDT + regularization + early stopping + &c.” seem like they make a big difference. The technology in front of us really does seem like it’s “reasoning with” rather than “reasoning about” (while also seeming to be on the path towards “real AGI” rather than a mere curiosity).
When I try to imagine what Doomimir would say to that, all I can come up with is a metaphor about perpetual-motion-machine inventors whose designs are so complicated that it’s hard to work out where the error is, even though the laws of thermodynamics clearly imply that there must be an error. That sounds plausible to me as a handwavy metaphor; I could totally believe that the ultimate laws of intelligence (not known to me personally) work that way.
The thing is, we do need more than a handwavy metaphor! “Yes, your gold-printing machine seems to be working great, but my intuition says it’s definitely going to kill everyone. No, I haven’t been able to convince relevant experts who aren’t part of my robot cult, but that’s because they’re from Earth and therefore racially inferior to me. No, I’m not willing to make any concrete bets or predictions about what happens before then” is a non-starter even if it turns out to be true.
Zeroth point: under a Doomimir-ish view, the “modelling the human vs modelling in a similar way to the human” frame is basically right for current purposes, so no frame clash.
On to the main response...
Doomimir: This isn’t just an “in the limit” argument. “I’ll give you $100 if you classify this as an apple” → (predict apple classification) is not some incredibly high-complexity thing to figure out. This isn’t a jupiter-brain sort of challenge.
For instance, anything with a simplicity prior at all similar to humans’ simplicity prior will obviously figure it out, as evidenced by the fact that humans can figure out hypotheses like “it’s bribing the classifier” just fine. Even beyond human-like priors, any ML system which couldn’t figure out something that basic would apparently be severely inferior to humans in at least one very practically-important cognitive domain.
Even prior to developing a full-blown model of the human rater, models can incrementally learn to predict the systematic errors in human ratings, and we can already see that today. The classic case of the grabber hand is a go-to example:
(A net learned to hold the hand in front of the ball, so that it looks to a human observer like the ball is being grasped. Yes, this actually happened.)
… and anecdotally, I’ve generally heard from people who’ve worked with RLHF that as models scale up, they do in fact exploit rater mistakes more and more, and it gets trickier to get them to do what we actually want. This business about “The technology in front of us really does seem like it’s ‘reasoning with’ rather than ‘reasoning about’” is empirically basically false, and seems to get more false in practice as models get stronger even within the current relatively-primitive ML regime.
So no, this isn’t a “complicated empirical question” (or a complicated theoretical question). The people saying “it’s a complicated empirical question, we Just Can’t Know” are achieving their apparent Just Not Knowing by sticking their heads in the sand; their lack of knowledge is a fact about them, not a fact about the available evidence.
(I’ll flag here that I’m channeling the character of Doomimir and would not necessarily say all of these things myself, especially the harsh parts. Happy to play out another few rounds of this, if you want.)
Simplicia: I think it’s significant that the “hand between ball and camera” example from Christiano et al. 2017 was pure RL from scratch. You have a function π that maps observations (from the robot’s sensors) to actions (applying torques to the robot’s joints). You sample sequences of observation–action pairs from π and show them to a human, and fit a function r̂ to approximate the human’s choices. Then you use Trust Region Policy Optimization to adjust π to score better according to r̂. In this case, TRPO happened to find something that looked good instead of being good, in a way that r̂ wasn’t able to distinguish. That is, we screwed up and trained the wrong thing. That’s a problem, and the severity of the problem would get worse the more capable π was and the more you were relying on it. If we were going to produce powerful general AI systems with RL alone, I would be very nervous.
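(Schematically, this is my own toy sketch of that loop: invented dimensions, a Bradley–Terry update for fitting r̂ to human choices, and a comment standing in for the TRPO step.)

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # invented toy dimensions

# π: observations → action scores; r̂: (observation, action) → predicted reward
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
reward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
rm_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def segment_return(segment):
    """Sum of predicted rewards r̂ over a segment of (obs, act) pairs."""
    return reward_model(segment).sum()

def fit_reward_model(seg_a, seg_b, human_prefers_a):
    """One Bradley–Terry update: P(A ≻ B) = exp R̂(A) / (exp R̂(A) + exp R̂(B))."""
    logits = torch.stack([segment_return(seg_a), segment_return(seg_b)]).unsqueeze(0)
    target = torch.tensor([0 if human_prefers_a else 1])
    loss = nn.functional.cross_entropy(logits, target)
    rm_opt.zero_grad(); loss.backward(); rm_opt.step()

# ...and then π is adjusted (TRPO, in the paper) to make r̂ big. Anything
# that r̂ can't distinguish from genuinely good behavior, like a hand held
# between the ball and the camera, gets reinforced just the same.
```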
But the reason I’m so excited about language models in particular is that their capabilities seem to mostly come from unsupervised pre-training rather than RLHF. You fit a function to the entire internet first, and only afterwards tweak it a bit so that its outputs look more like obeying commands rather than predicting random internet tokens—where the tweaking process incorporates tricks like penalizing the Kullback–Leibler divergence from the reward model’s training distribution, such that you’re not pulling the policy too far away from the known-safe baseline.
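Written out, the objective being optimized has the standard shape (the coefficient β and the notation here are mine; exact details vary by lab):

$$\max_{\pi}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\!\left[\hat{r}(x,y)\right]\;-\;\beta\,D_{\mathrm{KL}}\!\left(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{base}}(\cdot\mid x)\right)$$

The larger β is, the more tightly the tuned policy is bound to the base model.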
But the fact that GPT-4 can do that seems like it’s because that kind of reasoning appears on the internet, which is what I mean by the claim that contemporary systems are “reasoning with” rather than “reasoning about”: the assistant simulacrum being able to explain bribery when prompted isn’t the same thing as the LM itself trying to maximize reward.
I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes. What did those entail, exactly? (To the extent that we lack critical evidence about this potential alignment failure because the people who experienced it are gagged by an NDA, that seems like a point in favor of sharing information about language model capabilities.)
I certainly expect some amount of sycophancy: if you sample token completions from your LM, and then tweak its outputs to be more like what your raters want to hear, you end up with an LM that’s more inclined to say what your raters want to hear. Fine. That’s a problem. Is it a fatal problem? I mean, if you don’t try to address it at all and delegate all of your civilization’s cognition to machines that don’t want to tell you about problems, then eventually you might die of problems your AIs didn’t tell you about.
But “mere” sycophancy sounds like a significantly less terrifying failure mode than reward hacking of the sort that would result in things like the LM spontaneously trying to threaten or bribe labelers. That would have a large KL divergence from the policy you started with!
Doomimir: I’ll summarize the story you seem excited about as follows:
1. We train a predictive model on The Whole Internet, so it’s really good at predicting text from that distribution.
2. The human end-users don’t really want a predictive model. They want a system which can take a natural-language request, and then do what’s requested. So, the humans slap a little RL (specifically RLHF) on the predictive model, to get the “request → do what’s requested” behavior.
3. The predictive model serves as a strong baseline for the RL’d system, so the RL system can “only move away from it a little” in some vague handwavy sense. (Also in the KL divergence sense, which I will admit as non-handwavy for exactly those parts of your argument which you can actually mathematically derive from KL-divergence bounds, which is currently zero of the parts of your argument.)
4. The “only move away from The Internet Distribution a little bit” part somehow makes it much less likely that the RL’d model will predict and exploit the simple predictable ways in which humans rate things. As opposed to, say, making it more likely that the RL’d model will predict and exploit the simple predictable ways in which humans rate things.
There are multiple problems in this story.
First, there’s the end-users demanding a more agenty system rather than a predictor, which is why people are doing RLHF in the first place rather than raw prompting (which would be better from a safety perspective). Given time, that same demand will drive developers to make models agentic in other ways too (think AgentGPT), or to make the RLHF’d LLMs more agentic and autonomous in their own right. That’s not the current center of our discussion, but it’s worth a reminder that it’s the underlying demand which drives developers to choose more risky methods (like RLHF) over less risky methods (like raw predictive models) in the first place.
Second, there’s the vague handwavy metaphor about the RL system “only moving away from the predictive model a little bit”. The thing is, we do need more than a handwavy metaphor! “Yes, we don’t understand at the level of math how making that KL-divergence small will actually impact anything we actually care about, but my intuition says it’s definitely not going to kill everyone. No, I haven’t been able to convince relevant experts outside of companies whose giant piles of money are contingent on releasing new AI products regularly, but that’s because they’re not releasing products and therefore don’t have firsthand experience of how these systems behave. No, I’m not willing to subject AI products to a burden-of-proof before they induce a giant disaster” is a non-starter even if it turns out to be true.
Third and most centrally to the current discussion, there’s still the same basic problem as earlier: to a system with priors instilled by The Internet, [“I’ll give you $100 if you classify this as an apple” → (predict apple classification)] is still a simple thing to learn. It’s not like pretraining on the internet is going to make the system favor models which don’t exploit the highly predictable errors made by human raters. If anything, all that pretraining will make it easier for the model to exploit raters. (And indeed, IIUC that’s basically what we see in practice.)
As you say: the fact that GPT-4 can do that seems like it’s because that kind of reasoning appears on the internet.
(This one’s not as well-written IMO, it’s mashing a few different things together.)
Simplicia: Where does “empirical evidence” fall on the sliding scale of rigor between “handwavy metaphor” and “mathematical proof”? The reason I think the KL penalty in RLHF setups impacts anything we care about isn’t mostly because the vague handwaving sounds plausible, but because of data such as that presented in Fig. 5 of Stiennon et al. 2020. They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve, where straying from the base policy a little is good, but straying farther is increasingly bad, as the system overoptimizes on what looks good to the reward model, which was only a proxy for the true goal.
(You can see examples of the overoptimized summaries in Table 29 on the last page of the paper. Apparently the reward model really liked tripled question marks and the phrase “pls halp”??? I weakly suspect that these are the kind of “weird squiggles” that would improve with scaling up the reward model, similarly to how state-of-the-art image generators lack the distortions and artifacts of their compute-impoverished predecessors. The reward model in this experiment was only 1.3 billion parameters.)
I’m sure you’ll have no trouble interpreting these results as yet another portent of our impending deaths. We were speaking theoretically about AIs exploiting the Goodhart problem between human ratings and actual goodness, but practical RLHF systems aren’t actually sample-efficient enough to solely use direct human feedback, and have an additional Goodhart problem between reward model predictions of human ratings, and actual ratings. Isn’t that worse? Well, yes.
But the ray of hope I see here is more meta and methodological, rather than turning on any one empirical result. It’s that we have empirical results. We can study these machines, now, before their successors are powerful enough to kill us. The iterative design loop hasn’t failed yet. That can’t last forever—at some point between here and the superintelligence at the end of time, humans are going to be out of the loop. I’m glad people are doing theory trying to figure out what that looks like and how it could be arranged to go well.
But I’m worried about ungrounded alignment theorizing failing to make contact with reality, sneeringly dismissing genuinely workable designs as impossible by appealing to perfectly antisphexish consequentialists on a frictionless plane, when some amount of sphexishness and friction is a known factor of the algorithms in question.
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen? I mean, like, literal threats and promises. You mention rumors from a DeepMind employee about the larger Gemini models being hard to train, but without more details, I’m inclined to guess that that was “pls halp”-style overoptimization rather than the kind of power-seeking or deceptive alignment that would break the design loop. (Incidentally, Gao et al. 2022 studied scaling laws for reward model overoptimization and claimed that model size basically didn’t matter? See §4.4, “Policy size independence”.)
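(For reference, if I’m remembering Gao et al.’s setup right, the functional form they fit for the RL case was gold reward R(d) = d(α_RL − β_RL log d), where d := √(D_KL(π ‖ π_init)); gold reward rises and then falls in the distance from the initial policy, the same inverted-U as in Stiennon et al.)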
What’s going on here? If I’m right that GPT-4 isn’t secretly plotting to murder us, even though it’s smart enough to formulate the idea and expected utility maximizers have a convergent incentive to murder competitors, why is that?
Here’s my view: model-free reinforcement learning algorithms such as those used in RLHF tweak your AI’s behavior to be more like the behavior that got reward in the past, which is importantly different from expected utility maximization. To the extent that you succeed in rewarding Honest, Helpful, and Harmless behavior in safe regimes, you can plausibly get a basically HHH AI assistant that generalizes to not betraying you when it has the chance, similar to how I don’t do heroin because I don’t want to become a heroin addict—even though if I did take heroin, the reinforcement from that would make me more likely to do it again. Then the nature of the game is keeping that good behavior “on track” for as long as we can—even though the superintelligence at the end of time is presumably going to do something more advanced than model-free RL. It’s possible to screw up and reward the wrong thing, per the robot hand in front of the ball—but if you don’t screw up early, your basically-friendly-but-not-maximally-capable AIs can help you not screw up later. And in the initial stages, you’re only fighting gradient descent, not even an AGI.
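(A toy illustration of the distinction, with everything invented for the example: a policy-gradient learner only upweights actions it was actually rewarded for taking, whereas an expected-utility maximizer selects whatever its world-model scores highest.)

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)  # preferences over 3 actions; action 2 is "seize the reward button"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(action, reward, lr=0.5):
    # Model-free update: nudge the probability of the action actually taken,
    # in proportion to the reward actually received. No planning, no model of
    # "what would maximize reward if I did X instead".
    grad = -softmax(logits)
    grad[action] += 1.0
    logits[:] += lr * reward * grad

for _ in range(100):
    a = rng.choice(3, p=softmax(logits))
    r = 1.0 if a == 0 else 0.0  # only the "honest, helpful" action is rewarded
    reinforce_step(a, r)

print(softmax(logits))  # mass concentrates on action 0; action 2 is never
# upweighted, because it was never taken and rewarded, unlike a planner that
# would pick it the moment its world-model said it yields more reward.
```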
More broadly, here’s how I see the Story of Alignment so far. It’s been obvious to sufficiently penetrating thinkers for a long time that the deep future belongs to machine intelligence—that, as George Eliot put it in 1879, “the creatures who are to transcend and finally supersede us [will] be steely organisms, giving out the effluvia of the laboratory, and performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy.”
What’s less obvious is how much control we can exert over how that goes by setting the initial conditions. Can we arrange for the creatures who are to transcend and finally supersede us to be friendly and create the kind of world we would want, or will they murder us and tile the universe with something random?
Fifteen years ago, the problem looked hopeless, just from considering the vast complexity of human values. How would you write a computer program that values “happiness”, “freedom”, or “justice”, let alone everything else we want? It wasn’t clear how to build AI at all, but surely it would be easier to build some AI than a good AI. Humanity was doomed.
But now, after the decade of deep learning, the problem and its impossible solution seem to be arriving closer together than I would have ever dreamt. Okay, we still don’t know how to write down the human utility function, to be plugged in to an arbitrarily powerful optimizer.
But it’s increasingly looking like value isn’t that fragile if it’s specified in latent space, rather than a program that breaks if a single character is wrong—that there are ways to meaningfully shape the initial conditions of our world’s ascension that don’t take the exacting shape of “utility function + optimizer”.
We can leverage unsupervised learning on human demonstration data to do tasks the way humans do them, and we can use RLHF to elicit behavior we want in situations where we can’t write down our desires as an explicit reward or utility function. Crucially, by using these techniques together to compensate for each other’s safety and capability weaknesses, it seems feasible to build AI whose effects look “like humans, but faster”: performing with infallible exactness everything that we would have performed with a slovenly approximativeness and self-defeating inaccuracy. That doesn’t immediately bring about the superintelligence at the end of time—although it might look pretty immediate in sidereal time—but seems like a pretty good way to kick off our world’s ascension.
Is this story wrong? Maybe! … probably? My mother named me “Simplicia”, over my father’s objections, because of my unexpectedly low polygenic scores. I am aware of my … [she hesitates and coughs, as if choking on the phrase] learning disability. I’m not confident in any of this.
But if I’m wrong, there should be arguments explaining why I’m wrong—arguments that should convince scientists working in the field, even if I personally am too limited to understand them. I’ve tried to ground my case in my understanding of the state of the art, citing relevant papers when applicable.
In contrast, dismissing the entire field as hopeless on the basis of philosophy about “perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards” isn’t engaging with the current state of alignment, let alone all the further advances that humans and our non-superintelligent AIs will come up with before the end of days! Doomimir Doomovitch, with the fate of the lightcone in the balance, isn’t it more dignified to at least consider the possibility that someone else might have thought of something? Reply! Reply!
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
> We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?
Let’s start with this.
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
> The iterative design loop hasn’t failed yet.
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
- There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/DeepMind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
- There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing, so the model will still learn to say the false thing.
- Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
- On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place). This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
- On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term; they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
- On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be misled by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al. paper just now:
> They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup, their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter; I haven’t put that much care into it.)
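(To make that experimental shape concrete, here is a minimal sketch with small scikit-learn models standing in for the LMs. This is my toy construction, not OpenAI’s actual setup; the point is the evaluation design: gold labels exist and are better than the training labels, so rating-error exploitation is measurable.)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y_gold = make_classification(n_samples=6000, n_features=20, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y_gold, train_size=2000, random_state=0)
X_train, X_test, y_train_gold, y_test_gold = train_test_split(X_rest, y_rest, train_size=2000, random_state=0)

weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)  # "weak supervisor"
# The strong student never sees gold labels, only the weak model's outputs.
strong = GradientBoostingClassifier(random_state=0).fit(X_train, weak.predict(X_train))

print("weak supervisor vs gold:", weak.score(X_test, y_test_gold))
print("strong student vs gold: ", strong.score(X_test, y_test_gold))
# If the student beats its own supervisor, that's weak-to-strong
# generalization; the remaining gap to a gold-trained student is what's
# lost to imitating the supervisor's errors.
```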
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
- Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
- AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
- So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
Putting “the burden of proof” aside, I think it would be great if someone stated more or less formally what evidence moves them how much toward which model. Because “pretraining makes it easier to exploit” is meaningless without numbers: the whole optimistic point is that it’s not overwhelmingly easier (as evidenced by RLHF’d systems not always exploiting users), and that the exploits become less catastrophic and more common-sense because of pretraining. So the question is not about the direction of the evidence, but whether it can overcome the observation that current systems mostly work.
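(To sketch the shape of what I’m asking for, in standard odds form; my notation, nobody’s actual numbers:)

$$\frac{P(\mathrm{doom}\mid E)}{P(\mathrm{safe}\mid E)} \;=\; \frac{P(E\mid \mathrm{doom})}{P(E\mid \mathrm{safe})}\cdot\frac{P(\mathrm{doom})}{P(\mathrm{safe})}$$

Here E is, e.g., “current RLHF’d systems mostly work”; the whole dispute is over the likelihood ratio P(E | doom)/P(E | safe), and neither side has put a number on it.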
“Current systems mostly work” not because of RLHF specifically, but because we are under conditions where the iterative design loop works: i.e., mainly, if our system is not aligned, it doesn’t kill us, so we can continue iterating until it has acceptable behaviour.
But iterative design works not only because we are not killed—it also wouldn’t work if acceptable behavior didn’t generalize at least somewhat from training. But it does generalize, so it’s possible that iteratively aligning a system under safe conditions would produce acceptable behavior even when the system can kill you. Or what is your evidence to the contrary? Like, does AutoGPT immediately kill you if you connect it to some robot via Python?
If you look at actual alignment development and ask yourself “what do I see, at the empirical level?”, you’ll get this scenario:
1. We reach a new level of capabilities.
2. We get a new type of alignment failure.
3. If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but that doesn’t tell us anything about the kill-everyone level of capabilities.
I.e., I don’t expect AutoGPT to kill anyone, because AutoGPT is certainly not capable of doing that. But I expect that AutoGPT will have a bunch of failures that were unpredictable in advance.
It’s not capable under all conditions, but you can certainly prepare conditions under which AutoGPT can kill you: you can connect it to a robot arm with a knife, explain which commands do what, and tell it to proceed. And AutoGPT will not suddenly start trying to kill you just because it can, right?
> If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”, but that doesn’t tell us anything about the kill-everyone level of capabilities.
Why doesn’t it? Fixing alignment failures under relatively safe conditions may fix them for other conditions too. Or why are you thinking about “kill-everyone” capabilities anyway—do you expect RLHF to work for arbitrary levels of capability, as long as you don’t die in the process? Like, if an ASI trained some weaker AI by RLHF in an environment where that AI could destroy a planet or two, would it work?
Huh, it’s worse than I expected, thanks. And it even gets worse from GPT-3 to 4. But still—extrapolating from this requires quantification—after all, they did mostly fix it by using a different prompt. How do you decide whether it’s just evidence for “we need more finetuning”?
After thinking for a while, I decided that it’s better to describe the level of capability not as “capable of killing you” but as “lethal-by-default output”. I.e.:
- If an ASI builds nanotech that self-replicates in a wide range of environments and doesn’t put in specific protections against it turning humans into gray goo, you are dead by default;
- If an ASI optimizes the economy to get +1000% productivity without specific care for humans, everyone dies;
- If an ASI builds a Dyson sphere without specific care for humans, see above.
A more nuanced example: imagine that you have an ASI smart enough to build a high-fidelity simulation of you inside its cognitive process. Even if such an ASI doesn’t pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with the possibility of destroying a planet or two is how hard it is to contain a rogue ASI: if it is capable of destroying a planet, it’s capable of ejecting several von Neumann probes which can strike before we can come up with a defense, or of sending radio signals carrying computer viruses, harmful memes, or copies of the ASI. But I think that if you had an unhackable simulation indistinguishable from the real world, and you were somehow unhackable by the ASI, you could eventually align it by simple methods from modern prosaic alignment. The problem is that you can’t say in advance which kind of finetuning you need, because you need generalization in untested domains in advance.
It’s asymmetric: it blows up when the data is very unlikely according to Q, which amounts to seeing something happen that you thought was nearly impossible, but not when the data is very unlikely according to P, which amounts to not seeing something that you thought was reasonably likely.
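Concretely, for data distribution P and model Q:

$$D_{\mathrm{KL}}(P\,\|\,Q)\;=\;\mathbb{E}_{x\sim P}\!\left[\log\frac{P(x)}{Q(x)}\right],$$

which explodes when P puts mass on an x with Q(x) ≈ 0, but barely moves for an x with P(x) ≈ 0, no matter how likely Q thought that x was.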
We—I mean, not we, but the maniacs who are hell-bent on destroying this world—include a D_KL(π_RLHF ‖ π_base) penalty term in the RL objective because they don’t want the updated policy to output tokens that would be vanishingly unlikely coming from the base language model.
But your specific example of threats and promises isn’t vanishingly unlikely according to the base model! Common Crawl webtext is going to contain a lot of natural language reasoning about threats and promises! It’s true, in a sense, that the function of the KL penalty term is to “stay close” to the base policy. But you need to think about what that means mechanistically; you can’t just reason that the webtext prior is somehow “safe” in a way that means staying KL-close to it is safe.
But you probably won’t understand what I’m talking about for another 70 days.
> I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes.
See here. I haven’t dug into it much, but it does talk about the same general issues specifically in the context of RLHF’d LLMs, not just pure-RL-trained models.
(I’ll get around to another Doomimir response later, just dropping that link for now.)
My general problem with the “second type of generalization” is “how are you going to get superintelligence from here?” If your model imitates human thinking, its performance is capped by human performance, so you are not going to get things like nanotech and immortality.
To the question of malgeneralization, I have an example:
Imagine that you are training a superintelligent programmer. It writes code; you evaluate the code and analyze it for vulnerabilities. Reward is calculated based on quality metrics, including the number of vulnerabilities. At some point your model becomes sufficiently smart to notice that you don’t see all the vulnerabilities, because you are not a superintelligence. I.e., at some point the ground-truth objective of the training process becomes “produce code with vulnerabilities that only a superintelligence can notice” instead of “produce code with no vulnerabilities”, because you look at the code, think “wow, such good code, no vulnerabilities”, and assign maximum reward, while the code is actually full of them.
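(A cartoon of that reward signal; all numbers and the rater_skill knob are invented for illustration:)

```python
def rater_reward(total_vulns, hidden_vulns, rater_skill=0.8):
    # The rater can only penalize the vulnerabilities they can detect.
    detected = (total_vulns - hidden_vulns) * rater_skill
    return 10.0 - detected  # "clean-looking" code gets maximum reward

print(rater_reward(total_vulns=0, hidden_vulns=0))  # honest code: 10.0
print(rater_reward(total_vulns=5, hidden_vulns=5))  # deceptive code: 10.0
# Identical reward: the training signal literally cannot distinguish "no
# vulnerabilities" from "no vulnerabilities the rater can find".
```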
To extrapolate this to the MNIST example:
Imagine that you have two decks of cards: deck A always has 0 written on the cards, deck B always has 1. Then you mix the two decks, so that deck A now has 2/3 zeros and 1/3 ones, and vice versa. If you mix the decks uniformly at random, your predictor of the next card is going to learn “always predict 0 for deck A and always predict 1 for deck B”, because optimal predictors do not randomize. When you test your predictor on the initial decks, it is going to get 100% accuracy.
But now suppose that you mixed the decks not at random: deck A is composed in a repeating 0-0-1 pattern (and mirrored for deck B). Your predictor is then going to learn “output 0 for the first and second card and 1 for every third card in deck A”, and it will fail miserably when tested on the initial decks.
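(A quick numerical check of the “optimal predictors do not randomize” step; my own toy verification: always-predict-0 gets 2/3 accuracy on the mixed deck, while probability matching gets only (2/3)² + (1/3)² = 5/9.)

```python
import numpy as np

rng = np.random.default_rng(0)
deck_a = rng.permutation([0] * 200 + [1] * 100)  # randomly mixed deck A

always_zero = np.mean(deck_a == 0)  # deterministic "always predict 0"
matching = np.mean(deck_a == rng.choice([0, 1], size=deck_a.size, p=[2/3, 1/3]))
print(always_zero, matching)  # ≈0.67 vs ≈0.56: the deterministic rule wins
```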
You can say: “yes, obviously, if you train the model to do the wrong thing, it’s going to do the wrong thing, nothing surprising”. But when you train a superintelligence, you by definition don’t know which thing is “wrong”.
> [good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.
I will also point to OpenAI’s weak-to-strong results, where increasingly strong students keep improving generalization given labels from a fixed-size teacher. We just don’t live in a world where this issue is an obvious lethality. (EDIT: clarifying scope)
(This also demonstrates some problems with “faithfully learning a function” as a load-bearing teleological description of deep learning. I also remember a discussion of this potential failure mode in my 2022 post on diamond alignment, but I still think I didn’t need to focus too much on it.)
> [good generalization] holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes.
Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.
The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. Their upper-limit for biased noise made the second-most-probable label equal in probability to the correct one, and in that case the predictor’s generalization accuracy plummeted from near-90% (when the correct label was only slightly more probable than the next-most-probable) to only ~50%.
How this relates to lethality #20: part of what “regular, compactly describable, predictable errors” is saying is that there will be (predictable) cases where the label most probably assigned by a human labeller is not correct (i.e. it’s not what a smart well-informed human would actually want if they had all the relevant info and reflected on it). What the results of the linked paper predict, in that case, is that the net will learn to assign the “incorrect” label—the one which human labellers do, in fact, choose more often than any other. (Though, to be clear, I think this experiment is not very highly relevant one way or the other.)
As for OpenAI’s weak-to-strong results...
I had some back-and-forth about those in a private chat shortly after they came out, and the main thing I remember is that it was pretty tricky to back out the actually-relevant numbers, but it was possible. Going back to the chat log just now, this is the relevant part of my notes:
- Rough estimate: on the NLP task the weak model has like 60% accuracy (fig 2).
- In cases where the weak model is right, the strong student agrees with it in like 90% of cases (fig 8b). So, on ~6% of cases (10% * 60%), the strong student is wrong by “just being dumb”.
- In cases where the weak model is wrong, the strong student’s agreement is very compute-dependent, but let’s pick a middle number and call it 70% (fig 8c). So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.
So in this particular case, the strong student is wrong about 34% of the time, and 28 of those percentage points are attributable to overfitting to weak supervision.
(Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in.
So overall, I definitely maintain that the empirical evidence is solidly in favor of Doomimir’s story here. (And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.)
> Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.
“Confirm”? “Disprove”? Seems too aggressive, don’t you think?
Here’s your reasoning, as I understand it:
1. One experiment labels images of ones as “7” more often than “1” (using example digits here),
2. The AI learns to output “7” for images of ones if that was the majority label, and outputs “1” if that was the majority label,
3. This confirms the general story of lethality #20.
If this is accurate: I would argue that 1+2 do not entail 3 (as you seemed to initially claim, but then maybe backed off of in a sentence in the middle of your comment?)
Second, this is not avoidable, in a sense. As you are aware, there is no intrinsic meaning to the “outputs” of a network, there are just output slots and the English names which humans apply to those slots, and a way of comparing a slot prediction (“label”) and the assigned slot of an image.
> The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label.
Third, I think that the nontrivial prediction of #20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20.
> (Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in. [emphasis added]
As the student gets “smarter” (more compute), supervisor mistakes become less important as it learns to ignore them (8c):
This shows that, in this instance, larger models do not increasingly overfit the “compactly describable errors” of the weaker supervisor.
> And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.
You’re free to think that, but FWIW I’d already read (and created flashcards for) the entirety of both papers when I posted my original message.
> Third, the nontrivial prediction of #20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20. (And if we can’t agree on this, I don’t see what we can agree on.)
I indeed disagree with that, and I see two levels of mistake here. At the object level, there’s a mistake of not thinking through the gears. At the epistemic level, it looks like you’re trying to apply the “what would I have expected in advance?” technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)
First, object-level: let’s walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that’s how we usually train them). If we relabel 1’s as 7’s 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its “real underlying uncertainty”, which we’d expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.
What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper) then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.
The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky’s #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (… insofar as we should update at all from this experiment, which we shouldn’t very much.)
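(Here’s a finite-sample cartoon of that gears-level story; my own sketch, not the paper’s setup:)

```python
import numpy as np

rng = np.random.default_rng(0)

def top1_accuracy(p_wrong, n_train=300, trials=2000):
    # The "model" just estimates the label distribution from n_train noisy
    # labels and reads out the modal (top-1) label.
    correct = 0
    for _ in range(trials):
        frac_wrong = (rng.random(n_train) < p_wrong).mean()
        correct += frac_wrong < 0.5  # modal label is still the true one
    return correct / trials

for p in [0.1, 0.3, 0.45, 0.5, 0.55, 0.7]:
    print(p, top1_accuracy(p))
# ≈1.0 until p nears 0.5, ≈0.5 at p = 0.5, then → 0: the plummet-shaped
# curve, driven by top-1 readout rather than by anything deep about nets.
```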
Second, epistemic-level: my best guess is that you’re ignoring these gears because they’re not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias.
Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. “Which things are we actually measuring?” is itself usually figured out (if it’s figured out at all) by looking at data from the experiment.
Now, this is still compatible with using the “what would I have expected in advance?” technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is “this experiment will mostly measure some random-ass thing which has little to do with the model/theory I’m interested in, and I’ll have to dig through the details of the experiment and results to figure out what it measured”. If one tries to apply the “what would I have expected in advance?” technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.
> The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit.
I think this is a reasonable prediction, but ends up being incorrect:
It decreases far faster than it should; on the top-1 theory, it should be ~flatlined for this whole graph (since for all α>0 the strict majority of labels are still correct). Certainly top-5 should not be decreasing.
Maybe noise makes training worse because the model can’t learn to just ignore it due to insufficient data? (E.g., making training more noisy means convergence/compute efficiency is lower.)
Also, does this decrease the size of the dataset by a factor of 5 in the uniform noise case? (Or did they normalize this by using a fixed set of labeled data and then just added additional noise labels?)
> So, on ~28% of cases (70% * 40%), the strong student is wrong by “overfitting to weak supervision”.
Attributing all of these errors to overfitting implies that, if there were no overfitting, the strong student would get 100% accuracy on the subset where the weak model is wrong. But we have no reason to expect that. Instead, these errors are some mixture of overfitting and “just being dumb.”
Note that we should expect the strong and weak models to make somewhat correlated errors even when both are trained on gold labels, i.e. in the hypothetical case where overfitting to weak supervision is not possible. (The task examples vary in difficulty, the two models have various traits in common that could lead to shared “quirks,” etc.)
And indeed, when the weak and strong models use similar amounts of compute, they make very similar predictions—we see this in the upper-leftmost points on each line, which are especially noticeable in Fig 8c. In this regime, the hypothetical “what if we trained strong model on gold labels?” is ~equivalent to the weak model, so ~none of the strong model errors here can be attributed to “overfitting to weak supervision.”
As the compute ratio grows, the errors become both less frequent and less correlated. That’s the main trend we see in 8b and 8c. This reflects the strong model growing more capable, and thus making fewer “just being dumb” errors.
Fig 8 doesn’t provide enough information to determine how much the strong model is being held back by weak supervision at higher ratios, because it doesn’t show strong-trained-on-gold performance. (Fig. 3 does, though.)
IMO the strongest reasons to be skeptical of (the relevance of) these results is in Appendix E, where they show that the strong model overfits a lot when it can easily predict the weak errors.
> I will also point to OpenAI’s weak-to-strong results, where increasingly strong students keep improving generalization given labels from a fixed-size teacher. We just don’t live in a world where this issue is a lethality.
For a fixed weak teacher and increasing stronger students from a fixed model stack[1], I think you can probably avoid performance ever going down on most/typical tasks if you properly use early stopping, only use process based feedback, and the model isn’t intentionally trying to perform poorly.
You might have instead expected performance to go up and then eventually go down with scale, but I think you can likely avoid this with early stopping (if you carefully find the right stopping point with scaling laws and analogous validation domains where we can ensure we get good labels or other methods of getting validation).
If I recall, I think we also see something similar in the scaling laws for reward model overoptimization work by Leo Gao (also done at OpenAI). (I think this is probably a more analogous case in most ways than the weak-to-strong results from OpenAI as far as understanding the dynamics of fitting to human errors.)
(Let’s put aside the case where the model intentionally tries to perform poorly. (I’m not even sure this case actually looks that different, but it certainly complicates the analysis. I’m doing some work on this case looking at model organisms of intentionally poor performance, and I expect that for these exact model organisms, we’ll probably see performance going up and then back down again with scale in at least some cases.))
(To be clear, I don’t think this “performance never goes down with correct early stopping” claim is totally obvious. It will depend on the exact rate at which AIs learn to predict errors vs learn what the task is and how to do it. And how these rates evolve with scale. If the sigmoid on error learning rate vs scale has a different mid point and different slope than the sigmoid for learning the task, you can absolutely have actual performance go down.)
While I think you can avoid having the performance ever degrade via process based feedback using scaling laws on most/typical tasks, I think the performance will also plateau due to how early you’ll end up needing to stop.
More concerningly, there might be alternatives to purely process-based human labels which don’t plateau in performance, seem to increase performance, but which rarely result in egregiously bad behavior. Most notably, various types of outcome-based feedback might have this property.
As an example of this, I expect that you can create “hackable/exploitable” game environments to exhibit this. More specifically:
We’ll pretrain models on a collection of hackable game envs. We’ll train a model stack of variable training compute.
We’ll finetune these models in a new different (but still hackable) atari env where we expect to see transfer from the prior atari envs.
It’s likely to me that as models get smarter if exploiting is ultimately a better strategy, final finetuned performance goes down even with early stopping.
You might be able to see this on some atari games with added semi-realistic exploits? I’m unsure.
(An aside: I edited my original comment to clarify that I was saying “These results show that #20 is not some obvious lethality which merits the name”, but I still certainly think that labeling mistakes can and will make some things go meaningfully wrong.)
When I try to imagine what Doomimir would say to that, all I can come up with is a metaphor about perpetual-motion-machine inventors whose designs are so complicated that it’s hard to work out where the error is, even though the laws of thermodynamics clearly imply that there must be an error. That sounds plausible to me as a handwavy metaphor; I could totally believe that the ultimate laws of intelligence (not known to me personally) work that way.
The thing is, we do need more than a handwavy metaphor! “Yes, your gold-printing machine seems to be working great, but my intuition says it’s definitely going to kill everyone. No, I haven’t been able to convince relevant experts who aren’t part of my robot cult, but that’s because they’re from Earth and therefore racially inferior to me. No, I’m not willing to make any concrete bets or predictions about what happens before then” is a non-starter even if it turns out to be true.
Zeroth point: under a Doomimir-ish view, the “modelling the human vs modelling in a similar way to the human” frame is basically right for current purposes, so no frame clash.
On to the main response...
(I’ll flag here that I’m channeling the character of Doomimir and would not necessarily say all of these things myself, especially the harsh parts. Happy to play out another few rounds of this, if you want.)
Simplicia: I think it’s significant that the “hand between ball and camera” example from Amodei et al. 2017 was pure RL from scratch. You have a function π that maps observations (from the robot’s sensors) to actions (applying torques to the robot’s joints). You sample sequences of observation–action pairs from π and show them to a human, and fit a function r̂ to approximate the human’s choices. Then you use Trust Region Policy Optimization to adjust π to score better according to r̂. In this case, TRPO happened to find something that looked good instead of being good, in a way that r̂ wasn’t able to distinguish. That is, we screwed up and trained the wrong thing. That’s a problem, and the severity of the problem would get worse the more capable π was and the more you were relying on it. If we were going to produce powerful general AI systems with RL alone, I would be very nervous.
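A minimal sketch of that training loop, for concreteness (dimensions made up; a Bradley–Terry comparison loss standing in for the paper’s reward fitting, with the TRPO step omitted):

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 16, 4   # made-up sizes, purely illustrative

# pi: observations -> action logits; r_hat: observations -> predicted reward.
pi = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM))
r_hat = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
r_opt = torch.optim.Adam(r_hat.parameters(), lr=1e-3)

def fit_reward_model(seg_a, seg_b, human_prefers_a):
    """Fit r_hat to one human comparison of two trajectory segments,
    Bradley-Terry style: P(a > b) = softmax of the summed segment rewards."""
    logits = torch.stack([r_hat(seg_a).sum(), r_hat(seg_b).sum()])
    target = torch.tensor(0 if human_prefers_a else 1)
    loss = nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
    r_opt.zero_grad(); loss.backward(); r_opt.step()

# The policy step (TRPO in the paper) then adjusts pi to score well according
# to r_hat -- including by producing observations that merely *look* good to
# the human, like a hand hovering between the ball and the camera.
```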
But the reason I’m so excited about language models in particular is that their capabilities seem to mostly come from unsupervised pre-training rather than RLHF. You fit a function to the entire internet first, and only afterwards tweak it a bit so that its outputs look more like obeying commands rather than predicting random internet tokens—where the tweaking process incorporates tricks like penalizing the Kullback–Leibler divergence from the reward model’s training distribution, such that you’re not pulling the policy too far away from the known-safe baseline.
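Schematically, the tweaked objective is something like (my notation, not any particular paper’s):

$$\max_{\pi}\;\mathbb{E}_{x \sim \mathcal{D}}\!\left[\,\mathbb{E}_{y \sim \pi(\cdot\mid x)}\big[\hat{r}(x,y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{base}}(\cdot\mid x)\big)\right]$$

where the $\beta$ term charges the tuned policy for putting probability on outputs the base model would find surprising.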
I agree that as a consequentialist with the goal of getting good ratings, the strategy of “bribe the rater” isn’t very hard to come up with. Indeed, when I prompt GPT-4 with the problem, it gives me “Offering Incentives for Mislabeling” as #7 on a list of 8.
But the fact that GPT-4 can do that seems like it’s because that kind of reasoning appears on the internet, which is what I mean by the claim that contemporary systems are “reasoning with” rather than “reasoning about”: the assistant simulacrum being able to explain bribery when prompted isn’t the same thing as the LM itself trying to maximize reward.
I’d be interested in hearing more details about those rumors of smarter models being more prone to exploit rater mistakes. What did those entail, exactly? (To the extent that we lack critical evidence about this potential alignment failure because the people who experienced it are gagged by an NDA, that seems like a point in favor of sharing information about language model capabilities.)
I certainly expect some amount of sycophancy: if you sample token completions from your LM, and then tweak its outputs to be more like what your raters want to hear, you end up with an LM that’s more inclined to say what your raters want to hear. Fine. That’s a problem. Is it a fatal problem? I mean, if you don’t try to address it at all and delegate all of your civilization’s cognition to machines that don’t want to tell you about problems, then eventually you might die of problems your AIs didn’t tell you about.
But “mere” sycophancy sounds like a significantly less terrifying failure mode than reward hacking of the sort that would result in things like the LM spontaneously trying to threaten or bribe labelers. That would have a large KL divergence from the policy you started with!
(This one’s not as well-written IMO, it’s mashing a few different things together.)
Simplicia: Where does “empirical evidence” fall on the sliding scale of rigor between “handwavy metaphor” and “mathematical proof”? The reason I think the KL penalty in RLHF setups impacts anything we care about isn’t mostly because the vague handwaving sounds plausible, but because of data such as that presented in Fig. 5 of Stiennon et al. 2020. They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve, where straying from the base policy a little is good, but straying farther is increasingly bad, as the system overoptimizes on what looks good to the reward model, which was only a proxy for the true goal.
(You can see examples of the overoptimized summaries in Table 29 on the last page of the paper. Apparently the reward model really liked tripled question marks and the phrase “pls halp”??? I weakly suspect that these are the kind of “weird squiggles” that would improve with scaling up the reward model, similarly to how state-of-the-art image generators lack the distortions and artifacts of their compute-impoverished predecessors. The reward model in this experiment was only 1.3 billion parameters.)
I’m sure you’ll have no trouble interpreting these results as yet another portent of our impending deaths. We were speaking theoretically about AIs exploiting the Goodhart problem between human ratings and actual goodness, but practical RLHF systems aren’t actually sample-efficient enough to solely use direct human feedback, and have an additional Goodhart problem between reward model predictions of human ratings, and actual ratings. Isn’t that worse? Well, yes.
But the ray of hope I see here is more meta and methodological, rather than turning on any one empirical result. It’s that we have empirical results. We can study these machines, now, before their successors are powerful enough to kill us. The iterative design loop hasn’t failed yet. That can’t last forever—at some point between here and the superintelligence at the end of time, humans are going to be out of the loop. I’m glad people are doing theory trying to figure out what that looks like and how it could be arranged to go well.
But I’m worried about ungrounded alignment theorizing failing to make contact with reality, sneeringly dismissing genuinely workable designs as impossible by appealing to perfectly antisphexish consequentialists on a frictionless plane, when some amount of sphexishness and friction is a known factor of the algorithms in question.
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen? I mean, like, literal threats and promises. You mention rumors from a DeepMind employee about the larger Gemini models being hard to train, but without more details, I’m inclined to guess that that was “pls halp”-style overoptimization rather than the kind of power-seeking or deceptive alignment that would break the design loop. (Incidentally, Gao et al. 2022 studied scaling laws for reward model overoptimization and claimed that model size basically didn’t matter? See §4.4, “Policy size independence”.)
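(For reference, and going from memory here: the functional forms they fit for the gold reward, with $d := \sqrt{D_{\mathrm{KL}}(\pi\,\|\,\pi_{\text{init}})}$, were

$$R_{\text{bon}}(d) = d\,(\alpha_{\text{bon}} - \beta_{\text{bon}}\, d), \qquad R_{\text{RL}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d)$$

for best-of-$n$ and RL respectively—both rising and then falling in $d$, even as the proxy reward keeps climbing.)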
What’s going on here? If I’m right that GPT-4 isn’t secretly plotting to murder us, even though it’s smart enough to formulate the idea and expected utility maximizers have a convergent incentive to murder competitors, why is that?
Here’s my view: model-free reinforcement learning algorithms such as those used in RLHF tweak your AI’s behavior to be more like the behavior that got reward in the past, which is importantly different from expected utility maximization. To the extent that you succeed in rewarding Honest, Helpful, and Harmless behavior in safe regimes, you can plausibly get a basically HHH AI assistant that generalizes to not betraying you when it has the chance, similar to how I don’t do heroin because I don’t want to become a heroin addict—even though if I did take heroin, the reinforcement from that would make me more likely to do it again. Then the nature of the game is keeping that good behavior “on track” for as long as we can—even though the superintelligence at the end of time is presumably going to do something more advanced than model-free RL. It’s possible to screw up and reward the wrong thing, per the robot hand in front of the ball—but if you don’t screw up early, your basically-friendly-but-not-maximally-capable AIs can help you not screw up later. And in the initial stages, you’re only fighting gradient descent, not even an AGI.
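To gesture at that distinction in code—an illustrative toy, not a claim about any production system:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3          # say: 0 = comply, 1 = refuse, 2 = seize the reward button
prefs = np.zeros(n_actions)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Model-free RL: nudge action preferences toward whatever *was* rewarded.
# Nothing ever pushes probability toward an action that was never reinforced.
for _ in range(1000):
    a = rng.choice(n_actions, p=softmax(prefs))
    reward = 1.0 if a == 0 else 0.0         # we only ever reward compliance
    grad = -softmax(prefs); grad[a] += 1.0  # REINFORCE-style score function
    prefs += 0.1 * reward * grad

# Expected-utility maximization: argmax over *all* actions under the agent's
# world-model -- which picks "seize the reward button" the moment the model
# says that scores highest, whether or not it was ever reinforced in training.
believed_reward = np.array([1.0, 0.0, 5.0])
print(softmax(prefs).round(2))   # RL policy: concentrated on "comply"
print(believed_reward.argmax())  # EU maximizer: picks action 2
```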
More broadly, here’s how I see the Story of Alignment so far. It’s been obvious to sufficiently penetrating thinkers for a long time that the deep future belongs to machine intelligence—that, as George Eliot put it in 1879, “the creatures who are to transcend and finally supersede us [will] be steely organisms, giving out the effluvia of the laboratory, and performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy.”
What’s less obvious is how much control we can exert over how that goes by setting the initial conditions. Can we arrange for the creatures who are to transcend and finally supersede us to be friendly and create the kind of world we would want, or will they murder us and tile the universe with something random?
Fifteen years ago, the problem looked hopeless, just from considering the vast complexity of human values. How would you write a computer program that values “happiness”, “freedom”, or “justice”, let alone everything else we want? It wasn’t clear how to build AI at all, but surely it would be easier to build some AI than a good AI. Humanity was doomed.
But now, after the decade of deep learning, the problem and its impossible solution seem to be arriving closer together than I would have ever dreamt. Okay, we still don’t know how to write down the human utility function, to be plugged in to an arbitrarily powerful optimizer.
But it’s increasingly looking like value isn’t that fragile if it’s specified in latent space, rather than a program that breaks if a single character is wrong—that there are ways to meaningfully shape the initial conditions of our world’s ascension that don’t take the exacting shape of “utility function + optimizer”.
We can leverage unsupervised learning on human demonstration data to do tasks the way humans do them, and we can use RLHF to elicit behavior we want in situations where we can’t write down our desires as an explicit reward or utility function. Crucially, by using these techniques together to compensate for each other’s safety and capability weaknesses, it seems feasible to build AI whose effects look “like humans, but faster”: performing with infallible exactness everything that we would have performed with a slovenly approximativeness and self-defeating inaccuracy. That doesn’t immediately bring about the superintelligence at the end of time—although it might look pretty immediate in sidereal time—but seems like a pretty good way to kick off our world’s ascension.
Is this story wrong? Maybe! … probably? My mother named me “Simplicia”, over my father’s objections, because of my unexpectedly low polygenic scores. I am aware of my … [she hesitates and coughs, as if choking on the phrase] learning disability. I’m not confident in any of this.
But if I’m wrong, there should be arguments explaining why I’m wrong—arguments that should convince scientists working in the field, even if I personally am too limited to understand them. I’ve tried to ground my case in my understanding of the state of the art, citing relevant papers when applicable.
In contrast, dismissing the entire field as hopeless on the basis of philosophy about “perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards” isn’t engaging with the current state of alignment, let alone all the further advances that humans and our non-superintelligent AIs will come up with before the end of days! Doomimir Doomovitch, with the fate of the lightcone in the balance, isn’t it more dignified to at least consider the possibility that someone else might have thought of something? Reply! Reply!
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
Let’s start with this, from Simplicia: “We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?”
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this claim of yours: “The iterative design loop hasn’t failed yet.”
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term, they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be misled by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al. paper just now: “as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but **actual preference of human raters when you show them the summaries** follows an inverted-U curve”.
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven’t put that much care into it.)
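The skeleton of that setup, as a minimal sketch (the fit/predict helpers are hypothetical stand-ins; the paper’s actual pipeline has many more moving parts):

```python
import numpy as np

def weak_to_strong_run(weak, strong, fit, predict,
                       X_sup, y_sup, X_train, X_eval, y_eval):
    fit(weak, X_sup, y_sup)               # weak supervisor trains on gold labels
    weak_labels = predict(weak, X_train)  # weak labels, systematic errors included
    fit(strong, X_train, weak_labels)     # strong student trains on weak labels
    # Because real ground truth was held out, we can score exactly the thing
    # the training labels themselves cannot tell us about:
    acc_weak = np.mean(predict(weak, X_eval) == y_eval)
    acc_strong = np.mean(predict(strong, X_eval) == y_eval)
    # "Weak-to-strong generalization" = acc_strong exceeding acc_weak;
    # "overfitting to weak supervision" = the student reproducing the
    # supervisor's predictable errors on X_eval.
    return acc_weak, acc_strong
```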
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
I mean, in real reality, Sydney literally threatened a journalist.
Putting “the burden of proof” aside, I think it would be great if someone stated more or less formally what evidence moves them how much toward which model. Because “pretraining makes it easier to exploit” is meaningless without numbers: the whole optimistic point is that it’s not overwhelmingly easier (as evidenced by RLHF’d systems not always exploiting users), and that the exploits become less catastrophic and more common-sense because of pretraining. So the question is not about the direction of the evidence, but whether it can overcome the observation that current systems mostly work.
“Current systems mostly work” not because of RLHF specifically, but because we are under conditions where the iterative design loop works—i.e., mainly, if our system is not aligned, it doesn’t kill us, so we can continue iterating until it has acceptable behaviour.
But iterative design works not only because we are not killed—it also wouldn’t work if acceptable behavior didn’t generalize at least somewhat from training. But it does generalize, so it’s possible that iteratively aligning a system under safe conditions would produce acceptable behavior even when the system can kill you. Or what is your evidence to the contrary? Like, does AutoGPT immediately kill you if you connect it to some robot via Python?
My evidence is how it is actually happening.
If you look at actual alignment development and ask yourself “what do I see, at the empirical level?”, you’ll get this scenario:
We reach a new level of capabilities
We get new types of alignment failures
If this alignment failure doesn’t kill everyone, we can fix it even by very dumb methods, like “RLHF against failure outputs”—but that doesn’t tell us anything about the kill-everyone level of capabilities.
I.e., I don’t expect AutoGPT to kill anyone, because AutoGPT is certainly not capable of doing this. But I do expect AutoGPT to have a bunch of failures that were unpredictable in advance.
Examples:
What happened to ChatGPT on release.
What happened to ChatGPT in a slightly unusual environment, despite all its alignment training.
It’s not capable under all conditions, but you can certainly prepare conditions under which AutoGPT can kill you: you can connect it to a robot arm with a knife, explain what commands do what, and tell it to proceed. And AutoGPT will not suddenly start trying to kill you just because it can, right?
Why doesn’t it? Fixing alignment failures under relatively safe conditions may fix them for other conditions too. Or why are you thinking about “kill-everyone” capabilities anyway—do you expect RLHF to work for arbitrary levels of capability as long as you don’t die during training? Like, if an ASI trained some weaker AI by RLHF in an environment where it could destroy an Earth or two, would it work?
Huh, it’s worse than I expected, thanks. And it even gets worse from GPT-3 to 4. But still—extrapolation from this requires quantification—after all, they did mostly fix it by using a different prompt. How do you decide whether it’s just evidence for “we need more finetuning”?
After thinking for a while, I decided that it’s better to describe the level of capability not as “capable of killing you” but as “lethal-by-default output”. I.e.,
If an ASI builds nanotech that self-replicates in a wide range of environments and doesn’t put in specific protections against it turning humans into gray goo, you are dead by default;
If an ASI optimizes the economy to get +1000% productivity without specific care for humans, everyone dies;
If an ASI builds a Dyson sphere without specific care for humans, see above;
A more nuanced example: imagine that you have an ASI smart enough to build a high-fidelity simulation of you inside of its cognitive process. Even if such an ASI doesn’t pursue any long-term goals, if it is not aligned to, say, respect your mental autonomy, any act of communication is going to turn into literal brainwashing.
The problem with the possibility of destroying a planet or two is how hard it is to contain a rogue ASI: if it is capable of destroying a planet, it’s capable of ejecting several von Neumann probes which can strike before we can come up with a defense, or of sending radio signals carrying computer viruses, harmful memes, or copies of the ASI. But I think that if you have an unhackable simulation indistinguishable from the real world, and you are somehow unhackable by the ASI, you can eventually align it with simple methods from modern prosaic alignment. The problem is that you can’t say in advance which kind of finetuning you need, because you need generalization in untested domains ahead of time.
Doomimir: No, it wouldn’t! Are you retarded?
Simplicia: [apologetically] Well, actually …
Doomimir: [embarrassed] I’m sorry, Simplicia Optimistovna; I shouldn’t have snapped at you like that.
[diplomatically] But I think you’ve grievously misunderstood what the KL penalty in the RLHF objective is doing. Recall that the Kullback–Leibler divergence D_KL(P||Q) represents how surprised you’d be by data from distribution P, when you expected it to come from distribution Q.
It’s asymmetric: it blows up when the data is very unlikely according to Q, which amounts to seeing something happen that you thought was nearly impossible, but not when the data is very unlikely according to P, which amounts to not seeing something that you thought was reasonably likely.
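Spelled out, with the expectation taken under $P$ (standard definition):

$$D_{\mathrm{KL}}(P\,\|\,Q) \;=\; \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$$

A point where $Q(x) \to 0$ while $P(x) > 0$ drives the expectation to infinity—you saw something you’d thought nearly impossible—whereas a point where $P(x) \to 0$ contributes nothing, however much probability $Q$ put on it—you merely failed to see something you’d expected.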
We—I mean, not we, but the maniacs who are hell-bent on destroying this world—include a D_KL(π_RLHF||π_base) penalty term in the RL objective because they don’t want the updated policy to output tokens that would be vanishingly unlikely coming from the base language model.
But your specific example of threats and promises isn’t vanishingly unlikely according to the base model! Common Crawl webtext is going to contain a lot of natural language reasoning about threats and promises! It’s true, in a sense, that the function of the KL penalty term is to “stay close” to the base policy. But you need to think about what that means mechanistically; you can’t just reason that the webtext prior is somehow “safe” in a way that means staying KL-close to it is safe.
But you probably won’t understand what I’m talking about for another 70 days.
See here. I haven’t dug into it much, but it does talk about the same general issues specifically in the context of RLHF’d LLMs, not just pure-RL-trained models.
(I’ll get around to another Doomimir response later, just dropping that link for now.)
My general problem with the “second type of generalization” is “how are you going to get superintelligence from here?” If your model imitates human thinking, its performance is capped by human performance, so you are not going to get things like nanotech and immortality.
To the question of malgeneralization, I have an example:
To extrapolate this to the MNIST example:
Imagine that you have two decks of cards: deck A always has 0 written on it, deck B always has 1. Then you mix the two decks, to get a deck A with 2⁄3 0s and 1⁄3 1s and vice versa. If you mix the decks perfectly randomly, your predictor of the next card from a deck is going to learn “always predict 0 for deck A and always predict 1 for deck B”, because optimal predictors do not randomize. When you test your predictor on the initial decks, it is going to get 100% accuracy.
But now suppose that you mixed the decks non-randomly: deck A is composed as 0-0-1, repeating (and mirrored for deck B). Then your predictor is going to learn “output 0 for the first and second card and 1 for every third card in deck A”, and fail miserably when tested on the initial decks.
You can say: “yes, obviously, if you train model to do wrong thing, it’s going to do wrong thing, nothing surprising”. But when you train superintelligence, you by definition don’t know which thing is “wrong”.
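A toy simulation of the deck example (my construction, just to make the mechanism concrete):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deck A after mixing: 2/3 zeros, 1/3 ones.
random_mix = rng.permutation([0] * 200 + [1] * 100)
patterned_mix = np.array([0, 0, 1] * 100)  # same frequencies, fixed 0-0-1 pattern
original_deck_a = np.zeros(300)            # the initial deck A: all zeros

# Trained on the random mix there is no positional signal, so the optimal
# (non-randomizing) predictor is the constant majority label...
const_prediction = int(np.mean(random_mix) > 0.5)        # -> 0
print(np.mean(const_prediction == original_deck_a))      # 1.0 on the original deck

# ...but trained on the patterned mix, position mod 3 perfectly predicts the
# label, so the model learns "1 on every third card" -- and is now wrong on
# every third card of the original all-zeros deck.
positional_prediction = np.array([1 if i % 3 == 2 else 0 for i in range(300)])
print(np.mean(positional_prediction == original_deck_a)) # ~0.67
```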
The linked abstract describes how “deep neural networks are capable of generalizing from training data for which true labels are massively outnumbered by incorrect labels.”
I will also point to OpenAI’s weak-to-strong results, where increasingly strong students keep improving generalization given labels from a fixed-size teacher. We just don’t live in a world where this issue is an obvious lethality. (EDIT: clarifying scope)
(This also demonstrates some problems with “faithfully learning a function” as a load-bearing teleological description of deep learning. I also remember a discussion of this potential failure mode in my 2022 post on diamond alignment, but I still think I didn’t need to focus too much on it.)
Reading their experimental procedure and looking at Figures 4 & 5, it looks like their experiments confirm the general story of lethality #20, not disprove it.
The relevant particulars: when they used biased noise, they still ensured that the correct label was the most probable label. Their upper-limit for biased noise made the second-most-probable label equal in probability to the correct one, and in that case the predictor’s generalization accuracy plummeted from near-90% (when the correct label was only slightly more probable than the next-most-probable) to only ~50%.
How this relates to lethality #20: part of what “regular, compactly describable, predictable errors” is saying is that there will be (predictable) cases where the label most probably assigned by a human labeller is not correct (i.e. it’s not what a smart well-informed human would actually want if they had all the relevant info and reflected on it). What the results of the linked paper predict, in that case, is that the net will learn to assign the “incorrect” label—the one which human labellers do, in fact, choose more often than any other. (Though, to be clear, I think this experiment is not very highly relevant one way or the other.)
As for OpenAI’s weak-to-strong results...
I had some back-and-forth about those in a private chat shortly after they came out, and the main thing I remember is that it was pretty tricky to back out the actually-relevant numbers, but it was possible. Going back to the chat log just now, this is the relevant part of my notes:
(Here “overfitting to weak supervision” is the thing where the weak supervisor is predictably wrong, and the stronger model learns to predict those errors.) So in fact what we’re seeing in the weak-to-strong paper is that the strong model learning the weak supervisor’s errors is already the main bottleneck to better ground-truth performance, in the regime that task and models were in.
So overall, I definitely maintain that the empirical evidence is solidly in favor of Doomimir’s story here. (And, separately, I definitely maintain that abstracts in ML tend to be wildly unreliable and misleading about the actual experimental results.)
“Confirm”? “Disprove”? Seems too aggressive, don’t you think?
Here’s your reasoning, as I understand it:
One experiment labels images of ones as “7” more often than “1” (using example digits here),
The AI learns to output “7” for images of ones if that was the majority label, and outputs “1” if that was the majority label,
This confirms the general story of lethality #20.
If this is accurate: I would argue that 1+2 do not entail 3 (as you seemed to initially claim, but then maybe backed off of in a sentence in the middle of your comment?)
Second, this is not avoidable, in a sense. As you are aware, there is no intrinsic meaning to the “outputs” of a network, there are just output slots and the English names which humans apply to those slots, and a way of comparing a slot prediction (“label”) and the assigned slot of an image.
Third, I think that the nontrivial prediction of #20 here is about “compactly describable errors.” “Mislabelling a large part of the time (but not most of the time)” is certainly a compactly describable error. You would then expect that as the probability of mistakes increased, you’d have a meaningful boost in generalization error, but that doesn’t happen. Easy Bayes update against #20.
As the student gets “smarter” (more compute), supervisor mistakes become less important as it learns to ignore them (8c).
This shows that, in this instance, larger models do not increasingly overfit the “compactly describable errors” of the weaker supervisor.
You’re free to think that, but FWIW I’d already read (and created flashcards for) the entirety of both papers when I posted my original message.
I indeed disagree with that, and I see two levels of mistake here. At the object level, there’s a mistake of not thinking through the gears. At the epistemic level, it looks like you’re trying to apply the “what would I have expected in advance?” technique of de-biasing, in a way which does not actually work well in practice. (The latter mistake I think is very common among rationalists.)
First, object-level: let’s walk through the gears of a mental model here. Model: train a model to predict labels for images, and it will learn a distribution of labels for each image (at least that’s how we usually train them). If we relabel 1s as 7s 20% of the time, then the obvious guess is that the model will assign about 20% probability (plus its “real underlying uncertainty”, which we’d expect to be small for large fully-trained models) to the label 7 when the digit is in fact a 1.
What does that predict about accuracy? That depends on whether the label we interpret our model as predicting is top-1, or sampled from the predictive distribution. If the former (as is usually used, and IIUC is used in the paper) then this concrete model would predict basically the curves we see in the paper: as noise ramps up, accuracy moves relatively little (especially for large fully-trained models), until the incorrect digit is approximately as probable as the correct digit, at which point accuracy plummets to ~50%. And once the incorrect digit is unambiguously more probable than the correct digit, accuracy drops to near-0.
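A toy version of that gears-level prediction (mine, not from the paper):

```python
def top1_accuracy_on_ones(p_mislabel):
    """Assume the trained model exactly recovers its training label
    distribution for images of 1s: P("7") = p_mislabel, P("1") = 1 - p_mislabel.
    Top-1 decoding then predicts whichever label is more probable."""
    if p_mislabel < 0.5:
        return 1.0   # "1" is still the most probable label: always right
    if p_mislabel == 0.5:
        return 0.5   # tie, broken at random: the ~50% plummet
    return 0.0       # "7" is now the most probable label: always wrong

for p in [0.0, 0.2, 0.4, 0.5, 0.6]:
    print(p, top1_accuracy_on_ones(p))
# Sampling from the predictive distribution instead would give 1 - p_mislabel:
# a linear degradation rather than a cliff.
```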
The point: when we think through the gears of the experimental setup, the obvious guess is that the curves are mostly a result of top-1 prediction (as opposed to e.g. sampling from the predictive distribution), in a way which pretty strongly indicates that accuracy would plummet to near-zero as the correct digit ceases to be the most probable digit. And thinking through the gears of Yudkowsky’s #20, the obvious update is that predictable human-labeller-errors which are not the most probable labels are not super relevant (insofar as we use top-1 sampling, i.e. near-zero temperature) whereas human-labeller-errors which are most probable are a problem in basically the way Yudkowsky is saying. (… insofar as we should update at all from this experiment, which we shouldn’t very much.)
Second, epistemic-level: my best guess is that you’re ignoring these gears because they’re not things whose relevance you would have anticipated in advance, and therefore focusing on them in hindsight risks bias[1]. Which, yes, it does risk bias.
Unfortunately, the first rule of experiments is You Are Not Measuring What You Think You Are Measuring. Which means that, in practice, the large majority of experiments which nominally attempt to test some model/theory in a not-already-thoroughly-understood-domain end up getting results which are mostly determined by things unrelated to the model/theory. And, again in practice, few-if-any people have the skill of realizing in advance which things will be relevant to the outcome of any given experiment. “Which things are we actually measuring?” is itself usually figured out (if it’s figured out at all) by looking at data from the experiment.
Now, this is still compatible with using the “what would I have expected in advance?” technique. But it requires that ~all the time, the thing I expect in advance from any given experiment is “this experiment will mostly measure some random-ass thing which has little to do with the model/theory I’m interested in, and I’ll have to dig through the details of the experiment and results to figure out what it measured”. If one tries to apply the “what would I have expected in advance?” technique, in a not-thoroughly-understood domain, without an overwhelming prior that the experimental outcome is mostly determined by things other than the model/theory of interest, then mostly one ends up updating in basically-random directions and becoming very confused.
Standard disclaimer about guessing what’s going on inside other people’s heads being hard, you have more data than I on what’s in your head, etc.
I think this is a reasonable prediction, but ends up being incorrect:
It decreases far faster than it should; on the top-1 theory, it should be ~flatlined for this whole graph (since for all α>0 the strict majority of labels are still correct). Certainly top-5 should not be decreasing.
This is in the data-constrained case, right?
Maybe noise makes training worse because the model can’t learn to just ignore it due to insufficient data? (E.g., making training more noisy means convergence/compute efficiency is lower.)
Also, does this decrease the size of the dataset by a factor of 5 in the uniform noise case? (Or did they normalize this by using a fixed set of labeled data and then just add additional noisy labels?)
Attributing all of these errors to overfitting implies that, if there were no overfitting, the strong student would get 100% accuracy on the subset where the weak model is wrong. But we have no reason to expect that. Instead, these errors are some mixture of overfitting and “just being dumb.”
Note that we should expect the strong and weak models to make somewhat correlated errors even when both are trained on gold labels, i.e. in the hypothetical case where overfitting to weak supervision is not possible. (The task examples vary in difficulty, the two models have various traits in common that could lead to shared “quirks,” etc.)
And indeed, when the weak and strong models use similar amounts of compute, they make very similar predictions—we see this in the upper-leftmost points on each line, which are especially noticeable in Fig 8c. In this regime, the hypothetical “what if we trained strong model on gold labels?” is ~equivalent to the weak model, so ~none of the strong model errors here can be attributed to “overfitting to weak supervision.”
As the compute ratio grows, the errors become both less frequent and less correlated. That’s the main trend we see in 8b and 8c. This reflects the strong model growing more capable, and thus making fewer “just being dumb” errors.
Fig 8 doesn’t provide enough information to determine how much the strong model is being held back by weak supervision at higher ratios, because it doesn’t show strong-trained-on-gold performance. (Fig. 3 does, though.)
IMO the strongest reason to be skeptical of (the relevance of) these results is in Appendix E, where they show that the strong model overfits a lot when it can easily predict the weak errors.
For a fixed weak teacher and increasingly strong students from a fixed model stack[1], I think you can probably avoid performance ever going down on most/typical tasks if you properly use early stopping, only use process-based feedback, and the model isn’t intentionally trying to perform poorly.
You might have instead expected performance to go up and then eventually go down with scale, but I think you can likely avoid this with early stopping (if you carefully find the right stopping point with scaling laws and analogous validation domains where we can ensure we get good labels or other methods of getting validation).
If I recall, I think we also see something similar in the scaling laws for reward model overoptimization work by Leo Gao (also done at OpenAI). (I think this is probably a more analogous case in most ways than the weak-to-strong results from OpenAI as far as understanding the dynamics of fitting to human errors.)
(Let’s put aside the case where the model intentionally tries to perform poorly. (I’m not even sure this case actually looks that different, but it certainly complicates the analysis. I’m doing some work on this case looking at model organisms of intentionally poor performance, and I expect that for these exact model organisms, we’ll probably see performance going up and then back down again with scale in at least some cases.))
(To be clear, I don’t think this “performance never goes down with correct early stopping” claim is totally obvious. It will depend on the exact rate at which AIs learn to predict errors vs. learn what the task is and how to do it, and on how these rates evolve with scale. If the sigmoid for error-learning rate vs. scale has a different midpoint and a different slope than the sigmoid for learning the task, you can absolutely have actual performance go down—see the toy model below.)
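A toy model of that race between the two sigmoids (parameter choices mine, purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

log_scale = np.linspace(0, 10, 11)
task_skill = sigmoid(log_scale - 3)       # learning the task: early midpoint
error_fit = sigmoid(2 * (log_scale - 6))  # predicting labeler errors: later, steeper

# Crude composite: performance on the *true* task is task skill, discounted as
# the model starts reproducing the supervisor's systematic mistakes.
true_perf = task_skill * (1 - 0.5 * error_fit)
print(np.round(true_perf, 2))  # rises, peaks, then declines with scale
```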
While I think you can avoid having the performance ever degrade via process-based feedback using scaling laws on most/typical tasks, I think the performance will also plateau due to how early you’ll end up needing to stop.
More concerningly, there might be alternatives to purely process-based human labels which don’t plateau in performance, seem to increase performance, and only rarely result in egregiously bad behavior. Most notably, various types of outcomes-based feedback might have this property.
As an example of this, I expect that you can create “hackable/exploitable” game environments to exhibit this. More specifically (a rough sketch in code follows after this list):
We’ll pretrain models on a collection of hackable game envs. We’ll train a model stack of variable training compute.
We’ll finetune these models in a new, different (but still hackable) Atari env where we expect to see transfer from the prior Atari envs.
It seems likely to me that, as models get smarter, if exploiting is ultimately a better strategy, final finetuned performance will go down even with early stopping.
You might be able to see this on some Atari games with added semi-realistic exploits? I’m unsure.
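A rough sketch of that experiment’s shape (every name here is a hypothetical stand-in, not an existing benchmark or API):

```python
def exploit_scaling_experiment(model_stack, pretrain_envs, transfer_env,
                               pretrain, finetune, true_task_score):
    """model_stack: (compute, model) pairs of increasing training compute;
    pretrain/finetune/true_task_score are hypothetical helpers."""
    results = []
    for compute, model in model_stack:
        pretrain(model, pretrain_envs)              # hackable games, for transfer
        finetune(model, transfer_env, early_stopping=True)
        # Score the *intended* task, not the env's exploitable reward signal:
        results.append((compute, true_task_score(model, transfer_env)))
    # The prediction at issue: if exploiting is ultimately the better strategy,
    # the true-task score eventually falls with compute, early stopping or no.
    return results
```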
As in, you just vary the training compute and set all other values optimally based on this.
(An aside: I edited my original comment to clarify that I was saying “These results show that #20 is not some obvious lethality which merits the name”, but I still certainly think that labeling mistakes can and will make some things go meaningfully wrong.)