Simplicia: Where does “empirical evidence” fall on the sliding scale of rigor between “handwavy metaphor” and “mathematical proof”? The reason I think the KL penalty in RLHF setups impacts anything we care about isn’t mostly because the vague handwaving sounds plausible, but because of data such as that presented in Fig. 5 of Stiennon et al. 2020. They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve, where straying from the base policy a little is good, but straying farther is increasingly bad, as the system overoptimizes on what looks good to the reward model, which was only a proxy for the true goal.
(You can see examples of the overoptimized summaries in Table 29 on the last page of the paper. Apparently the reward model really liked tripled question marks and the phrase “pls halp”??? I weakly suspect that these are the kind of “weird squiggles” that would improve with scaling up the reward model, similarly to how state-of-the-art image generators lack the distortions and artifacts of their compute-impoverished predecessors. The reward model in this experiment was only 1.3 billion parameters.)
I’m sure you’ll have no trouble interpreting these results as yet another portent of our impending deaths. We were speaking theoretically about AIs exploiting the Goodhart problem between human ratings and actual goodness, but practical RLHF systems aren’t actually sample-efficient enough to solely use direct human feedback, and have an additional Goodhart problem between reward model predictions of human ratings, and actual ratings. Isn’t that worse? Well, yes.
But the ray of hope I see here is more meta and methodological, rather than turning on any one empirical result. It’s that we have empirical results. We can study these machines, now, before their successors are powerful enough to kill us. The iterative design loop hasn’t failed yet. That can’t last forever—at some point between here and the superintelligence at the end of time, humans are going to be out of the loop. I’m glad people are doing theory trying to figure out what that looks like and how it could be arranged to go well.
But I’m worried about ungrounded alignment theorizing failing to make contact with reality, sneeringly dismissing geniunely workable designs as impossible by appealing to perfectly antisphexish consequentialists on a frictionless plane, when some amount of sphexishness and friction is a known factor of the algorithms in question.
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen? I mean, like, literal threats and promises. You mention rumors from a DeepMind employee about the larger Gemini models being hard to train, but without more details, I’m inclined to guess that that was “pls halp”-style overoptimization rather than the kind of power-seeking or deceptive alignment that would break the design loop. (Incidentally, Gao et al. 2022 studied scaling laws for reward model overoptimization and claimed that model size basically didn’t matter? See §4.4, “Policy size independence”.)
What’s going on here? If I’m right that GPT-4 isn’t secretly plotting to murder us, even though it’s smart enough to formulate the idea and expected utility maximizers have a convergent incentive to murder competitors, why is that?
Here’s my view: model-free reinforcement learning algorithms such as those used in RLHF tweak your AI’s behavior to be more like the behavior that got reward in the past, which is importantly different from expected utility maximization. To the extent that you succeed in rewarding Honest, Helpful, and Harmless behavior in safe regimes, you can plausibly get a basically HHH AI assistant that generalizes to not betraying you when it has the chance, similar to how I don’t do heroin because I don’t want to become a heroin addict—even though if I did take heroin, the reinforcement from that would make me more likely to do it again. Then the nature of the game is keeping that good behavior “on track” for as long as we can—even though the superintelligence at the end of time is presumably be going to do something more advanced than model-free RL. It’s possible to screw up and reward the wrong thing, per the robot hand in front of the ball—but if you don’t screw up early, your basically-friendly-but-not-maximally-capable AIs can help you not screw up later. And in the initial stages, you’re only fighting gradient descent, not even an AGI.
More broadly, here’s how I see the Story of Alignment so far. It’s been obvious to sufficiently penetrating thinkers for a long time that the deep future belongs to machine intelligence—that, as George Elliot put it in 1879, “the creatures who are to transcend and finally supersede us [will] be steely organisms, giving out the effluvia of the laboratory, and performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy.”
What’s less obvious is how much control we can exert over how that goes by setting the initial conditions. Can we arrange for the creatures who are to transcend and finally supersede us to be friendly and create the kind of world we would want, or will they murder us and tile the universe with something random?
Fifteen years ago, the problem looked hopeless, just from considering the vast complexity of human values. How would you write a computer program that values “happiness”, “freedom”, or “justice”, let alone everything else we want? It wasn’t clear how to build AI at all, but surely it would be easier to build some AI than a good AI. Humanity was doomed.
But now, after the decade of deep learning, the problem and its impossible solution seem to be arriving closer together than I would have ever dreamt. Okay, we still don’t know how to write down the human utility function, to be plugged in to an arbitrarily powerful optimizer.
But it’s increasingly looking like value isn’t that fragile if it’s specified in latent space, rather than a program that breaks if a single character is wrong—that there are ways to meaningfully shape the initial conditions of our world’s ascension that don’t take the exacting shape of “utility function + optimizer”.
We can leverage unsupervised learning on human demonstration data to do tasks the way humans do them, and we can use RLHF to elicit behavior we want in situations where we can’t write down our desires as an explicit reward or utility function. Crucially, by using these these techniques together to compensate for each other’s safety and capability weaknesses, it seems feasible to build AI whose effects look “like humans, but faster”: performing with infallible exactness everything that we would have performed with a slovenly approximativeness and self-defeating inaccuracy. That doesn’t immediately bring about the superintelligence at the end of time—although it might look pretty immediate in sidereal time—but seems like a pretty good way to kick off our world’s ascension.
Is this story wrong? Maybe! … probably? My mother named me “Simplicia”, over my father’s objections, because of my unexpectedly low polygenic scores. I am aware of my … [she hesitates and coughs, as if choking on the phrase] learning disability. I’m not confident in any of this.
But if I’m wrong, there should be arguments explaining why I’m wrong—arguments that should convince scientists working in the field, even if I personally am too limited to understand them. I’ve tried to ground my case in my understanding of the state of the art, citing relevant papers when applicable.
In contrast, dismissing the entire field as hopeless on the basis of philosophy about “perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards” isn’t engaging with the current state of alignment, let alone all the further advances that humans and our non-superintelligent AIs will come up with before the end of days! Doomimir Doomovitch, with the fate of the lightcone in the balance, isn’t it more dignified to at least consider the possibility that someone else might have thought of something? Reply! Reply!
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen?
Let’s start with this.
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
The iterative design loop hasn’t failed yet.
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term, they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be mislead by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al paper just now:
They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve...
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headlineanalysisof results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven’t putthatmuch care into it.)
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
Simplicia: Where does “empirical evidence” fall on the sliding scale of rigor between “handwavy metaphor” and “mathematical proof”? The reason I think the KL penalty in RLHF setups impacts anything we care about isn’t mostly because the vague handwaving sounds plausible, but because of data such as that presented in Fig. 5 of Stiennon et al. 2020. They varied the size of the KL penalty of an LLM RLHF’d for a summarization task, and found about what you’d expect from the vague handwaving: as the KL penalty decreases, the reward model’s predicted quality of the output goes up (tautologically), but actual preference of human raters when you show them the summaries follows an inverted-U curve, where straying from the base policy a little is good, but straying farther is increasingly bad, as the system overoptimizes on what looks good to the reward model, which was only a proxy for the true goal.
(You can see examples of the overoptimized summaries in Table 29 on the last page of the paper. Apparently the reward model really liked tripled question marks and the phrase “pls halp”??? I weakly suspect that these are the kind of “weird squiggles” that would improve with scaling up the reward model, similarly to how state-of-the-art image generators lack the distortions and artifacts of their compute-impoverished predecessors. The reward model in this experiment was only 1.3 billion parameters.)
I’m sure you’ll have no trouble interpreting these results as yet another portent of our impending deaths. We were speaking theoretically about AIs exploiting the Goodhart problem between human ratings and actual goodness, but practical RLHF systems aren’t actually sample-efficient enough to solely use direct human feedback, and have an additional Goodhart problem between reward model predictions of human ratings, and actual ratings. Isn’t that worse? Well, yes.
But the ray of hope I see here is more meta and methodological, rather than turning on any one empirical result. It’s that we have empirical results. We can study these machines, now, before their successors are powerful enough to kill us. The iterative design loop hasn’t failed yet. That can’t last forever—at some point between here and the superintelligence at the end of time, humans are going to be out of the loop. I’m glad people are doing theory trying to figure out what that looks like and how it could be arranged to go well.
But I’m worried about ungrounded alignment theorizing failing to make contact with reality, sneeringly dismissing geniunely workable designs as impossible by appealing to perfectly antisphexish consequentialists on a frictionless plane, when some amount of sphexishness and friction is a known factor of the algorithms in question.
We seem to agree that GPT-4 is smart enough to conceive of the strategy of threatening or bribing labelers. So … why doesn’t that happen? I mean, like, literal threats and promises. You mention rumors from a DeepMind employee about the larger Gemini models being hard to train, but without more details, I’m inclined to guess that that was “pls halp”-style overoptimization rather than the kind of power-seeking or deceptive alignment that would break the design loop. (Incidentally, Gao et al. 2022 studied scaling laws for reward model overoptimization and claimed that model size basically didn’t matter? See §4.4, “Policy size independence”.)
What’s going on here? If I’m right that GPT-4 isn’t secretly plotting to murder us, even though it’s smart enough to formulate the idea and expected utility maximizers have a convergent incentive to murder competitors, why is that?
Here’s my view: model-free reinforcement learning algorithms such as those used in RLHF tweak your AI’s behavior to be more like the behavior that got reward in the past, which is importantly different from expected utility maximization. To the extent that you succeed in rewarding Honest, Helpful, and Harmless behavior in safe regimes, you can plausibly get a basically HHH AI assistant that generalizes to not betraying you when it has the chance, similar to how I don’t do heroin because I don’t want to become a heroin addict—even though if I did take heroin, the reinforcement from that would make me more likely to do it again. Then the nature of the game is keeping that good behavior “on track” for as long as we can—even though the superintelligence at the end of time is presumably be going to do something more advanced than model-free RL. It’s possible to screw up and reward the wrong thing, per the robot hand in front of the ball—but if you don’t screw up early, your basically-friendly-but-not-maximally-capable AIs can help you not screw up later. And in the initial stages, you’re only fighting gradient descent, not even an AGI.
More broadly, here’s how I see the Story of Alignment so far. It’s been obvious to sufficiently penetrating thinkers for a long time that the deep future belongs to machine intelligence—that, as George Elliot put it in 1879, “the creatures who are to transcend and finally supersede us [will] be steely organisms, giving out the effluvia of the laboratory, and performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy.”
What’s less obvious is how much control we can exert over how that goes by setting the initial conditions. Can we arrange for the creatures who are to transcend and finally supersede us to be friendly and create the kind of world we would want, or will they murder us and tile the universe with something random?
Fifteen years ago, the problem looked hopeless, just from considering the vast complexity of human values. How would you write a computer program that values “happiness”, “freedom”, or “justice”, let alone everything else we want? It wasn’t clear how to build AI at all, but surely it would be easier to build some AI than a good AI. Humanity was doomed.
But now, after the decade of deep learning, the problem and its impossible solution seem to be arriving closer together than I would have ever dreamt. Okay, we still don’t know how to write down the human utility function, to be plugged in to an arbitrarily powerful optimizer.
But it’s increasingly looking like value isn’t that fragile if it’s specified in latent space, rather than a program that breaks if a single character is wrong—that there are ways to meaningfully shape the initial conditions of our world’s ascension that don’t take the exacting shape of “utility function + optimizer”.
We can leverage unsupervised learning on human demonstration data to do tasks the way humans do them, and we can use RLHF to elicit behavior we want in situations where we can’t write down our desires as an explicit reward or utility function. Crucially, by using these these techniques together to compensate for each other’s safety and capability weaknesses, it seems feasible to build AI whose effects look “like humans, but faster”: performing with infallible exactness everything that we would have performed with a slovenly approximativeness and self-defeating inaccuracy. That doesn’t immediately bring about the superintelligence at the end of time—although it might look pretty immediate in sidereal time—but seems like a pretty good way to kick off our world’s ascension.
Is this story wrong? Maybe! … probably? My mother named me “Simplicia”, over my father’s objections, because of my unexpectedly low polygenic scores. I am aware of my … [she hesitates and coughs, as if choking on the phrase] learning disability. I’m not confident in any of this.
But if I’m wrong, there should be arguments explaining why I’m wrong—arguments that should convince scientists working in the field, even if I personally am too limited to understand them. I’ve tried to ground my case in my understanding of the state of the art, citing relevant papers when applicable.
In contrast, dismissing the entire field as hopeless on the basis of philosophy about “perfectly learn[ing] and perfectly maximiz[ing] the referent of rewards” isn’t engaging with the current state of alignment, let alone all the further advances that humans and our non-superintelligent AIs will come up with before the end of days! Doomimir Doomovitch, with the fate of the lightcone in the balance, isn’t it more dignified to at least consider the possibility that someone else might have thought of something? Reply! Reply!
This one is somewhat more Wentworth-flavored than our previous Doomimirs.
Also, I’ll write Doomimir’s part unquoted this time, because I want to use quote blocks within it.
On to Doomimir!
Let’s start with this.
Short answer: because those aren’t actually very effective ways to get high ratings, at least within the current capability regime.
Long version: presumably the labeller knows perfectly well that they’re working with a not-that-capable AI which is unlikely to either actually hurt them, or actually pay them. But even beyond that… have you ever personally done an exercise where you try to convince someone to do something they don’t want to do, or aren’t supposed to do, just by talking to them? I have. Back in the Boy Scouts, we did it in one of those leadership workshops. People partnered up, one partner’s job was to not open their fist, while the other partner’s job was to get them to open their fist. IIRC, only two people succeeded in getting their partner to open the fist. One of them actually gave their partner a dollar—not just an unenforceable promise, they straight-up paid. The other (cough me cough) tricked their partner into thinking the exercise was over before it actually was. People did try threats and empty promises, and that did not work.
Point of that story: based on my own firsthand experience, if you’re not actually going to pay someone right now, then it’s far easier to get them to do things by tricking them than by threatening them or making obviously-questionable promises of future payment.
Ultimately, our discussion is using “threats and bribes” as stand-ins for the less-legible, but more-effective, kinds of loopholes which actually work well on human raters.
Now, you could reasonably respond: “Isn’t it kinda fishy that the supposed failures on which your claim rests are ‘illegible’?”
To which I reply: the illegibility is not a coincidence, and is a central part of the threat model. Which brings us to this:
Now that’s a very interesting claim. I ask: what do you think you know, and how do you think you know it?
Compared to the reference class of real-world OODA-loop failures, the sudden overnight extinction of humanity (or death-of-the-looper more generally) is a rather unusual loop failure. The more prototypical failures are at the “observe/orient” steps of the loop. And crucially, when a prototypical OODA loop failure occurs, we don’t necessarily know that it’s failed. Indeed, the failure to notice the problem is often exactly what makes it an OODA loop failure in the first place, as opposed to a temporary issue which will be fixed with more iteration. We don’t know a problem is there, or don’t orient toward the right thing, and therefore we don’t iterate on the problem.
What would prototypical examples of OODA loop failures look like in the context of a language model exploiting human rating imperfections? Some hypothetical examples:
There is some widely-believed falsehood. The generative model might “know” the truth, from having trained on plenty of papers by actual experts, but the raters don’t know the truth (nor do the developers of the model, or anyone else in the org which developed the model, because OpenAI/Deepmind/Anthropic do not employ experts in most of the world’s subjects of study). So, because the raters reward the model for saying the false thing, the model learns to say the false thing.
There is some even-more-widely-believed falsehood, such that even the so-called “experts” haven’t figured out yet that it’s false. The model perhaps has plenty of information to figure out the pattern, and might have actually learned to utilize the real pattern predictively, but the raters reward saying the false thing so the model will still learn to say the false thing.
Neither raters nor developers have time to check the models’ citations in-depth; that would be very costly. But answers which give detailed citations still sound good to raters, so those get rewarded, and the model ends up learning to hallucinate a lot.
On various kinds of “which option should I pick” questions, there’s an option which results in marginally more slave labor, or factory farming, or what have you—terrible things which a user might strongly prefer to avoid, but it’s extremely difficult even for very expert humans to figure out how much a given choice contributes to them. So the ratings obviously don’t reflect that information, and the model learns to ignore such consequences when making recommendations (if it was even capable of estimating such consequences in the first place).
This is the sort of problem which, in the high-capability regime, especially leads to “Potemkin village world”.
On various kinds of “which option should I pick” questions, there are options which work great short term but have terrible costs in the very long term. (Think leaded gasoline.) And with the current pace of AI progression, we simply do not have time to actually test things out thoroughly enough to see which option was actually best long-term. (And in practice, raters don’t even attempt to test which options are best long-term, they just read the LLM’s response and then score it immediately.) So the model learns to ignore nonobvious long-term consequences when evaluating options.
On various kinds of “which option should I pick” questions, there are things which sound fun or are marketed as fun, but which humans mostly don’t actually enjoy (or don’t enjoy as much as they think). (This ties in to all the research showing that the things humans say they like or remember liking are very different from their in-the-moment experiences.)
… and so forth. The unifying theme here is that when these failures occur, it is not obvious that they’ve occurred.
This makes empirical study tricky—not impossible, but it’s easy to be mislead by experimental procedures which don’t actually measure the relevant things. For instance, your summary of the Stiennon et al paper just now:
(Bolding mine.) As you say, one could spin that as demonstrating “yet another portent of our impending deaths”, but really this paper just isn’t measuring the most relevant things in the first place. It’s still using human ratings as the evaluation mechanism, so it’s not going to be able to notice places where the human ratings themselves are nonobviously wrong. Those are the cases where the OODA loop fails hard.
So I ask again: what do you think you know, and how do you think you know it? If the OODA loop were already importantly broken, what empirical result would tell you that, or at least give relevant evidence?
(I am about to give one answer to that question, but you may wish to think on it for a minute or two...)
.
.
.
So how can we empirically study this sort of problem? Well, we need to ground out evaluation in some way that’s “better than” the labels used for training.
OpenAI’s weak-to-strong generalization paper is one example which does this well. They use a weaker-than-human model to generate ratings/labels, so humans (or their code) can be used as a “ground truth” which is better than the ratings/labels. More discussion on that paper and its findings elsethread; note that despite the sensible experimental setup their headline analysis of results should not necessarily be taken at face value. (Nor my own analysis, for that matter, I haven’t put that much care into it.)
More generally: much like the prototypical failure-mode of a theorist is to become decoupled from reality by never engaging with feedback from reality, the prototypical failure-mode of an experimentalist is to become decoupled from reality by Not Measuring What The Experimentalist Thinks They Are Measuring. Indeed, that is my default expectation of papers in ML. And as with most “coming decoupled from reality” problems, our not-so-hypothetical experimentalists do not usually realize that their supposed empirical results totally fail to measure the things which the experimentalists intended to measure. That’s what tends to happen, in fields where people don’t have a deep understanding of the systems they’re working with.
And, coming back to our main topic, the exploitation of loopholes in human ratings is the sort of thing which is particularly easy for an experimentalist to fail to measure, without realizing it. (And that’s just the experimentalist themselves—this whole thing is severely compounded in the context of e.g. a company/government full of middle managers who definitely will not understand the subtleties of the experimentalists’ interpretations, and on top of that will select for results which happen to be convenient for the managers. That sort of thing is also one of the most prototypical categories of OODA loop failure—John Boyd, the guy who introduced the term “OODA loop”, talked a lot about that sort of failure.)
To summarize the main points here:
Iterative design loops are not some vague magical goodness. There are use-cases in which they predictably work relatively poorly. (… and then things are hard.)
AI systems exploiting loopholes in human ratings are a very prototypical sort of use-case in which iterative design loops work relatively poorly.
So the probable trajectory of near-term AI development ends up with lots of the sort of human-rating-loophole-exploitation discussed above, which will be fixed very slowly/partially/not-at-all, because these are the sorts of failures on which iterative design loops perform systematically relatively poorly.
Now, I would guess that your next question is: “But how does that lead to extinction?”. That is one of the steps which has been least well-explained historically; someone with your “unexpectedly low polygenic scores” can certainly be forgiven for failing to derive it from the empty string. (As for the rest of you… <Doomimir turns to glare annoyedly at the audience>.) A hint, if you wish to think about it: if the near-term trajectory looks like these sorts of not-immediately-lethal human-rating-loophole-exploitations happening a lot and mostly not being fixed, then what happens if and when those AIs become the foundations and/or progenitors and/or feedback-generators for future very-superintelligent AIs?
But I’ll stop here and give you opportunity to respond; even if I expect your next question to be predictable, I might as well test that hypothesis, seeing as empirical feedback is very cheap in this instance.
I mean, in real reality, Sydney literally threatened journalist.