On the lethality of biased human reward ratings
I’m rereading the List of Lethalities carefully and considering what I think about each point.
I think I strongly don’t understand #20, and I thought that maybe you could explain what I’m missing?
20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map—about the environment, not the optimizer—that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
I think that I don’t understand this.
(Maybe one concrete thing that would help is just having examples.)
One thing that this could be pointing towards is the problem of what I’ll call “dynamic feedback schemes”, like RLHF. The key feature of a dynamic feedback scheme is that the AI system is generating outputs and a human rater is giving it feedback to reinforce good outputs and anti-reinforce bad outputs.
The problem with schemes like this is that there is adverse selection for outputs that look good to the human rater but are actually bad. This means that, in the long run, you’re reinforcing initially accidental misrepresentation and shaping it into more and more sophisticated deception (because you anti-reinforce all the cases of misrepresentation that are caught out, and reinforce all the ones that aren’t).
That seems very bad for our prospects of not ending up in a world where all the metrics look great but the underlying reality is awful or hollow, as Paul describes in Part I of What Failure Looks Like.
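To make that selection pressure concrete, here’s a minimal toy simulation (entirely my own illustration with made-up numbers, not a model of any real RLHF setup): each behavior has an “actual quality” and a “deceptiveness”, the rater only penalizes deception when it happens to get caught, and repeatedly reinforcing the top-rated behaviors drifts the population toward uncaught deception.

```python
# Toy model of adverse selection under a fallible rater (illustrative only).
import random

random.seed(0)

POP, GENS, CATCH_PROB = 200, 30, 0.3

# Each behavior: (actual_quality, deceptiveness), both in [0, 1].
pop = [(random.random(), random.random()) for _ in range(POP)]

def rater_score(actual, deceptive):
    """Score as seen by a fallible rater: deception helps unless it's caught."""
    if deceptive > 0 and random.random() < CATCH_PROB:
        return actual - deceptive   # caught misrepresentation gets anti-reinforced
    return actual + deceptive       # uncaught misrepresentation looks like quality

for _ in range(GENS):
    survivors = sorted(pop, key=lambda b: rater_score(*b), reverse=True)[: POP // 2]
    # "Reinforce" the top half: copy each survivor twice with small mutations.
    pop = [
        (min(1, max(0, a + random.gauss(0, 0.05))),
         min(1, max(0, d + random.gauss(0, 0.05))))
        for a, d in survivors for _ in range(2)
    ]

print("avg actual quality: %.2f" % (sum(a for a, _ in pop) / POP))
print("avg deceptiveness:  %.2f" % (sum(d for _, d in pop) / POP))
```

With a 30% catch rate, the expected payoff of deceptiveness is still positive in this toy model, so it gets reinforced right alongside actual quality.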
It seems like maybe you could avoid this with a static feedback regime, where you take a bunch of descriptions of outcomes, maybe procedurally generated, maybe from fiction, maybe from news reports, whatever, and have humans score those outcomes on how good they are, to build a reward model that can be used for training. As long as the ratings don’t get fed back into the generator, there’s not much systematic incentive towards training deception.
...Actually, on reflection, I suppose this just pushes the problem back one step. Now you have a reward model which is giving feedback to some AI system that you’re training. And the AI system will learn to adversarially game the reward model in the same way that it would have gamed the human.
That seems like a real problem, but it also doesn’t seem like what this point from the list is trying to get at. It seems to be saying something more like “the reward model is going to be wrong, because there’s going to be systematic biases in the human ratings.”
Which, fair enough, that seems true, but I don’t see why that’s lethal. It seems like the reward model will be wrong in some places, and we would lose value in those places. But why does the reward model need to be an exact, high fidelity representation, across all domains, in order to not kill us? Why is a reward model that’s a little off, in a predictable direction, catastrophic?
First things first:
What you’re calling the “dynamic feedback schemes” problem is indeed a lethal problem which I think is not quite the same as Yudkowsky’s #20, as you said.
“there’s going to be systematic biases in the human ratings” is… technically correct, but I think a misleading way to think of things, because the word “bias” usually suggests data which is approximately-correct but just a little off. The problem here is that human ratings are predictably spectacularly far off from what humans actually want in many regimes.
(More general principle which is relevant here: Goodhart is about generalization, not approximation. Approximations don’t have a Goodhart problem, as long as the approximation is accurate everywhere.)
So the reward model doesn’t need to be an exact, high-fidelity representation. An approximation is fine, “a little off” is fine, but it needs to be approximately-correct everywhere.
(There are actually some further loopholes here—in particular the approximation can sometimes be more wrong in places where both the approximation and the “actual” value function assign very low reward/utility, depending on what kind of environment we’re in and how capable the optimizer is.)
(There’s also a whole aside we could go into about what kind of transformations can be applied while maintaining “correctness”, but I don’t think that’s relevant at this point. Just want to flag that there are some degrees of freedom there as well.)
I expect we’ll mainly want to talk about examples in which human ratings are spectacularly far off from what humans actually want. Before that, do the above bullets make sense (insofar as they seem relevant), and are there any other high-level points we should hit before getting to examples?
I expect we’ll mainly want to talk about examples in which human ratings are spectacularly far off from what humans actually want.
That’s right!
I’m not sure how important each point is, or if we need to go into them for the high level question, but here are my responses:
What you’re calling the “dynamic feedback schemes” problem is indeed a lethal problem which I think is not quite the same as Yudkowsky’s #20, as you said.
I’m not super clear on why this problem is lethal per se.
I suppose that if you’re training a system to want to do what looks good, at the expense of what is actually good, you’re training it to, e.g., kill everyone who might interfere with its operation, and then spoof the sensors to make it look like those humans are alive and happy. Like, that’s the behavior that optimizes the expected value of “look like you’re well-behaved.”
That argument feels like “what the teacher would say” and not “this is obviously true” based on my inside view right now.
Fleshing it out for myself a little: Training something to care about what its outputs look like to [some particular non-omniscient observer] is a critical failure, because at high capability levels, the obvious strategy for maxing out that goal is to seize the sensors and optimize what they see really hard, and control the rest of the universe so that nothing else impacts what the sensors see.
But, when you train with RLHF, you’re going to be reinforcing a mix of “do what looks good” and “do what is actually good”. Some of “do what’s actually good” will make it into the AI’s motivation system, and that seems like it cuts against such ruthless supervillain-y plans as taking control over the whole world to spoof some sensors.
(More general principle which is relevant here: Goodhart is about generalization, not approximation. Approximations don’t have a Goodhart problem, as long as the approximation is accurate everywhere.)
Yeah, you’ve said this to me before, but I don’t really grok it yet.
It sure seems like lots of Goodhart is about approximation!
Like when I was 20 I decided to take as a metric to optimize “number of pages read per day”, and this predictably caused me to shift to reading lighter and faster-to-read books. That seems like an example of Goodhart that isn’t about generalization. The metric just imperfectly captured what I cared about, and so when I optimized it, I got very little of what I cared about. But I wouldn’t describe that as “this metric failed to generalize to the edge-case domain of light, easy-to-read books.” Would you?
Approximations don’t have a Goodhart problem, as long as the approximation is accurate everywhere.
This sentence makes it seem like you mean something quite precise by “approximation”.
Like if you have a function f(x), and another function a(x) that approximates it, you’re comfortable calling it an approximation if |a(x) - f(x)| <= C for every x, for some “reasonably-sized” constant C, or something like that.
But I get the point that the dangerous kind of Goodhart, at least, is not when your model is off a little bit, but when your model veers wildly away from the ground truth in some region of the domain, because there aren’t enough datapoints to pin down the model in that region.
So the reward model doesn’t need to be an exact, high-fidelity representation. An approximation is fine, “a little off” is fine, but it needs to be approximately-correct everywhere.
This seems true in spirit at least, though I don’t know if it is literally true. Like, there are some situations that are so unlikely to be observed that it doesn’t matter how the approximation-of-values generalizes there.
But, yeah, a key point about developing powerful AGI is that you can’t predict what kind of crazy situations it / we will find ourselves in, after major capability gains that enable new options that were not previously available (or even conceived of). We need the motivation system of an AI to correctly generalize (match what we actually want) in those very weird-to-us and unpredictable-in-advance situations.
Which rounds up to “we need the model to generalize approximately-correctly everywhere.”
That was some nitpicking, but there’s a basic idea that I buy, which is “having your AI’s model of what’s good be approximate is probably fine. But there’s a big problem if the approximation swings wildly from the ground truth in some regions of the space of actions/outcomes.”
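Here’s a tiny numerical sketch of that distinction, with made-up utility functions (mine, purely for intuition): a proxy that is uniformly a little off still sends hard optimization to roughly the right place, while a proxy that only matches the true values on the “training” region gets Goodharted badly.

```python
# Approximation vs. generalization under hard optimization (illustrative functions only).
import numpy as np

xs = np.linspace(0, 10, 2001)
true_value = lambda x: np.exp(-(x - 2.0) ** 2)   # what we actually want; peaks at x = 2

# Off by at most 0.05 everywhere:
uniform_approx = lambda x: true_value(x) + 0.05 * np.sin(7 * x)
# Matches the truth on the "training" region x < 3, veers off outside it:
bad_generalizer = lambda x: np.where(x < 3, true_value(x), 0.2 * (x - 3))

for name, proxy in [("a-little-off-everywhere", uniform_approx),
                    ("wrong-out-of-distribution", bad_generalizer)]:
    x_star = xs[np.argmax(proxy(xs))]   # optimize the proxy hard over the whole domain
    print(f"{name}: picks x = {x_star:.2f}, true value there = {true_value(x_star):.3f}")
```

The first proxy loses a tiny bit of value; the second sends the optimizer straight into the region where it is most wrong.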
Alright, we’re going to start with some examples which are not central “things which kill us”, but are more-familiar everyday things intended to build some background intuition. In particular, the intuition I’m trying to build is that ratings given by humans are a quite terrible proxy for what we want, and we can already see lots of evidence of that (in quantitatively smaller ways than the issues of alignment) in everyday life.
Let’s start with a textbook: Well-Being: The Foundations of Hedonic Psychology. The entire first five chapters (out of 28) are grouped in a section titled “How can we know who is happy? Conceptual and methodological issues”. It’s been a while since I’ve read it, but the high-level takeaway I remember is: we can measure happiness a bunch of different ways, and they just don’t correlate all that well. (Not sure which of the following examples I got from the textbook vs elsewhere, but this should give the general gestalt impression...) Ask people how happy they are during an activity, and it will not match well how happy they remember being after-the-fact, or how happy they predict being beforehand. Ask about different kinds of happiness—like e.g. in-the-moment enjoyment or longer-term satisfaction—and people will give quite different answers. “Mixed feelings” are a thing—e.g. (mental model not from the book) people have parts, and one part may be happy about something while another part is unhappy about that same thing. Then there’s the whole phenomenon of “wanting to want”, and the relationship between “what I want” and “what I am ‘supposed to want’ according to other people or myself”. And of course, people have generally-pretty-terrible understanding of which things will or will not cause them to be happy.
I expect these sorts of issues to be a big deal if you optimize for humans’ ratings “a little bit” (on a scale where “lots of optimization” involves post-singularity crazy stuff). Again, that doesn’t necessarily get you human extinction, but I imagine it gets you something like The Good Place. (One might reasonably reply: wait, isn’t that just a description of today’s economy? To which I say: indeed, the modern economy has put mild optimization pressure on humans’ ratings/predicted-happiness/remembered-happiness/etc, in ways which have made most humans “well-off” visibly, but often leave people not-that-happy in the moment-to-moment most of the time, and not-that-satisfied longer term.)
Some concrete examples of this sort of thing:
I go to a theme park. Afterward, I remember various cool moments (e.g. on a roller coaster), as well as waiting in long lines. But while the lines were 95% of the time spent at the park, they’re like 30% of my memory.
Peoples’ feelings about sex tend to be an absolute mess of (1) a part of them which does/doesn’t want the immediate experience, (2) a part of them which does/doesn’t want the experience of a relationship or flirting or whatever around sex, (3) a part of them which does/doesn’t want a certain identity/image about their sexuality, (4) a part of them which wants-to-want (or wants-to-not-want) sex, (5) a part of them which mostly cares about other peoples’ opinions of their own sexual activity, (6...) etc.
Hangriness is a thing, and lots of people don’t realize when they’re hangry.
IIRC, it turns out that length of daily commute has a ridiculously outsized impact on peoples’ happiness, compared to what people expect.
On the other hand, IIRC, things like the death of a loved one or a crippling injury usually have much less impact on long-term happiness than people expect.
As an aside: at some point I’d like to see an Applied Fun Theory sequence on LW. Probably most of the earlier part would focus on “how to make your understanding of what-makes-you-happy match what-actually-makes-you-happy”, i.e. avoiding the sort of pitfalls above.
Ok, next on to some examples of how somewhat stronger optimization goes wrong...
[My guess is that I’m bringing in a bunch of separate confusions here, and that we’re going to have to deal with them one at a time. Probably my response here is a deviation from the initial discussion, and maybe we want to handle it separately. ]
So “happiness” is an intuitive concept (and one that is highly relevant to the question of what makes a good world), which unfortunately breaks down under even a small amount of pressure / analysis.
On the face of it, it seems that we would have to do a lot of philosophy (including empirical science as part of “philosophy”) to have a concept of happiness, or maybe a constellation of more cleanly defined concepts, and their relationships and relative value-weightings, or something even less intuitive, that we could rest on for having a clear conception of a good world.
But do we need that?
Suppose I just point to my folk concept of happiness, by giving a powerful AI a trillion examples of situations that I would call “happy”, including picnics, going to waterparks, going camping (and enjoying it), working hard on a project, watching a movie with friends, and reading a book on a rainy day, etc. (including a thousand edge cases that I’m not clever enough to think up right now, and some nearby examples that are not fun, like “going camping and hating it”). Does the AI pick up on the commonalities and learn a “pretty good” concept of happiness, that we can use?
It won’t learn precisely my concept of happiness. But as you point out, that wasn’t even a coherent target to begin with. I don’t have a precise concept of happiness to try and match precisely. What I actually have is a fuzzy cloud of a concept, which, for its fuzziness, is a pretty good match for a bunch of possible conceptions that the AI could generate.
...Now, I guess what you’ll say is that if you try to optimize hard on that “pretty good” concept, we’ll goodhart until all of the actual goodness is drained out of it.
And I’m not sure if that’s true. What we end up with will be hyper-optimized, and so it will be pretty weird, but I don’t have a clear intuition about whether or not the result will still be recognizably good to me.
It seems like maybe a trillion data points is enough that any degrees of freedom that are left are non-central to the concept you’re wanting to triangulate, even as you enter a radically new distribution.
For instance, if you give an AI a trillion examples of happy humans, and it learns a concept of value such that it decides that it is better if the humans are emulations, I’m like “yeah, seems fine.” The ems are importantly different from biological humans, but the difference is orthogonal to the value of their lives (I think). People having fun is people having fun, regardless of their substrate.
Whereas if the AI learns a concept of value, which, when hyper-optimized, creates a bunch of p-zombie humans going through the motions of having fun, but without anyone “being home” to enjoy it, I would respond with horror. The axis of consciousness vs not, unlike the axis of substrate, is extremely value relevant.
It seems possible that if you have enough datapoints, and feed them into a very smart Deep Learning AGI classifier, those datapoints triangulate a “pretty good” concept that doesn’t have any value-relevant degrees of freedom left. All the value-relevant axes, all the places where we would be horrified if they got goodharted away in the hyper-optimization, are included in the AGI’s value concept.
And that can still be true even if our own concept of value is pretty fuzzy and unclear.
Like metaphorically, it seems like we’re not trying to target a point in the space of values. We’re trying to bound a volume. And if you have enough data points, you can bound the volume on every important dimension.
Ok, that hit a few interesting points, let’s dig into them before we get to the more deadly failure modes.
Suppose I just point to my folk concept of happiness, by giving a powerful AI a trillion examples of situations that I would call “happy”, including picnics, and going to waterparks, and camping, and working hard on a project, and watching a movie with friends, and reading a book on a rainy day, etc (including a thousand edge cases that I’m not clever enough to think up right now). Does the AI pick up on the commonalities and learn a “pretty good” concept of happiness, that we can use?
This is going to take some setup.
Imagine that we train an AI in such a way that its internal cognition is generally structured around basically similar concepts to humans. Internal to this AI, there are structures basically similar to human concepts which can be “pointed at” (in roughly the sense of a pointer in a programming language), which means they’re the sorts of things which can be passed into an internal optimization process (e.g. planning), or an internal inference process (e.g. learning new things about the concept), or an internal communication process (e.g. mapping a human word to the concept), etc. Then we might further imagine that we could fiddle with the internals of this AI’s mind to set that concept as the target of some planning process which drives the AI’s actions, thereby “aligning” the AI to the concept.
When I talk about what it means for an AI to be “aligned” to a certain concept, that’s roughly the mental model I have in mind. (Note that I don’t necessarily imagine that’s a very good description of the internals of an AI; it’s just the setting in which it’s most obvious what I even mean by “align”.)
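To make sure I’m conveying the mental model, here’s a deliberately silly Python caricature (not a claim about real AI internals; all the names and scoring functions are made up): the “concepts” are just internal structures that can be pointed at, and “aligning” the agent means choosing which one its internal search targets.

```python
# Caricature of "align the AI to one of its internal concepts" (illustrative only).
from typing import Callable, Dict, List, Optional

Concept = Callable[[str], float]   # scores a (modeled) outcome

class ToyAgent:
    def __init__(self, concepts: Dict[str, Concept]):
        self.concepts = concepts               # internal, pointable concept-structures
        self.target: Optional[Concept] = None  # which concept planning optimizes for

    def align_to(self, concept_name: str) -> None:
        """'Fiddle with the internals': point the planner at one internal concept."""
        self.target = self.concepts[concept_name]

    def plan(self, options: List[str]) -> str:
        """Internal search: pick the option the targeted concept scores highest."""
        assert self.target is not None, "has concepts, but isn't aligned to any of them"
        return max(options, key=self.target)

# Hypothetical learned concepts; the agent can have both without targeting either.
agent = ToyAgent({
    "human_happiness": lambda o: 1.0 if "humans enjoying" in o else 0.0,
    "rating_process":  lambda o: 1.0 if "thumbs-up recorded" in o else 0.0,
})
agent.align_to("human_happiness")
print(agent.plan(["humans enjoying a picnic", "thumbs-up recorded, humans absent"]))
```

The point of the caricature is just the distinction between having a concept (it’s in `agent.concepts`) and being aligned to it (it’s the planner’s `target`).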
With that mental model in mind, back to the question: if we give an AI a trillion examples of situations you would call happy, does the AI pick up on the commonalities and learn a “pretty good” concept of happiness, that we can use? Well, I definitely imagine that the AI would end up structuring some of its cognition around a roughly-human-like concept (or concepts) of happiness. (And we wouldn’t need a trillion examples for that, or even any labelled examples at all—unsupervised training would achieve that goal just fine.) But that doesn’t mean that any internal planning process uses that concept as a target; the AI isn’t necessarily aligned to the concept.
So for questions like “Does the AI learn a human-like concept of happiness”, we need to clarify whether we’re asking:
Is some of the AI’s internal cognition structured around a human-like concept of happiness, especially in a way that supports something-like-internal-pointers-to-the-concept?
Is there an internal planning/search process which uses the concept as a target, and then drives the AI’s behavior accordingly?
I would guess “yes” for the former, and “no” for the latter. (Since the discussion opened with a question about one of Eliezer’s claims, I’ll flag here that I think Eliezer would say “no” to both, which makes the whole problem that much harder.)
I don’t have a precise concept of happiness to try and match precisely. What I actually have is a fuzzy cloud of a concept, which, for its fuzziness, is a pretty good match for a bunch of possible conceptions that the AI could generate.
I’m intuiting a mistake in these two sentences sort of like Eliezer’s analogy of “thinking that an unknown key matches an unknown lock”. Let me try to unpack that intuition a bit.
There’s (at least) two senses in which one could have a “fuzzy cloud of a concept”. First is clusters in the statistical sense; for instance, you could picture mixture-of-Gaussians clustering. In that case, there’s a “fuzzy cloud” in the sense that the cluster doesn’t have a discrete boundary in feature-space, but there’s still a crisp well-defined cluster (i.e. the mean and variance of each cluster are precisely estimable). I can talk about the cluster, and there’s ~no ambiguity in what I’m talking about. That’s what I would call the “ordinary case” when it comes to concepts. But in this case, we’re talking about a second kind of “fuzzy cloud of a concept”—it’s not that there’s a crisp cluster, but rather that there just isn’t a single cluster at all; there’s a bunch of distinct clusters which do not themselves necessarily form a mega-cluster, and it’s ambiguous which one we’re talking about or which one we want to talk about.
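Here’s a quick numerical picture of that distinction (my own illustration, arbitrary numbers): in the first case a single summary statistic pins the concept down just fine; in the second, the summary lands in empty space between the clusters, and “which one do you mean?” is genuinely ambiguous.

```python
# Two senses of a "fuzzy cloud of a concept" (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)

# Sense 1: one fuzzy-but-crisp cluster; the summary statistics describe it well.
crisp = rng.normal(loc=5.0, scale=1.0, size=1000)
print(f"crisp cluster: mean = {crisp.mean():.2f}, std = {crisp.std():.2f}")

# Sense 2: several distinct clusters with no mega-cluster; a single summary
# points at a region where there are no datapoints at all.
scattered = np.concatenate([rng.normal(loc=c, scale=1.0, size=1000)
                            for c in (0.0, 40.0, 100.0)])
mean = scattered.mean()
gap = np.abs(scattered - mean).min()
print(f"scattered 'concept': mean = {mean:.1f}, "
      f"nearest actual datapoint is {gap:.1f} away")
```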
The mistake is in the jump from “we’re not sure which thing we’re talking about or which thing we want to talk about” to “therefore the AI could just latch on to any of the things we might be talking about, and that will be a match for what we mean”. Imagine that Alice says “I want a flurgle. Maybe flurgle means a chair, maybe a petunia, maybe a volcano, or maybe a 50 year old blue whale in heat, not sure.” and then Bob responds “Great, here’s a petunia.”. Like, the fact that [Alice doesn’t know which of these four things she wants] does not mean that [by giving her one of the four things, Bob is giving her what she wants]. Bob is actually giving her a maybe-what-she-wants-but-maybe-not.
...Now, I guess what you’ll say is that if you try to optimize hard on that “pretty good” concept, we’ll goodhart until all of the actual goodness is drained out of it.
If you actually managed to align the AI to the concept in question (in the sense above), I actually think that might turn out alright. Various other issues then become load-bearing, but none of them seem to me as difficult or as central.
The problem is aligning the AI to the concept in question. If we just optimize the AI against e.g. human ratings of some sort, and throw a decent amount of optimization pressure into training, then I don’t expect it ends up aligned to the concept which those ratings are supposed to proxy. I expect it ends up ~aligned to (the AI’s concept of) the rating process itself.
(Again, Eliezer would say something a bit different here—IIUC he’d say that the AI likely ends up aligned to some alien concept.)
At this point, I’ll note something important about Eliezer’s claim at the start of this discussion:
20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map—about the environment, not the optimizer—that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
Note that this claim alone is actually totally compatible with the claim that, if you train an AI on a bunch of labelled examples of “happy” and “unhappy” humans and ask it for more of the “happy”, that just works! (Obviously Eliezer doesn’t expect that to work, but problem 20 by itself isn’t sufficient.) Eliezer is saying here that if you actually optimize hard for rewards assigned by humans, then the humans end up dead. That claim is separate from the question of whether an AI trained on a bunch of labelled examples of “happy” and “unhappy” humans would actually end up optimizing hard for the “happy”/”unhappy” labels.
(For instance, my current best model of Alex Turner at this point is like “well maybe some of the AI’s internal cognition would end up structured around the intended concept of happiness, AND inner misalignment would go in our favor, in such a way that the AI’s internal search/planning and/or behavioral heuristics would also happen to end up pointed at the intended ‘happiness’ concept rather than ‘happy’/‘unhappy’ labels or some alien concept”. That would be the easiest version of the “Alignment by Default” story. Point is, Eliezer’s claim above is actually orthogonal to all of that, because it’s saying that the humans die assuming that the AI ends up optimizing hard for “happy”/”unhappy” labels.)
Now on to more deadly problems. I’ll assume now that we have a reasonably-strong AI directly optimizing for humans’ ratings.
… and actually, I think you probably already have an idea of how that goes wrong? Want to give an answer and/or bid to steer the conversation in a direction more central to what you’re confused about?
[My overall feeling with this response is...I must be missing the point somehow.]
With that mental model in mind, back to the question: if we give an AI a trillion examples of situations you would call happy, does the AI pick up on the commonalities and learn a “pretty good” concept of happiness, that we can use? Well, I definitely imagine that the AI would end up structuring some of its cognition around a roughly-human-like concept (or concepts) of happiness. (And we wouldn’t need a trillion examples for that, or even any labelled examples at all—unsupervised training would achieve that goal just fine.) But that doesn’t mean that any internal planning process uses that concept as a target; the AI isn’t necessarily aligned to the concept.
So for questions like “Does the AI learn a human-like concept of happiness”, we need to clarify whether we’re asking:
Is some of the AI’s internal cognition structured around a human-like concept of happiness, especially in a way that supports something-like-internal-pointers-to-the-concept?
Is there an internal planning/search process which uses the concept as a target, and then drives the AI’s behavior accordingly?
Right. This sounds to me like a classic inner and outer alignment distinction. The AI can learn some human-ontology concepts, to reason about them, but that’s a very different question than “did those concepts get into the motivation system of the AI?”
There’s (at least) two senses in which one could have a “fuzzy cloud of a concept”. First is clusters in the statistical sense; for instance, you could picture mixture-of-Gaussians clustering. In that case, there’s a “fuzzy cloud” in the sense that the cluster doesn’t have a discrete boundary in feature-space, but there’s still a crisp well-defined cluster (i.e. the mean and variance of each cluster are precisely estimable). I can talk about the cluster, and there’s ~no ambiguity in what I’m talking about. That’s what I would call the “ordinary case” when it comes to concepts. But in this case, we’re talking about a second kind of “fuzzy cloud of a concept”—it’s not that there’s a crisp cluster, but rather that there just isn’t a single cluster at all; there’s a bunch of distinct clusters which do not themselves necessarily form a mega-cluster, and it’s ambiguous which one we’re talking about or which one we want to talk about.
There’s also an intermediate state of affairs where there are a number of internally-tight clusters that together form a loose cluster. That is, there’s a number of clusters that have more overlap than a literally randomly selected list of concepts would.
I don’t know if this is cruxy, but this would be my guess of what the “happiness” “concept” is like. The subcomponents aren’t totally uncorrelated. There’s a “there there” to learn at all.
(Let me know if the way I’m thinking here is mathematical gibberish for some reason.)
Imagine that Alice says “I want a flurgle. Maybe flurgle means a chair, maybe a petunia, maybe a volcano, or maybe a 50 year old blue whale in heat, not sure.” and then Bob responds “Great, here’s a petunia.”. Like, the fact that [Alice doesn’t know which of these four things she wants] does not mean that [by giving her one of the four things, Bob is giving her what she wants]. Bob is actually giving her a maybe-what-she-wants-but-maybe-not.
I buy how this example doesn’t end with Alice getting what she wants, but I’m not sure that I buy that it maps well to the case we’re talking about with happiness. If Alice just says “I want a flurgle”, she’s not going to get what she wants. But in training the AI, we’re giving it so many more bits than a single ungrounded label. It seems more like Alice and Bob are going to play a million rounds of 20-questions, or of hot-and-cold, which is very different from giving a single ungrounded string.
(Though I think maybe you were trying to make a precise point here, and I’m jumping ahead to how it applies.)
If you actually managed to align the AI to the concept in question (in the sense above), I actually think that might turn out alright. Various other issues then become load-bearing, but none of them seem to me as difficult or as central.
The problem is aligning the AI to the concept in question. If we just optimize the AI against e.g. human ratings of some sort, and throw a decent amount of optimization pressure into training, then I don’t expect it ends up aligned to the concept which those ratings are supposed to proxy. I expect it ends up ~aligned to (the AI’s concept of) the rating process itself.
It sounds like you’re saying here that the problem is mostly inner alignment?
I expect it ends up ~aligned to (the AI’s concept of) the rating process itself.
I think I don’t understand this sentence. What is a concrete example of being approximately aligned to (the AI’s concept of) the rating process?
Does this mean something like the following...?
The AI does learn / figure out a true / reasonable concept of “human happiness” (even if it is a kind of cobbled together ad hoc concept).
It also learns to predict whatever the rating process outputs.
It ends up motivated by that second thing, instead of that first thing.
I think I’m missing something, here.
This sounds to me like a classic inner and outer alignment distinction. The AI can learn some human-ontology concepts, to reason about them, but that’s a very different question than “did those concepts get into the motivation system of the AI?”
You have correctly summarized the idea, but this is a completely different factorization than inner/outer alignment. Inner/outer is about the divergence between “I construct a feedback signal (external to the AI) which is maximized by <what I want>” vs “the AI ends up (internally) optimizing for <what I want>”. The distinction I’m pointing to is entirely about two different things which are both internal to the AI: “the AI structures its internal cognition around the concept of <thing I want>”, vs “the AI ends up (internally) optimizing for <thing I want>”.
Going back to this part:
The problem is aligning the AI to the concept in question. If we just optimize the AI against e.g. human ratings of some sort, and throw a decent amount of optimization pressure into training, then I don’t expect it ends up aligned to the concept which those ratings are supposed to proxy. I expect it ends up ~aligned to (the AI’s concept of) the rating process itself.
I am not saying that the problem is mostly inner alignment. (Kind of the opposite, if one were to try to shoehorn this into an inner/outer frame, but the whole inner/outer alignment dichotomy is not the shortest path to understand the point being made here.)
Does this mean something like the following...?
The AI does learn / figure out a true / reasonable concept of “human happiness” (even if it is a kind of cobbled together ad hoc concept).
It also learns to predict whatever the rating process outputs.
It ends up motivated by that second thing, instead of that first thing.
That’s exactly the right idea. And the obvious reason it would end up motivated by the second thing, rather than the first, is that the second is what’s actually rewarded—so in any cases where the two differ during training, the AI will get higher reward by pursuing (its concept of) high ratings rather than pursuing (its concept of) “human happiness”.
That’s exactly the right idea. And the obvious reason it would end up motivated by the second thing, rather than the first, is that the second is what’s actually rewarded—so in any cases where the two differ during training, the AI will get higher reward by pursuing (its concept of) high ratings rather than pursuing (its concept of) “human happiness.”
I buy that it ends up aligned to its predictions of the rating process, rather than its prediction of the thing that the rating process is trying to point at (even after the point when it can clearly see that the rating process was intended to model what the humans want, and could optimize that directly).
This brings me back to my starting question though. Is that so bad? Do we have reasons to think that the rating process will be drastically off base somewhere? (Maybe you’re building up to that.)
Imagine that we train an AI in such a way that its internal cognition is generally structured around basically similar concepts to humans. Internal to this AI, there are structures basically similar to human concepts which can be “pointed at” (in roughly the sense of a pointer in a programming language), which means they’re the sorts of things which can be passed into an internal optimization process (e.g. planning), or an internal inference process (e.g. learning new things about the concept), or an internal communication process (e.g. mapping a human word to the concept), etc. Then we might further imagine that we could fiddle with the internals of this AI’s mind to set that concept as the target of some planning process which drives the AI’s actions, thereby “aligning” the AI to the concept.
When I talk about what it means for an AI to be “aligned” to a certain concept, that’s roughly the mental model I have in mind. (Note that I don’t necessarily imagine that’s a very good description of the internals of an AI; it’s just the setting in which it’s most obvious what I even mean by “align”.)
With that mental model in mind, back to the question: if we give an AI a trillion examples of situations you would call happy, does the AI pick up on the commonalities and learn a “pretty good” concept of happiness, that we can use? Well, I definitely imagine that the AI would end up structuring some of its cognition around a roughly-human-like concept (or concepts) of happiness. (And we wouldn’t need a trillion examples for that, or even any labelled examples at all—unsupervised training would achieve that goal just fine.) But that doesn’t mean that any internal planning process uses that concept as a target; the AI isn’t necessarily aligned to the concept.
So for questions like “Does the AI learn a human-like concept of happiness”, we need to clarify whether we’re asking:
Is some of the AI’s internal cognition structured around a human-like concept of happiness, especially in a way that supports something-like-internal-pointers-to-the-concept?
Is there an internal planning/search process which uses the concept as a target, and then drives the AI’s behavior accordingly?
I would guess “yes” for the former, and “no” for the latter. (Since the discussion opened with a question about one of Eliezer’s claims, I’ll flag here that I think Eliezer would say “no” to both, which makes the whole problem that much harder.)
I reread this bit.
Just to clarify, by “human-like concept of happiness”, you don’t mean “prediction of the rating process”. You mean, “roughly what Eli means when he says ‘happiness’ taking into account that Eli hasn’t worked out his philosophical confusions about it, yet”, yeah?
I’m not entirely sure why you think that human-ish concepts get into the cognition, but not into the motivation.
My guess about you is that...
You think that there are natural abstractions, so the human-ish concepts of eg happiness are convergent. Unless you’re doing something weird on purpose, an AI looking at the world and carving reality at the joints will develop close to the same concept as the humans have, because it’s just a productive concept for modeling the world.
But the motivation system is being shaped by the rating process, regardless of what other concepts the system learns.
Is that about right?
Just to clarify, by “human-like concept of happiness”, you don’t mean “prediction of the rating process”. You mean, “roughly what Eli means when he says ‘happiness’ taking into account that Eli hasn’t worked out his philosophical confusions about it, yet”, yeah?
Yes.
My guess about you is that...
You think that there are natural abstractions, so the human-ish concepts of eg happiness are convergent. Unless you’re doing something weird on purpose, an AI looking at the world and carving reality at the joints will develop close to the same concept as the humans have, because it’s just a productive concept for modeling the world.
But the motivation system is being shaped by the rating process, regardless of what other concepts the system learns.
Is that about right?
Also yes, modulo uncertainty about how natural an abstraction “happiness” is in particular (per our above discussion about whether it’s naturally one “cluster”/”mega-cluster” or not).
[thumb up]
And the fewer of the things we care about that are natural abstractions, the harder our job is. If our concepts are unnatural, we have to get them into the AI’s cognition, in addition to getting them into the AI’s motivation.
This brings me back to my starting question though. Is that so bad? Do we have reasons to think that the rating process will be drastically off base somewhere? (Maybe you’re building up to that.)
Excellent, sounds like we’re ready to return to main thread.
Summary of the mental model so far:
We have an AI which develops some “internal concepts” around which it structures its cognition (which may or may not match human concepts reasonably well; that’s the Natural Abstraction Hypothesis part).
Training will (by assumption in this particular mental model) induce the AI to optimize for (some function of) its internal concepts.
Insofar as the AI optimizes for [its-internal-concept-of] [the-process-which-produces-human-ratings] during training, it will achieve higher reward in training than if it optimizes for [its-internal-concept-of] [human happiness, or whatever else the ratings were supposed to proxy]. The delta between those two during training is because of all the ordinary everyday ways that human ratings are a terrible proxy for what humans actually want (as we discussed above).
… but now, in our mental model, the AI finishes training and gets deployed. Maybe it’s already fairly powerful, or maybe it starts to self-improve and/or build successors. Point is, it’s still optimizing for [its-internal-concept-of] [the-process-which-produced-human-ratings], but now that it’s out in the world it can apply a lot more optimization pressure to that concept.
So, for instance, maybe [the-AI’s-internal-concept-of] [the-process-which-produced-human-ratings] boils down to [its model of] “a hypothetical human would look at a few snapshots of the world taken at such-and-such places at such-and-such times, then give a thumbs up/thumbs down based on what they see”. And then the obvious thing for the AI to do is to optimize really hard for what a hypothetical camera at those places and times would see, and turn the rest of the universe into <whatever> in order to optimize those snapshots really hard.
Or, maybe [the-AI’s-internal-concept-of] [the-process-which-produced-human-ratings] ends up pointing to [its model of] the actual physical raters in a building somewhere. And then the obvious thing for the AI to do is to go lock those raters into mechanical suits which make their fingers always press the thumbs-up button.
Or, if we’re luckier than that, [the-AI’s-internal-concept-of] [the-process-which-produced-human-ratings] ends up pointing to [its model of] the place in the software which records the thumbs-up/thumbs-down press, and then the AI just takes over the rating software and fills the database with thumbs-up. (… And then maybe tiles the universe with MySQL databases full of thumbs-up tokens, depending on exactly how the AI’s internal concept generalizes.)
Do those examples make sense?
Yeah all those examples make sense on the face of it. These are classic reward misspecification AI risk stories.
[I’m going to babble a bit in trying to articulate my question / uncertainty here.]
But because they’re classic AI risk stories, I expect those sorts of scenarios to be penalized by the training process.
Part of the rating process will be “seizing the raters and putting them in special ‘thumbs up’-only suits...that’s very very bad.” In simulation, actions like that will be penalized a lot. If it goes and does that exact thing, that means that our training process didn’t work at all.
We shaped the AI’s motivation system to be entirely ~aligned to its concept of the rating process, and 0% aligned to the referent of the rating process.
Is that realistic? It seems like early on in the training of an AI system it won’t yet have a crisp model of the rating process, and its motivation will be shaped in a much more ad hoc way: individual things are good and bad, and with scattered, semi-successful attempts at generalizing deeper principles from those individual instances.
Later in the training process, maybe it gets a detailed model of the rating process, and internal processes that ~align with the detailed model of the rating process get reinforced over and above competing internal impulses, like “don’t hurt the humans”...which, I’m positing, perhaps in my anthropocentrism, is a much easier and more natural hypothesis to generate, and therefore holds more sway early in the training before the AI is capable enough to have a detailed model of the rating process.
There’s no element of that earlier, simpler, kind of reasoning left in the AI’s motivation system when it is deployed?
…Or I guess, maybe there is, but then we’re just walking into a nearest unblocked strategy problem, where the AI doesn’t do any of the things we specifically trained it not to do, but it does the next most [concept of the rating process]-optimizing strategy that wasn’t specifically trained against.
...
Ok, there is a funny feature of my mental model of how things might be fine here, which is that it depends both on the AI generalizing, but also on it not generalizing too much.
Like, on the one hand, A, I’m expecting the AI’s motivation system to generalize from…
“stabbing the human with knives is very bad”
“shooting the human is very bad.”
“freezing the human in carbonite is very bad”
...to
“violating the human’s bodily autonomy is very bad.”
But on the other hand, B, I’m not expecting the AI’s motivation system to generalize so far that it generalizes all the datapoints into a model of the rating process that generated them, and hew to that, at the expense of any “naive” reading of any of the particular data points, when the naive reading differs from what the model of the rating process predicts.
If you don’t have at least as much generalization as A, your AI is dangerous because (e.g.) it will learn that it can’t stab humans in the chest with steel, serrated knives with pine-wood handles, but it thinks stabbing them in the chest with steel, serrated knives with birch-wood handles is a clever way to get what it wants.
But if you get as much generalization as B, you no longer have any of the safety that you hoped to get from the “naive” reading of the datapoints. Once the AI generalizes that much, every data point is just reinforcing the objective of optimizing for the output of [concept of the rating process], which gets you 0 corrigibility.
Let me check if I think that’s true.
Note that our story here isn’t quite “reward misspecification”. That’s why we needed all that machinery about [the AI’s internal concept of] <stuff>. There’s a two-step thing here: the training process gets the AI to optimize for one of its internal concepts, and then that internal concept generalizes differently from whatever-ratings-were-meant-to-proxy-for.
That distinction matters for e.g. this example:
Part of the rating process will be “seizing the raters and putting them in special ‘thumbs up’-only suits...that’s very very bad.” In simulation, actions like that will be penalized a lot. If it goes and does that exact thing, that means that our training process didn’t work at all.
If the AI has a detailed internal model of the training process, and the training process includes sticking the AI in a simulation, then presumably the AI has an internal model of the simulation (including an internal self model). So during training, when this “thumbs-up-only suits” scenario comes up, the AI’s actual reasoning will route through something like:
Ok, I have the opportunity to put these simulated humans in thumbs-up-only suits.
If I do that, then the actual humans who produce the actual ratings will give a bad rating.
Therefore I won’t do that.
… and that reasoning gets reinforced. Then when the AI is out of simulation, it reasons:
Ok, I have the opportunity to put the actual humans who produce the actual ratings in thumbs-up-only suits.
If I do that, then the actual ratings will be great.
Therefore I do that.
(This sounds like a typical “the AI is strategically aware, and knows it is in a simulation” story, and it is. But note two things which are not always present in such stories:
First, there’s a clear reason for the AI to at least consider the hypothesis that it’s in a simulation: by assumption, it has an internal model of the training process, and the training process includes simulating the AI, so the AI has an internal model of itself-in-a-simulation as part of the training process.
Second, the AI’s cognition doesn’t involve any explicit deception, or even any non-myopia; this story all goes through just fine even if it’s only optimizing for single-episode reward during training. It doesn’t need to be planning ahead about getting into deployment, or anything like that, it’s just using an accurate model of the training process.
)
I sure do have the intuition that, if you anti-reinforce the motivation system every time it proposes doing anything that threatens the human’s bodily autonomy, it will be shaped, at least in part, to not want to violate the human’s bodily autonomy.
But I guess maybe that might just be me failing to model the limit case, in which every reinforcement event is >= 0 Bayesian evidence for “conform to the [concept of the rating process]” over “follow the naive reading of this datapoint.” In the limit, the motivation system gets shaped entirely by the hypothesis that actually predicts the reinforcement?
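Running with that Bayesian framing for a second (this is just my toy rendering of the limit argument, with invented numbers): suppose only a small fraction of training datapoints are ones where the rating diverges from the naive reading. Those are the only datapoints that discriminate between the two hypotheses, and they all point the same way, so the “conform to the rating process” hypothesis wins eventually.

```python
# Toy Bayesian version of the limit argument (invented numbers, illustration only).
import math, random

random.seed(0)

RATER_ERROR_RATE = 0.02  # fraction of datapoints where the rating diverges from the naive reading
NOISE = 0.05             # residual noise either hypothesis tolerates

def posterior_after(n_datapoints: int) -> float:
    """Posterior on 'the signal tracks the rating process' vs 'it tracks the naive
    reading', starting from even odds."""
    log_odds = 0.0
    for _ in range(n_datapoints):
        rater_errs = random.random() < RATER_ERROR_RATE
        p_rating = 1.0 - NOISE                          # predicts the signal either way
        p_naive = NOISE if rater_errs else 1.0 - NOISE  # fails exactly when the rater errs
        log_odds += math.log(p_rating) - math.log(p_naive)  # never negative
    return 1.0 / (1.0 + math.exp(-log_odds))

for n in (100, 1_000, 10_000):
    print(f"{n:>6} datapoints -> posterior on the rating-process hypothesis: {posterior_after(n):.4f}")
```

In this toy version, the datapoints where the naive reading and the rating agree (the vast majority) don’t push back at all; they’re zero evidence either way.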
My intuitions are still not convinced.
One reason why not is that, by the time the AI is strategically aware, by the Gandhi folk theorem, it is incentivized to maintain its current values.
(Which I guess would mean a kind of reverse treacherous turn, where in simulation, it predicts the errors in the rating process, and conforms its actions to them, so that its motivation system isn’t reshaped to conform to those datapoints. And then when it is deployed, it throws off its deception and does what it wanted all along, which is a combination of hacking the rating process and also not hurting the humans, since that’s a thing that it learned to care about in its infancy.
I am aware that I seem to have talked myself into a bizarre scenario.)
Alright, lemme try to talk to those intuitions in particular.
First, some more parts of the mental model. So far, I’ve talked about “alignment” as meaning that the AI has some internal search process, which optimizes for some AI-internal concept, and the plan/actions chosen by that search process are then implemented in the world. In that context, an AI-internal concept is the only “type of thing” which it makes sense to “align” an AI to.
But now you’ve introduced another (quite sensible) notion of “alignment” of the AI: rather than internal concepts and world model and explicit search/planning, the AI may (also) have some internal hard-coded urges or instincts. And those instincts can be directly aligned with something-out-in-the-world, insofar as they induce behavior which tends to produce the thing-out-in-the-world. (We could also talk about the whole model + search + internal-concept system as being “aligned” with something-out-in-the-world in the same way.)
Key thing to notice: this divide between “directly-‘useful’ hardcoded urges/instincts” vs “model + search + internal-concept” is the same as the general divide between non-general sphex-ish “intelligence” and “general intelligence”. Rough claim: an artificial general intelligence is general at all, in the first place, to basically the extent that its cognition routes through the “model + search + internal-concept” style of reasoning, rather than just the “directly-‘useful’ hardcoded urges/instincts” version.
(Disclaimer: that’s a very rough claim, and there’s a whole bunch of caveats to flesh out if you want to operate that mental model well.)
Now, your intuitions about all this are presumably driven largely by observing how these things work in humans. And as the saying goes, “humans are the least general intelligence which can manage to take over the world at all”—otherwise we’d have taken over the world earlier. So humans are big jumbles of hard-coded urges/instincts and general-purpose search.
Will that also apply to AI? One could argue in the same way: the first AI to take off will be the least general intelligence which can manage to supercritically iteratively self-improve at all. On the other hand, as Quintin likes to point out, the way we train AI is importantly different from evolution in several ways. If AI passes criticality in training, it will likely still be trained for a while before it’s able to break out or gradient hack or whatever, and it might even end up myopic. So we do have strong reason to expect AI’s motivations to be less heavily tied up in instincts/urges than humans’ motivations. (Though there’s an exception for “instincts/urges” which the AI reflectively hard-codes into itself as a computational shortcut, which are a very conceptually different beast from evolved instincts/urges.)
On the other other hand, if the AI’s self-improvement critical transition happens mainly in deployment (for instance, maybe people figure out better prompts for something AutoGPT-like, and that’s what pushes it over the edge), then the “least general intelligence which can takeoff at all” argument is back. So this is all somewhat dependent on the takeoff path.
Does that help reconcile your competing intuitions?
Rough claim: an artificial general intelligence is general at all, in the first place, to basically the extent that its cognition routes through the “model + search + internal-concept” style of reasoning, rather than just the “directly-‘useful’ hardcoded urges/instincts” version.
Hm. I’m not sure that I buy that. GPT-4 is pretty general, and I don’t know what’s happening in there, but I would guess that it is a lot closer to a pile of overlapping heuristics than it is a thing doing “model + search + internal-concept” style of reasoning. Maybe I’m wrong about this though, and you can correct me.
On the other hand, humans are clearly doing some of the “model + search + internal-concept” style of reasoning, including a lot of it that isn’t explicit.
<Tangent>
One of the things about humans that leaves me most impressed with evolution, is that Natural Selection does somehow get the concept of “status” into the human, and the human is aligned to that concept in the way that you describe here.
Evolution somehow gave humans some kind of inductive bias such that our brains are reliably able to learn what it is to be “high status”, even though many of the concrete markers for this are as varied as human cultures. And further, it successfully hooked up the motivation and planning systems to that “status” concept, so that modern humans successfully e.g. navigate career trajectories and life paths that are completely foreign to the EEA, in order to become prestigious by the standards of the local culture.
And this is one of the major drivers of human behavior! As Robin Hanson argues, a huge portion of our activity is motivated by status-seeking and status-affiliation.
This is really impressive to me. It seems like natural selection didn’t do so hot at aligning humans to inclusive genetic fitness. But it did kind of shockingly well at aligning humans to “status”, all things considered.
I guess that we can infer from this that having an intuitive “status” concept was much more strongly instrumental for attaining high inclusive genetic fitness in the ancestral environment, than having an intuitive concept of “inclusive genetic fitness” itself, since that’s what was selected for.
Also, this seems like good news about alignment. It looks to me like “status” generalized really well across the distributional shift, though perhaps that’s because I’m drawing the target around where the arrow landed.
</Tangent>
I don’t really know how far you can go with a bunch of overlapping heuristics without much search. But, yeah, the impressive thing about humans seems to be how they can navigate situations to end up with a lot of prestige, and not that they have a disgust reaction about eating [gross stuff].
I’m tentatively on board with “any AGI worth the ‘G’ will be doing some kind of ‘model + search + internal-concept’ style of reasoning.” It is unclear how much other evolved heuristic-y stuff will also be in there. It does seem like, in the limit of training, there would be 0 of that stuff left, unless the AGI just doesn’t have the computational capacity for explicit modeling and search to beat simpler heuristics.
(Which, for instance, seems true about humans, at least in some cases: If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught out by other humans, the heuristic / value of “actually care about your friends” is competitive with “always be calculating your personal advantage.”
I expect this sort of thing to be less common with AI systems that can have much bigger “cranial capacity”. But then again, I guess that at whatever level of brain size, there will be some problems for which it’s too inefficient to do them the “proper” way, and for which comparatively simple heuristics / values work better.
But maybe at high enough cognitive capability, you just have a flexible, fully-general process for evaluating the exact right level of approximation for solving any given problem, and the binary distinction between doing things the “proper” way and using comparatively simpler heuristics goes away. You just use whatever level of cognition makes sense in any given micro-situation.)
Now, your intuitions about all this are presumably driven largely by observing how these things work in humans. And as the saying goes, “humans are the least general intelligence which can manage to take over the world at all”—otherwise we’d have taken over the world earlier. So humans are big jumbles of hard-coded urges/instincts and general-purpose search.
Will that also apply to AI? One could argue in the same way: the first AI to take off will be the least general intelligence which can manage to supercritically iteratively self-improve at all. On the other hand, as Quintin likes to point out, the way we train AI is importantly different from evolution in several ways. If AI passes criticality in training, it will likely still be trained for a while before it’s able to break out or gradient hack or whatever, and it might even end up myopic. So we do have strong reason to expect AI’s motivations to be less heavily tied up in instincts/urges than humans’ motivations. (Though there’s an exception for “instincts/urges” which the AI reflectively hard-codes into itself as a computational shortcut, which are a very conceptually different beast from evolved instincts/urges.)
On the other other hand, if the AI’s self-improvement critical transition happens mainly in deployment (for instance, maybe people figure out better prompts for something AutoGPT-like, and that’s what pushes it over the edge), then the “least general intelligence which can takeoff at all” argument is back. So this is all somewhat dependent on the takeoff path.
All this makes sense to me. I was independently thinking that we should expect humans to be a weird edge-case since they’re mostly animal impulse with just enough general cognition to develop a technological society. And if you push further along the direction in which humans are different from other apes, you’ll plausibly get something that is much less animal-like, in some important way.
But I’m inclined to be very careful about forecasting what Human++ is like. It seems like a reasonable guess to me that they do a lot more strategic instrumental reasoning / rely a lot more on “model + search + internal-concept” style internals / are generally a lot more like a rational agent abstraction.
I would have been more compelled by those arguments before I saw GPT-4, after which I was like “well, it seems like things will develop in ways that are pretty surprising, and I’m going to put less weight on arguments about what AI will obviously be like, even in the limit cases.”
That all sounds about right. Where are we currently at? What are the current live threads?
I’m re-reading the whole dialog so far and resurfacing my confusions.
It seems to me that the key idea is something like the following:
“By hypothesis, your superintelligent AI is really good at generalizing from datapoints / really good at correctly inferring the latent causes of its observations.” We know it’s good at that because that’s basically what intelligence is.
The AI will differentially generalize a series of datapoints to the true theory that predicts them, rather than to a false theory, by dint of its intelligence.
And this poses a problem, because the true theory that generates the reinforcement datapoints we’re using to align the superintelligence is “this particular training process right here”, which is different from the policy we’re trying to point to with that training process: the “morality” that we’re hoping the AI will generalize the reinforcement datapoints to.
So the reinforcement machinery learns to conform to its model of the training process, not to what we hoped to point at with the training process.
An important supporting claim here is that the AI’s motivation system is using the AI’s intelligence to generalize from the datapoints, instead of learning some relatively narrow heuristics / urges. And crucially, if this isn’t happening, your alignment won’t work, because a bunch of narrow heuristics, with no generalization, don’t actually cover all the dangerous nearest unblocked strategies. You need your AI’s motivation system to generalize to something abstract enough that it could apply to every situation we might find ourselves in, in the future.
I think I basically get this. And maybe I buy it? Or buy it as much as I’m currently willing to buy arguments about what Superintelligences will look like, which is “yeah, this analytic argument seems like it picks out a good guess, but I don’t know man, probably things will be weird in ways that I didn’t predict at all.”
Just to check: it seems to me that this wouldn’t be a problem if the human raters were somehow omniscient. If that were true, there would no longer be any difference between “that rating process over there” and the actual referent we were trying to point at with the rating process. They would both give the same data, and so the AI would end up with the same motivational abstractions, regardless of what it believes about the rating process.
That summary basically correctly expresses the model, as I understand it.
Just to check: it seems to me that this wouldn’t be a problem if the human raters were somehow omniscient. If that were true, there would no longer be any difference between “that rating process over there” and the actual referent we were trying to point at with the rating process.
Roughly speaking, yes, this problem would potentially go away if the raters were omniscient. Less roughly speaking, omniscient raters would still leave some things underdetermined—i.e. it’s underdetermined whether the AI ends up wanting “happy humans”, or “ratings indicating happy humans” (artificially restricting our focus to just those two possibilities), if those things are 100% correlated in training. (Other factors would become relevant, like e.g. simplicity priors.)
Without rater omniscience, they’re not 100% correlated, so the selection pressure will favor the ratings over the happy humans.
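To make that selection-pressure claim concrete, here is a minimal toy sketch (mine, not from the dialogue; the rater-error model, the numbers, and the two candidate “objectives” are all illustrative assumptions): two candidate internal objectives are scored by how well they predict the actual reinforcement signal, one tracking real human happiness and one tracking the rater’s judgement of it.

```python
# Toy model: candidate objective "track happiness" vs "track the ratings",
# each scored by how well it predicts the actual reinforcement signal.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

happiness = rng.normal(size=n)              # what we wanted to point at
rater_error = 0.5 * rng.normal(size=n)      # raters are not omniscient
reward = happiness + rater_error            # the signal actually used for training

def fit_loss(feature, target):
    """Least-squares fit of target ~ a*feature + b; returns mean squared error."""
    a, b = np.polyfit(feature, target, 1)
    return np.mean((target - (a * feature + b)) ** 2)

print("loss of 'care about happy humans':", fit_loss(happiness, reward))  # ~0.25
print("loss of 'care about the ratings': ", fit_loss(reward, reward))     # ~0
# The ratings-tracking objective predicts the reinforcement signal strictly
# better whenever rater_error is nonzero, so that is what gets selected.
# With omniscient raters (rater_error = 0) the two losses coincide, and the
# choice between them falls to other factors, e.g. simplicity priors.
```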
it’s underdetermined whether the AI ends up wanting “happy humans”, or “ratings indicating happy humans”
Why might these have different outputs, for any input, if the raters are omniscient?
Without rater omniscience, they’re not 100% correlated, so the selection pressure will favor the ratings over the happy humans.
Right ok.
That does then leave a question of how much the “omniscience gap” can be made up by other factors.
Like, suppose you had a complete solution to ELK, such that you can know and interpret everything that the AI knows. It seems like this might be good enough to get the kind of safety guarantees that we’re wanting here. The raters don’t know everything, but crucially the AI doesn’t know anything that the raters don’t. I think that would be enough to have effectively non-lethal ratings?
Does that sound right to you?
Why might these have different outputs, for any input, if the raters are omniscient?
They won’t have different outputs, during training. But we would expect them to generalize differently outside of training.
Like, suppose you had a complete solution to ELK, such that you can know and interpret everything that the AI knows. It seems like this might be good enough to get the kind of safety guarantees that we’re wanting here. The raters don’t know everything, but crucially the AI doesn’t know anything that the raters don’t. I think that would be enough to have effectively non-lethal ratings?
At that point, trying to get the desired values into the system by doing some kind of RL-style thing on ratings would probably be pretty silly anyway. With that level of access to the internals, we should go for retargeting the search or some other strategy which actually leverages a detailed understanding of internals.
That said, to answer your question… maybe. Maybe that would be enough to have effectively non-lethal ratings. It depends heavily on what things the AI ends up thinking about at all. We’d probably at least be past the sort of problems we’ve discussed so far, and on to other problems, like Oversight Misses 100% Of Thoughts The AI Does Not Think, or selection pressure against the AI thinking about the effects of its plans which humans won’t like, or the outer optimization loop Goodharting against the human raters by selecting for hard-coded strategies in a way which doesn’t show up as the AI thinking about the unwanted-by-the-humans stuff.
They won’t have different outputs, during training. But we would expect them to generalize differently outside of training.
Ok. That sounds like not a minor problem!
But I guess it is a different problem than the problem of “biased ratings killing you”, so maybe it’s for another day.
A prime example of what (I believe) Yudkowsky is talking about in this bullet point is Social Desirability Bias.
“What is the highest cost we are willing to pay in order to save a single child dying from leukemia?” Obviously the correct answer is not infinite. Obviously teaching an AI that the answer to this class of questions is “infinite” is lethal. Also, incidentally, most humans will reply “infinite” to this question.
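As a toy illustration of why that answer becomes a problem once something actually optimizes against it (a sketch of my own; the causes, numbers, and the naive linear planner are all made up for illustration):

```python
# If the learned value of one cause is effectively unbounded (the "infinite"
# survey answer), a planner maximizing learned value sends the whole budget there.
stated_value_per_dollar = {
    "save one child with leukemia": 1e12,   # the socially desirable "infinite" answer
    "malaria prevention": 3.0,
    "education": 2.0,
    "infrastructure": 1.5,
}
budget = 1_000_000

# A simple linear planner: put every dollar where the stated marginal value is highest.
best = max(stated_value_per_dollar, key=stated_value_per_dollar.get)
allocation = {cause: (budget if cause == best else 0) for cause in stated_value_per_dollar}
print(allocation)
# Everything else gets zero, because the training data said one good was beyond
# price; and since the bias is systematic, collecting more answers doesn't fix it.
```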
+1; this seems basically similar to the cached argument I have for why human values might be more arbitrary than we’d like—very roughly speaking, they emerged on top of a solution to a specific set of computational tradeoffs while trying to navigate a specific set of repeated-interaction games, and then a bunch of contingent historical religion/philosophy on top of that. (That second part isn’t in the argument you [Eli] gave, but it seems relevant to point out; not all historical cultures ended up valuing egalitarianism/fairness/agency the way we seem to.)
I always get the impression that Alex Turner and his associates are just imagining much weaker optimization processes than Eliezer or I or probably also you are. Alex Turner’s arguments make a lot of sense to me if I condition on some ChatGPT-like training setup (imitation learning + action-level RLHF), but not if I condition on the negation (e.g. brain-like AGI, or sufficiently smart scaffolding to identify lots of new useful information and integrate it, or …).
I think there’s a missing connection here; at least, it read as a non sequitur to me at first. On my first read, I thought this was positing that scaling up a given human’s computational capacity, ceteris paribus, would make them lie more. That seems like a strong claim (though maybe true for some people).
But I think it’s instead claiming that if humans in general had evolved under conditions of greater computational capacity, then the ‘actually care about your friends’ heuristic might have ended up with less weight. That seems plausible (though the self-play aspect of natural selection means this depends in part on how offence/defence scales for lying versus detection).
+1, that’s what I understood the claim to be.
A classic statement of this is by Bostrom, in Superintelligence.
This is not quite true. If you select infinitely hard for high values of a proxy U = X+V where V is true utility and X is error, you get infinite utility in expectation if utility is easier to optimize for (has heavier tails) than error. There are even cases where you get infinite utility despite error having heavier tails than utility, like if error and true utility are independent and both are light-tailed.
Drake Thomas and I proved theorems about this here, and there might be another post coming soon about the nonindependent case.
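For intuition, here is my own quick numerical sketch (not the theorems from the linked post; the particular distributions are arbitrary choices): hard selection on the proxy behaves very differently depending on which term has the heavier tails.

```python
# Approximate "select infinitely hard on U = X + V" by keeping the top few
# samples out of many, and compare the true utility V of what gets selected.
import numpy as np

rng = np.random.default_rng(0)
n, k = 1_000_000, 100

def mean_true_utility_of_top(V, X):
    U = V + X
    top = np.argpartition(U, -k)[-k:]       # indices of the k largest proxy values
    return V[top].mean()

# Case 1: true utility heavy-tailed (Student-t, df=3), error light-tailed (normal).
print(mean_true_utility_of_top(rng.standard_t(3, n), rng.standard_normal(n)))
# Case 2: tails reversed; selection mostly rewards the error term instead.
print(mean_true_utility_of_top(rng.standard_normal(n), rng.standard_t(3, n)))
# Case 1 gives a large mean true utility among the selected samples; case 2
# stays only slightly above the unselected mean of 0, i.e. almost all of the
# selection pressure went into the error term.
```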
I think I’m not super into the U = V + X framing; that seems to inherently suggest that there exists some component of the true utility V “inside” the proxy U everywhere, and which is merely perturbed by some error term rather than washed out entirely (in the manner I’d expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn’t regressional, and so V and X aren’t independent.
(Consider e.g. two arbitrary functions U’ and V’, and compute the “error term” X’ between them. It should be obvious that when U’ is maximized, X’ is much more likely to be large than V’ is; which is simply another way of saying that X’ isn’t independent of V’, since it was in fact computed from V’ (and U’). The claim that the reward model isn’t even “approximately correct”, then, is basically this: that there is a separate function U being optimized whose correlation with V within-distribution is in some sense coincidental, and that out-of-distribution the two become basically unrelated, rather than one being expressible as a function of the other plus some well-behaved error term.)
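A quick numerical version of that parenthetical (my own sketch; independent standard normals stand in for the “arbitrary” functions):

```python
# Draw independent U' and V' over many inputs, define X' = U' - V', and check
# which of V' or X' is large at the input that maximizes U'.
import numpy as np

rng = np.random.default_rng(0)
trials, n_inputs = 1_000, 10_000
x_wins = 0
for _ in range(trials):
    U = rng.standard_normal(n_inputs)   # arbitrary proxy values over the input space
    V = rng.standard_normal(n_inputs)   # arbitrary "true utility", independent of U
    X = U - V                           # the induced "error term"
    i = U.argmax()                      # optimize the proxy
    if X[i] > V[i]:
        x_wins += 1
print(f"X' > V' at the proxy optimum in {x_wins / trials:.0%} of trials")
# Typically ~98%: at the argmax of U', X' is large while V' is just a typical
# draw, because X' was constructed from U' and so is not independent of V'.
```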
I think independence is probably the biggest weakness of the post just because it’s an extremely strong assumption, but I have reasons why the U = V+X framing is natural here. The error term X has a natural meaning in the case where some additive terms of V are not captured in U (e.g. because they only exist off the training distribution), or some additive terms of U are not in V (e.g. because they’re ways to trick the overseer).
The example of two arbitrary functions doesn’t seem very central, because it seems to me that if we train U to approximate V, their correlation in-distribution will be due to the presence of features in the data, rather than being coincidental. Maybe the features won’t be additive or independent, though, and we should think about those cases. It still seems possible to prove things if you weaken independence to unbiasedness.
Agree that we currently only analyze regressional and perhaps extremal Goodhart; people should be thinking about the other two as well.
Very interesting