[Probably a noob question]
I’m thinking about what an inner alignment failure might look like for GPT-3. This would have to involve some deployment context in which GPT-3 performs significantly worse (by the standards of the base objective) than it did in training. (It would involve other things too, such as GPT-3 being a mesa-optimizer.)
But to say how well GPT-3 performs on some prompt not in the training dataset, we have to have a definition of the base objective that extends beyond the training dataset. If the base objective only makes sense in the context of the training dataset, then inner alignment failure is impossible by definition.
Is the base objective “Predict the next word”? Or is it “Predict the next word, supposing what you are reading is typical 2019 Internet text”? Or is it “Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D” [where D is the dataset used to train GPT-3]? The third option is in some sense the best, because it most closely fits what we actually did to train GPT-3. But note that the logical extension of this line of reasoning is to prefer a fourth option: “Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D’” [where D’ is like D except that it doesn’t contain any of the bits of text that GPT-3 happened not to see in training, and the randomness weights are chosen to more accurately yield the data points that GPT-3 in fact saw].
The problem with these last two answers is that they leave it undefined how well GPT-3 performs, by the lights of the base objective, on any prompt that wasn’t in D, which then rules out pseudo-alignment by definition.
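To make the third construal concrete, here is a toy sketch (the dataset, weights, and model below are all made up for illustration): the base objective is an expected next-token loss over weighted samples from D, so it simply assigns no score to prompts outside D.

```python
import math

# Hypothetical toy dataset "D" of (token-sequence, sampling-weight) pairs,
# standing in for construal 3 from the post. All values are illustrative.
D = [((0, 1, 2), 0.5), ((0, 2, 1), 0.3), ((1, 0, 2), 0.2)]

def cross_entropy(model_probs, sequence):
    """Average next-token cross-entropy of a model on one sequence."""
    losses = []
    for i in range(1, len(sequence)):
        context, target = sequence[:i], sequence[i]
        losses.append(-math.log(model_probs(context)[target]))
    return sum(losses) / len(losses)

def base_objective_construal_3(model_probs):
    """Expected loss under weighted sampling from D. Note that this
    objective is defined ONLY via D: it says nothing at all about how
    the model does on prompts that never appear in D."""
    return sum(weight * cross_entropy(model_probs, seq) for seq, weight in D)

# A toy "model" that predicts uniformly over the 3-token vocabulary,
# ignoring its context.
uniform = lambda context: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}

loss = base_objective_construal_3(uniform)  # every per-token term is log(3)
```

The point of the sketch is just that `base_objective_construal_3` takes its expectation over D and nowhere else, so "how well does the model score on a prompt outside D?" is not a question the objective answers.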
From the Risks from Learned Optimization paper:
In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. In reinforcement learning (RL), for example, the base objective is generally the expected return. Because the mesa-objective is not specified by the programmers, mesa-optimization opens up the possibility of a mismatch between the base and mesa-objectives, wherein the mesa-objective might seem to perform well on the training environment but lead to bad performance off the training environment. We will refer to this case as pseudo-alignment below.
Expected return in a particular environment/distribution, or not? If not, then you may be in a deployment context where the weights are no longer being updated, and so there is no expected return; or at least it’s close to 0, because there’s only any return if you can convince people to start updating your weights again!
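For reference, here is a minimal sketch of what “expected return” usually means: a Monte Carlo average of the discounted return over episodes drawn from some environment/distribution. Everything below is illustrative (a toy environment, not any particular RL library’s API), and the key point is that the estimate is only defined relative to whatever distribution the episodes are sampled from.

```python
import random

def discounted_return(rewards, gamma):
    """Discounted sum of rewards for one episode: G = sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def expected_return(sample_episode, n_episodes, gamma, seed=0):
    """Monte Carlo estimate of expected return. The estimate is only
    meaningful relative to whatever distribution sample_episode draws
    from -- which is exactly the ambiguity at issue in the thread."""
    rng = random.Random(seed)
    returns = [discounted_return(sample_episode(rng), gamma)
               for _ in range(n_episodes)]
    return sum(returns) / n_episodes

# Toy "environment": every episode is three steps of reward 1.
episode = lambda rng: [1.0, 1.0, 1.0]
est = expected_return(episode, n_episodes=10, gamma=0.5)  # 1 + 0.5 + 0.25
```

If deployment draws episodes from a different distribution than training did, `expected_return` computed over the training distribution just doesn’t speak to deployment behavior, which is the definitional worry raised above.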
I worry I am just confused about all this; hence my asking. What is GPT-3’s base objective?
My current position is that this is the wrong question to be asking—instead, I think the right question is just “what is GPT-3’s training story?” Then, we can just talk about to what extent the training rationale is enough to convince us that we would get the desired training goal vs. some other model, like a deceptive model, instead—rather than having to worry about what technically counts as the base objective, mesa-objective, etc.
I was wondering if that was the case, haha. Thanks!
This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot… now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?
I do like your new thing and it seems better to me in some ways, but worse in others. I worry about a failure mode where people exploit ambiguity and norm-laden concepts to convince themselves of happy fairy tales. I should think more about this and write a comment.
ETA: Here’s an attempt to salvage the original inner/outer alignment problem framing:
We admit up front that it’s a bit ambiguous what the base objective is, and thus there will be cases where it’s ambiguous whether a mesa-optimizer is aligned to the base objective.
However, we say this isn’t a big deal. We give a handful of examples of “reasonable construals” of the base objective, like I did in the OP, and say that all the classic arguments are arguments for the plausibility of cases where a mesa-optimizer is misaligned with every reasonable construal of the base objective.
Moreover, we make lemonade out of lemons, and point out that the fact that there are multiple reasonable construals is itself reason to think inner alignment problems are serious and severe. I’m imagining an interlocutor who thinks “bah, it hasn’t been established yet that inner-alignment problems are even a thing; it still seems like the default hypothesis is that you get what you train for, i.e. you get an agent that is trying to maximize predictive accuracy or whatever.” And then we say “Oh? What exactly is it trying to maximize? Predictive accuracy full stop? Or predictive accuracy conditional on dataset D? Or is it instead trying to maximize reward, in which case it’d hack its reward channel if it could? Whichever one you think it is, would you not agree that it’s plausible that it might instead end up trying to maximize one of the other ones?”
To be clear, that’s definitely not what I’m arguing. I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it’s talking about. The problem is just that it’s not general enough to handle all possible ways of training a model using machine learning. Terms like base objective or inner/outer alignment are still great terms for talking about training stories that are trying to train a model to optimize for some specified objective. From “How do we become confident in the safety of a machine learning system?”:
GPT-3 was trained using self-supervised learning, which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what’s the difference between those and the way GPT-3 was trained?
First, the problem is only with outer/inner alignment—the concept of unintended mesa-optimization is still quite relevant and works just fine.
Second, the problems with applying Risks from Learned Optimization terminology to GPT-3 have nothing to do with the training scenario, the fact that you’re doing unsupervised learning, etc.
The place where I think you run into problems is that, for cases where mesa-optimization is intended in GPT-style training setups, inner alignment in the Risks from Learned Optimization sense is usually not the goal. Most of the optimism about large language models is hoping that they’ll learn to generalize in particular ways that are better than just learning to optimize for something like cross entropy/predictive accuracy. Thus, just saying “if the model is an optimizer, it won’t just learn to optimize for cross entropy/predictive accuracy/whatever else it was trained on,” while true, is unhelpful.
What I like about training stories is that it explicitly asks what sort of model you want to get—rather than assuming that you want something which is optimizing for your training objective—and then asks how likely we are to actually get it (as opposed to some sort of mesa-optimizer, a deceptive model, or anything else).
Just wanted to point out that this is already something we need to worry about all the time in alignment. Calling them training stories doesn’t create such a failure mode; it just makes such failures obvious to people like you and me, who are wary of narrative explanations in science.
Yes. I have the intuition that training stories will make this problem worse. But I don’t think my intuition on this matter is trustworthy (what experience do I have to base it on?) so don’t worry about it. We’ll try it and see what happens.
(to explain the intuition a little bit: With inner/outer alignment, any would-be AGI creator will have to face up to the fact that they haven’t solved outer alignment, because it’ll be easy for a philosopher to find differences between the base objective they’ve programmed and True Human Values. With training stories, I expect lots of people to be saying more sophisticated versions of “It just does what I meant it to do, no funny business.”)
Yeah, agreed. It’s true that GPT obeys the objective “minimize the cross-entropy loss between the output and the distribution of continuations in the training data.” But this doesn’t mean it doesn’t also obey objectives like “write coherent text”, to the extent that we can tell a useful story about how the training set induces that behavior.
(It is amusing to me how our thoughts immediately both jumped to our recent hobbyhorses.)
This is the correct answer.
This is correct, but non-problematic in my mind. If data wasn’t in the training dataset, then yes, there is no fact of the matter as to what training signal GPT-3 received when training on it. We can talk about what training signal GPT-3 counterfactually would have received had it been trained on this data, but there is no answer to the question in the actual world.
Why do you choose answer 3 instead of answer 4? In some sense answer 3 uses the sampling weights that the developers intended, but answer 4 describes what actually happened.
I think option 4 rests on a confusion about what people mean by “the GPT-3 training data.” If someone said “there are strings of words found in the GPT-3 training data that GPT-3 never saw,” I would tell them that they don’t know what the words in that sentence mean. When an AI researcher speaks of “the GPT-3 training data,” they are talking about the data that GPT-3 actually saw. There’s data that OpenAI collected which GPT-3 didn’t see, but that’s not what the words “the GPT-3 training data” refer to.
Ahhh, OK. Then perhaps I was just using the wrong words; it sounds like what I meant to refer to by 4 was the same as what you meant to refer to by 3.