(Context: my last post was trying to patch a certain naive strategy for AI alignment, but I didn’t articulate clearly what the naive strategy is. I think it’s worth explaining the naive strategy in its own post, even though it’s not a novel idea.)
Suppose that I jointly train an AI to do some task (e.g. make money for me) and to answer a wide range of questions about what is happening in the world (e.g. “why did Alice just wire $1000 into my bank account?” or “what is Bob thinking right now?”).
I generate training data for the QA task in a really simple way: I choose a subset of questions that humans are able to reliably answer, pair them with the humans’ answers, and use those pairs as a training set for supervised learning. I’ll call this the naive training strategy.
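To make the setup concrete, here is a minimal sketch of what the naive training strategy could look like in PyTorch. The architecture, toy data, and equal weighting of the two losses are illustrative assumptions on my part, not part of the proposal.

```python
# A minimal, self-contained sketch of the naive training strategy: one model
# with a shared trunk, jointly trained on a task objective and on supervised
# QA where the labels are answers humans could reliably provide.
# The toy data, architecture, and loss weighting are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self, obs_dim=16, n_actions=4, n_answers=8):
        super().__init__()
        self.trunk = nn.Linear(obs_dim, 32)        # shared "world model" features
        self.task_head = nn.Linear(32, n_actions)  # used to perform the task
        self.qa_head = nn.Linear(32, n_answers)    # used to answer questions

    def forward(self, obs):
        h = torch.relu(self.trunk(obs))
        return self.task_head(h), self.qa_head(h)

model = JointModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    obs = torch.randn(64, 16)                   # stand-in for observations
    task_labels = torch.randint(0, 4, (64,))    # stand-in for task supervision
    human_answers = torch.randint(0, 8, (64,))  # human answers to the chosen questions

    task_logits, qa_logits = model(obs)
    task_loss = F.cross_entropy(task_logits, task_labels)
    qa_loss = F.cross_entropy(qa_logits, human_answers)  # supervised QA on human-answerable questions
    loss = task_loss + qa_loss

    opt.zero_grad()
    loss.backward()
    opt.step()
```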
I’d like for my AI to tell me everything it knows. If the AI bought a stock because it expects a merger announcement soon, I want it to tell me about the predicted merger announcement. If the AI predicts a merger announcement because it inferred that executives of the companies have been in extensive talks over the last month, I want it to tell me about those talks.
I’m not asking the AI to explain why it made a given decision; I’m asking it to tell me as much as it can about the world. The important property is that if the AI “knows” something and uses that knowledge to perform the task well, then it also uses that knowledge to answer questions well.
Why might this work? The hope is that “answer questions honestly to the best of your ability” is a natural thing for our AI to learn — that there is some simple way to translate from the AI’s model of the world into natural language and to honestly report what it believes. If our training dataset is good, then this policy will score well, and we can hope that SGD will find it. I’ll call this the intended policy.
Why might this not work? The concern is that “predict how a human would answer questions” is also a natural thing for our AI to learn, especially if the AI is doing a task that already requires predicting humans. Predicting humans also gets a low loss on the training set, but it generalizes poorly once we start asking our AI questions that a human couldn’t have answered on their own.
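As a toy illustration of that divergence, reusing the merger example above (all facts made up): the two policies agree on every question a human labeler could answer, and only come apart on the rest.

```python
# Toy illustration: the intended policy and the instrumental "predict the
# human" policy are indistinguishable on human-answerable training questions,
# but diverge on questions humans couldn't have answered. All facts are made up.
world_model = {                      # everything the AI knows
    "why did the AI buy the stock?": "it predicts a merger announcement",
    "why does it predict a merger?": "the executives have been in extensive talks",
}
human_knowledge = {                  # the subset a human labeler could answer
    "why did the AI buy the stock?": "it predicts a merger announcement",
}

def intended_policy(question):
    # Honestly reports the AI's own beliefs.
    return world_model.get(question, "I don't know")

def instrumental_policy(question):
    # Reports whatever it predicts the human labeler would say.
    return human_knowledge.get(question, "I don't know")

training_questions = list(human_knowledge)   # training set = human-answerable questions

# Both policies get a perfect score on the training set...
assert all(intended_policy(q) == instrumental_policy(q) for q in training_questions)

# ...but only the intended policy generalizes to questions humans couldn't answer.
print(intended_policy("why does it predict a merger?"))      # the executives have been...
print(instrumental_policy("why does it predict a merger?"))  # I don't know
```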
If we’re worried about this, we could use a different loss function than “predict what humans would say.” But regardless of what loss function we choose, the policy could still be trying to game it. If SGD learns to “game the loss function,” then we’ll generalize poorly in any case where humans can’t distinguish a good question-answerer from a bad one.
I’ll call “game the loss function” the instrumental policy. It answers questions well only because doing so is instrumentally useful for getting a low loss, and so it will start answering questions badly if that situation changes. (This is closely related to deceptive alignment, but I think the concept makes sense without defining alignment or optimizers; there’s also a difference in emphasis in that I’m often considering models which are “intrinsically motivated” to game the loss function rather than doing so in order to gain influence. I first introduced this term in Inaccessible Information.)
What I’m doing now
I’m trying to dig into a bunch of reasons why the naive training strategy might fail, and to understand whether there is a way to modify the naive strategy to avoid those problems.
In my last post I discussed two reasons that the naive training strategy might learn the instrumental policy instead of the intended policy:
There may be predictable errors in the training data, such that the instrumental policy actually gets a lower loss than answering honestly, because it responds strategically to those errors (see the toy illustration after this list).
If the AI already needs to make predictions about humans, then it can reuse that machinery to game the loss function, whereas it may have to build new machinery to translate its thoughts into natural language.
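A toy numerical illustration of the first point, with hypothetical labels: a single predictable error in the training data makes the label-predicting policy strictly better than the honest policy on the training set.

```python
# Toy illustration of failure reason 1: a single predictable labeling error
# makes "predict the human label" strictly better than "answer honestly"
# on the training set. All data is hypothetical.
training_set = [
    # (question, true_answer, human_label) -- the human mislabels q3
    ("q1", "yes", "yes"),
    ("q2", "no",  "no"),
    ("q3", "yes", "no"),
]

def training_loss(policy):
    # 0/1 loss against the human-provided labels.
    return sum(policy(q, truth) != label for q, truth, label in training_set)

def honest(question, truth):
    return truth                                  # the intended policy

def label_predictor(question, truth):
    return "no" if question == "q3" else truth    # exploits the predictable error

print(training_loss(honest))           # 1 -- penalized for the human's mistake
print(training_loss(label_predictor))  # 0 -- strictly lower loss
```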
I tried to describe a modified algorithm that avoids those pitfalls. I don’t think I succeeded, but I do feel reasonably optimistic that some approach can address these two problems.
Unfortunately, there are many further reasons that the naive training strategy could fail. I’m currently spending time trying to understand those issues and figure out which, if any, are most likely to represent fundamental roadblocks.
Where others stand
I’ve had a lot of conversations about alignment with ML researchers over the last 7 years. My impression is that a large fraction of optimists expect to find some strategy that gets sufficiently smart models to generalize in the intended way (honestly reporting their beliefs) rather than learning to game the loss function.
The fact that this is part of the “consensus” optimism about alignment makes it particularly appealing to investigate cases where it seems hard to get the intended generalization.
At the other extreme, I think many alignment researchers are very skeptical about this entire project. I think the main way I disagree with them is methodological: before concluding that the problem is hard, I want to try to find the simplest hard case.
Relationship to my other work
I’ve traditionally avoided the naive training strategy, in large part because I’m scared that it will learn the instrumental policy instead of the intended policy.
I still believe that you need to do something like iterated amplification and imitative generalization in order to avoid this problem. However, I think that a working strategy that combines all of these ideas may have more in common with the naive training strategy than I’d initially expected.
For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.
From the comments
One objection: if the AI answers questions as text, there is a lot of freedom in wording. Many strings of text are correct answers, and the AI has to pick the one the human would use; to predict how a human would word an answer, it needs a fairly good understanding of how humans think.
I agree you have to do something clever to make the intended policy plausibly optimal.
The first part of my proposal in section 3 here was to avoid using “imitate humans,” and to instead learn a function “Answer A is unambiguously worse than answer B.” Then we update against policies only when they give unambiguously worse answers.
(I think this still has a lot of problems; it’s not obvious to me whether the problem is soluble.)
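A rough sketch of what that training signal could look like, where the learned comparator worse_fn is a hypothetical stand-in for the “unambiguously worse” judgment; only the masking idea comes from the proposal above, the rest is illustrative.

```python
# Sketch of updating against the policy only when its answer is judged
# unambiguously worse than a reference answer. The comparator `worse_fn`
# and all tensors here are stand-ins; only the masking idea comes from the
# proposal above.
import torch
import torch.nn.functional as F

def masked_qa_loss(qa_logits, model_answers, reference_answers, worse_fn):
    """Cross-entropy toward the reference answer, applied only where the
    model's answer is unambiguously worse than the reference."""
    mask = worse_fn(model_answers, reference_answers).float()
    per_example = F.cross_entropy(qa_logits, reference_answers, reduction="none")
    # Ambiguous or equally-good answers contribute no gradient, so the model
    # is not pushed toward imitating the human's exact wording.
    return (mask * per_example).sum() / mask.sum().clamp(min=1.0)

# Tiny demo with a stand-in comparator: answer index 0 is treated as
# unambiguously worse than any other answer.
qa_logits = torch.randn(4, 8)
model_answers = torch.tensor([0, 3, 3, 5])
reference_answers = torch.tensor([2, 2, 3, 5])
worse_fn = lambda a, b: (a == 0) & (b != 0)
print(masked_qa_loss(qa_logits, model_answers, reference_answers, worse_fn))
```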