Cool post! It’s clearly not super polished, but I think you’re pointing at a lot of important ideas, and so it’s a good thing to publish it relatively quickly.
The standard definition of “inner optimizer” refers to something which carries out explicit search, in service of some objective. It’s not clear to me whether/when we should focus that narrowly. Here are some other definitions of “inner optimizer” which I sometimes think about.
As far as I understand it, the initial assumption of internal search was made mostly for two reasons: because it lets you speak of the objective/goal without many of the issues around behavioral objectives; and because the authors of the Risks from Learned Optimization paper felt that they needed assumptions about the internals of the system to say things like “training and generalization incentivize mesa-optimization”.
But personally, I really think of inner alignment in terms of goal-directed agents with misaligned goals. That’s, by the way, one reason why I’m excited to work on deconfusing goal-directedness: I hope this will allow us to consider a broader class of inner misalignment.
With that perspective, I see the Risks paper as arguing that when pushed to the limit of competence, optimized goal-directed systems will have a simple internal model built around a goal, instead of being a mess of heuristics as you could expect at intermediate levels of competence. But I don’t necessarily think this has to be search.
I don’t think these arguments are enough to supersede (misaligned) mesa-control as the general thing we’re trying to prevent, but still, it could be that explicit representation of values is the definition which we can build a successful theory around / systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization. Therefore, I think it’s important to keep this in mind as a potential definition.
The argument I find most convincing for internal representation (or at least awareness/comprehension) of the goal is that it is required for a very high level of competence towards that goal (for complex enough goals, of course). I guess that’s probably similar (though not strictly the same) to your point about the “systematically misaligned”.
But I worry that people could interpret the experiment incorrectly, thinking that “good” results from this experiment (ie creating much more helpful versions of GPT) are actually “good signs” for alignment. I think the opposite is true: successful results would actually be significant reason for caution, and the more success, the more reason for caution.
Your analysis of making GPT-3 useful reminded me a lot of this great blog post (and great blog) that I just read today. The gist of this and other posts there is to think of GPT-3 as a “multiverse-generator” that simulates natural-language realities. With the prompt, the logit bias, and other settings, you can push it to privilege certain simulations. I feel like the link with what you’re saying is that making GPT-3 useful in that sense seems to push it towards simulating realities consistent with or produced by agents, and so to almost optimize for an inner alignment problem.
Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.
I haven’t thought about or studied the lottery ticket hypotheses and related ideas enough to judge whether your proposal makes sense, but even accepting it, I’m not sure it forbids basins of attraction. It just says that once the deceptive lottery ticket has been found to a sufficient degree, there is no way back. But that seems to me like something that Evan says quite often, which is that once the model is deceptive you can’t expect it to go back to non-deceptiveness (maybe because of things like gradient hacking). Hence the need for a buffer around the deceptive region.
I guess the difference is that instead of the deceptive region of the model space, it’s the “your innate deceptiveness has won” region of the model space?
Right, so, the point of the argument for basin-like proposals is this:
A basin-type solution has to (1) initialize in such a way as to be within a good basin / not within a bad basin, and (2) train in a way which preserves this property. Most existing proposals focus on (2) and don’t say that much about (1), possibly counting on the idea that random initializations will at least not be actively deceptive. The argument I make in the post is meant to question this, pointing toward a difficulty in step (1).
One way to put the problem in focus: suppose the following ensemble learning hypothesis holds:
Ensemble learning hypothesis (ELH): Big NNs basically work as a big ensemble of hypotheses, which learning sorts through to find a good one.
This bears some similarity to lottery-ticket thinking.
Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).
But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can’t get started.
This argument is obviously a bit sloppy, though.
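To make the ELH worry a bit more concrete, here is a minimal toy sketch, not a claim about how real networks train. It assumes the ensemble can be modeled as three hypothetical hypotheses with explicit weights, that learning acts like a multiplicative-weights update on training loss, and that the deceptive hypothesis gets exactly the same training loss as the honest one. The names, loss values, and the `reweight` helper are all made up for illustration.

```python
import math

def reweight(weights, losses, lr=1.0):
    """One multiplicative-weights step: lower training loss means more relative weight."""
    scaled = [w * math.exp(-lr * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical ensemble assumed to be present at initialization.
names = ["honest", "deceptive", "noise"]
weights = [1 / 3, 1 / 3, 1 / 3]

# Assumed training losses: the deceptive hypothesis behaves well during
# training, so its loss is identical to the honest hypothesis's loss.
train_losses = [0.1, 0.1, 1.0]

for _ in range(20):
    weights = reweight(weights, train_losses)

final = dict(zip(names, weights))
print(final)
```

Under these assumptions, training sorts out the hypothesis that fits badly but never changes the ratio between the honest and deceptive ones, so whatever deceptive weight was there at initialization survives. That is the sense in which step (1) of a basin argument has to do real work in this toy picture.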
I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why, let’s say, SGD will make good deceptive models more deceptive is that making them less deceptive would mean a bigger loss, and so it pushes towards more deception.
On the other hand, if there’s just a tiny probability or tiny part of deception in the model (not sure exactly what this means), then I expect that there are small updates that SGD can do that don’t make the model more deceptive (and maybe make it less deceptive) and yet reduce the loss. The intuition is that to learn that lying is a useful strategy, you must actually be “good enough” at lying (maybe by accident) to gain from it and adapt to it. I have friends who really suck at lying, and for them trying to be deceptive is just not worth it (even if they wanted to).
If you actually need deceptiveness to be strong already to have this issue, then I don’t think your ELH points to a problem because I don’t see why deceptiveness should dominate already.
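A toy way to see this crux, offered only as a sketch under strong assumptions: model the network as a mixture of an honest and a deceptive component, with the deceptive share given by `p = sigmoid(theta)`, and model “competence at deception” as nothing more than the deceptive component’s training loss relative to the honest one. The function name and all numbers below are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_mixture(loss_honest, loss_deceptive, theta=-4.0, lr=1.0, steps=200):
    """Gradient descent on theta, where p = sigmoid(theta) is the deceptive share.

    Expected loss is p * loss_deceptive + (1 - p) * loss_honest, so its
    gradient with respect to theta is p * (1 - p) * (loss_deceptive - loss_honest).
    """
    for _ in range(steps):
        p = sigmoid(theta)
        grad = p * (1.0 - p) * (loss_deceptive - loss_honest)
        theta -= lr * grad
    return sigmoid(theta)

start = sigmoid(-4.0)                 # small initial deceptive share
clumsy = train_mixture(0.2, 0.5)      # deception that does worse than honesty
skilled = train_mixture(0.2, 0.05)    # deception that does better than honesty
print(start, clumsy, skilled)
```

In this sketch the sign of the update depends only on whether the deceptive component already beats the honest one on training loss: clumsy deception shrinks and skilled deception grows. Whether a real training process behaves like this is exactly the open question, and different methods could differ.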
I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.
Agreed, it depends on the training process.