I’m not going to comment on “who said what when”, as I’m not particularly interested in the question myself, though I think the object-level point here is important:
> This makes the nonstraightforward and shaky problem of getting a thing into the AI’s preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there.
The way I would phrase this is that what you care about is the relative complexity of the objective conditional on the world model. If you’re assuming that the model is highly capable and trained in a highly diverse environment, then you can assume that its world model is capable of effectively modeling anything in the world (e.g. anything that might appear in webtext). But the question remains: what is the “simplest” goal (according to the inductive biases) that can be pointed to in the world model such that the resulting mesa-optimizer has good training performance?
The most rigorous version of this sort of analysis that exists is probably here, where the key question is how to find a prior (that is, a set of inductive biases) such that the desired goal has lower complexity conditional on the world model than the undesired goal. Importantly, both of them will be pretty low relative to the world model, since the vast majority of the complexity is in the world model.
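To spell that out a bit (in my own rough notation, not notation from the linked analysis): writing W for the world model, g for a candidate goal, and C(⋅ ∣ ⋅) for description length under the training process’s inductive biases, the complexity of a mesa-optimizer roughly decomposes into the complexity of the world model plus the complexity of pointing at its goal within that world model, and the condition we want the prior to satisfy is that the desired pointer is the cheaper one:

```latex
% Rough sketch; the decomposition is approximate and the notation is mine.
\[
  C(M_g) \;\approx\; C(W) + C(g \mid W),
  \qquad
  C(g_{\text{desired}} \mid W) \;<\; C(g_{\text{undesired}} \mid W) \;\ll\; C(W),
\]
\[
  g^{*} \;=\; \arg\min_{\,g \,:\, \text{good training performance}} \; C(g \mid W).
\]
```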
Furthermore, the better the world model, the less complexity it takes to point to anything in it. Thus, as we build more powerful models, it will look like everything has lower complexity. But importantly, that’s not actually helpful! What you care about is not reducing the complexity of the desired goal, but reducing its complexity relative to undesired goals, since (modulo randomness due to path-dependence) what you actually get is the maximum a posteriori model, the “simplest model that fits the data.”
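As a toy illustration of that point (the numbers and names below are entirely made up): if a better world model cuts the cost of pointing at every goal, the maximum-a-posteriori choice doesn’t move unless the relative ordering does.

```python
# Toy illustration with made-up numbers: improving the world model makes every
# goal cheaper to point to, but the MAP choice depends only on which goal is
# relatively simplest, so the winner doesn't change unless the gap does.

def map_choice(conditional_complexities):
    """Return the goal with the smallest description length C(g | W), i.e. the
    maximum-a-posteriori goal under a simplicity prior P(g) ~ 2^-C(g | W)."""
    return min(conditional_complexities, key=conditional_complexities.get)

# Hypothetical values of C(goal | world model), in bits.
weak_world_model   = {"aligned": 1000, "deceptive": 400}
strong_world_model = {"aligned": 100,  "deceptive": 40}   # everything got cheaper...

print(map_choice(weak_world_model))    # -> deceptive
print(map_choice(strong_world_model))  # -> deceptive  (...but the winner is unchanged)
```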
Similarly, the key arguments for deceptive alignment rely on the set of objectives aligned with human values being harder to point to than the set of all long-term objectives. The key problem is that any long-term objective is compatible with good training performance via deceptive alignment (the model will reason that it should play along now in order to pursue its long-term objective later), such that the total probability of that set under the inductive biases swamps the probability of the aligned set. And this holds despite the fact that human values do in fact get easier to point to as your model gets better, because what matters, and what isn’t necessarily changing, is the relative difficulty.
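In the same notation as above (again just my gloss on the argument), the worry is about the total prior mass of the long-term set rather than the complexity of any single deceptive goal:

```latex
% G_LT: the set of long-term goals (nearly all of which yield good training
% performance via deceptive alignment); G_aligned: goals aligned with human values.
\[
  P(\text{deceptive})
  \;=\; \sum_{g \in G_{\mathrm{LT}}} 2^{-C(g \mid W)}
  \;\gg\;
  \sum_{g \in G_{\mathrm{aligned}}} 2^{-C(g \mid W)}
  \;=\; P(\text{aligned}),
\]
% and a better W can shrink every C(g | W) on both sides without changing
% which side dominates.
```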
That being said, I think there is an interesting update to be had on the relative complexity of different goals from the success of LLMs, which is that a pure prediction objective might actually have pretty low relative complexity. And that’s precisely because prediction seems substantially easier to point to than human values, even though both get easier to point to as your world model gets better. But of course the key question is whether prediction is easier to point to than a deceptively aligned objective, which is unclear and I think could go either way.
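Stated in the same (again, my own) notation, the LLM evidence bears on the first comparison below, while the second is the one that remains open:

```latex
\[
  C(g_{\text{prediction}} \mid W) \;\ll\; C(g_{\text{human values}} \mid W),
  \qquad
  C(g_{\text{prediction}} \mid W) \;\lessgtr\; C(g_{\text{deceptive}} \mid W) \;?
\]
```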