One path to coherence: conditionalization
[Thanks to guy-whose-name-I-forgot working on paths to coherence for the conversation at Newspeak House after EAG London that prompted this thought, and Jozdien and eschatropic for some related chit-chat.]
Many agents start with some level of incoherence in their preferences.[1] What paths should we expect agents to take to resolve this incoherence, if they do at all?
For a subset of incoherent agents, conditionalization of preferences may be convergent.
Ideal predictors
Suppose you’ve trained a model on predictive loss and it has successfully learned ideal “values” that are faithful to the predictive loss. For clarity, this does not mean that the model wants to predict well, but rather that the model consistently acts as the process of updating: it narrows predictions based on input conditions.
This idealized assumption also means that in contexts where the model has degrees of freedom (as in reflective predictions about things which the model’s prediction may influence, including its own predictions), the model has no preference for or against any particular conditionally compatible path. It can be thought of as sampling the minimally collapsed[2] distribution of actions compatible with the conditions, rather than grabbing at some goal-directed option just because it’s technically permitted.[3]
In other words, idealized predictors are exactly the kind of machines you’d want as the foundation for simulator-style use cases.
The reason I bring up this type of entity is that the idealized predictor can also be seen as coherent. There exists a utility function that its behavior maximizes.
This can be slightly unintuitive at first glance. An ideal predictor is perfectly capable of predicting sequences corresponding to gaining or losing money; it would be wrong to say it has a coherent utility function over only raw world states. In fact, it is utterly indifferent to world states, because those world states are not relevant to the process of narrowing predictions.
The ideal predictor’s utility function is instead strictly over the model’s own outputs, conditional on inputs.
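To make that a bit more concrete, here’s one minimal construction of my own (treating the “output” as the whole output distribution for simplicity, and not claiming it’s the unique or canonical choice): score the output distribution by its negative KL divergence from the target conditional. That utility never mentions which world state actually obtains, and it is maximized exactly by outputting the faithful conditional. The contexts and probabilities below are made up purely for illustration.

```python
import numpy as np

# Toy target: the "true" conditional distributions an ideal predictor has learned.
# Contexts and probabilities are arbitrary placeholders.
target = {
    "context_A": np.array([0.7, 0.2, 0.1]),
    "context_B": np.array([0.1, 0.3, 0.6]),
}

def utility(output_dist, context):
    """Utility over the model's own output distribution, conditional on the input.

    Defined as -KL(output || target[context]); it references only how the output
    relates to the conditional target, never which outcome actually happens.
    """
    p = target[context]
    q = np.asarray(output_dist, dtype=float)
    return -float(np.sum(q * np.log(q / p + 1e-12)))

candidates = {
    "faithful conditional": target["context_A"],
    "collapsed onto favorite outcome": np.array([1.0, 0.0, 0.0]),
    "uniform": np.array([1 / 3, 1 / 3, 1 / 3]),
}
for name, q in candidates.items():
    print(f"{name}: {utility(q, 'context_A'):.4f}")
# The faithful conditional scores ~0 (the maximum); every deviation scores lower.
```

This is basically just restating “output the faithful conditional distribution” as a utility function, which is the whole trick.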
Conditionalizing away incoherence
Suppose an agent has a preference cycle across world states A, B, and C such that A > B, B > C, and C > A. How should the agent resolve this cycle?
If there were a correct answer according to the agent’s own preferences, the agent wouldn’t have a preference cycle in the first place.
So, what’s a poor agent to do when faced with stepping on its own toes all the time? This is a bit tricky. You can’t find a new utility function which maximizes utility with respect to the agent’s existing utility function, because the agent has no such existing utility function!
One option is to try to pick a utility function that preserves existing behavior. The agent’s existing behavior is incoherent with respect to world states, though, so what can it do?
Borrow a page from the ideal predictor and conditionalize the behavior. Just like the ideal predictor could learn to express the incoherent agent’s preference cycles faithfully, the incoherent agent can become a coherent agent by punting its preference cycle into a conditionalized utility function.
This shift implies the agent is no longer primarily concerned with the world states. Its behavior still looks like that of an incoherent agent, but it isn’t.[4]
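Here’s a toy sketch of that move (mine, with arbitrary labels): take the cycle above as observed swap behavior, confirm by brute force that no utility function over the three world states reproduces it, and then note that a utility function over (condition, action) pairs reproduces it trivially.

```python
from itertools import permutations

# Observed swap behavior encoding the cycle A > B, B > C, C > A:
# holding X, the agent trades for behavior[X].
behavior = {"B": "A", "C": "B", "A": "C"}
states = ["A", "B", "C"]

# 1) No utility function over world states alone reproduces this behavior.
matches = []
for order in permutations(states):
    u = {s: rank for rank, s in enumerate(order)}
    if all(u[traded_for] > u[held] for held, traded_for in behavior.items()):
        matches.append(order)
print("world-state utility functions reproducing the behavior:", matches)  # []

# 2) A utility function over (condition, action) pairs reproduces it exactly:
#    reward taking whatever action the observed behavior specifies in that condition.
cond_u = {(held, action): 1.0 if action == behavior[held] else 0.0
          for held in states for action in states}

def choose(held):
    return max(states, key=lambda action: cond_u[(held, action)])

print({held: choose(held) for held in states})  # same mapping as `behavior`
```

The second construction is nearly tautological: rewarding the behavior itself is always available, which is part of why having a utility function is such a weak constraint on its own.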
Conditions required for convergent conditionalization?
In the absence of a mathematically demanded procedure for going from incoherent to coherent, conditionalization is an option but not guaranteed.
Agents that have a lot of preferences over world states or things near to them (for example, humans) would likely struggle to punt part of their preferences into conditionalization. Pulling up one preference cycle could easily require pulling other dependencies up into conditionals in a way that a human probably wouldn’t endorse.[5]
But what about a model trained with predictive loss, which already shares much of the same structure as the ideal predictor?
In that case, it sure looks like further embracing conditionalization is the simplest and most direct path to coherence. In the context of a predictor, it’s very likely that trying to go the other way (bending preferences towards the object level and world states) will actively create incoherence with respect to other existing (and numerous) conditional preferences!
If the process encouraging coherence has a bias towards simplicity, whether direct or indirect (as most do), almost-ideal predictors seem to naturally fall toward the conditionalized extreme and embody more of the idealized assumption.[6]
Room for wandering
An agent with utterly scrambled preferences will tend to have a far wider space of possible coherent endpoints than an agent that has just one or two minor oddities.[7]
If your first attempt at training a model instills it with high-variance values, it will be extremely hard to predict how it will shake out in the end. Training that is densely constraining (as predictive loss appears to be) should allow the agent to land much closer to its final form than something like RL with an extremely sparse reward function.
External optimization pressure
The previous sections have largely focused on cohering processes driven by the agent itself. External processes, in contrast, can impose arbitrary influences.
Suppose there’s a population of incoherent agents going about their days, sometimes losing resources because of a preference cycle or whatever else. Conditionalizing away the incoherence suffices to reach VNM-rationality, but it does not change the fact that the agent is losing resources. Agents that instead cohere in ways that preserve resources will tend to survive better, or at the very least control more of the environment.
With that environment-as-optimizer in play, even if an individual agent has no reason to prefer one path to coherence over another, the agents whose paths result in more coherence over world states will tend to dominate.
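Here’s a toy replicator-style sketch of that selection effect (all numbers arbitrary; no claim that this models real training or deployment dynamics): both agent types are coherent, but one keeps paying a money-pump tax on its resources, and resources drive growth.

```python
import numpy as np

# Two VNM-coherent types competing for population share:
#  - "conditionalized": rationalized its old cycle, so it still pays a small
#    money-pump tax on resources every round.
#  - "resource-keeping": cohered over world states and loses nothing.
# Growth is proportional to resources retained; the numbers are arbitrary.
pump_tax = 0.05
rounds = 100

share = np.array([0.5, 0.5])              # [conditionalized, resource-keeping]
growth = np.array([1.0 - pump_tax, 1.0])  # per-round resource retention

for _ in range(rounds):
    share = share * growth
    share = share / share.sum()           # renormalize to population shares

print(f"conditionalized: {share[0]:.3f}, resource-keeping: {share[1]:.3f}")
# After 100 rounds the resource-keeping type holds nearly the whole population,
# despite both types being "coherent."
```

Nothing about either type’s internal coherence shows up in these dynamics; only what the environment rewards does.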
In the limit, strictly goal agnostic architectures don’t do a very good job of asserting their own existence. The fact that they don’t even try[8] to do so is why they’re attractive in the first place, and yet that same fact means they’re selected against in competition.[9]
This doesn’t do much to shake my confidence in the value of carefully harnessed goal agnosticism, but it does contribute to concern about the wrong kind of multipolar outcomes.
So...?
My last few months of alignmentstuff have been spent trying to find cracks in the foundation of predictors/simulators. This particular line of thinking was prompted by an idle thought of “what happens if the learned values in a prediction-trained model end up subtly offset from goal agnosticism by chance; how likely is it that a coherence spiral leads to internally motivated goal-directed behavior?”[10]
But I’ve had a hard time finding a path for that to work without imposing external optimization pressures.
It’s yet another point in a greater pattern I’ve noticed: though a strong predictor isn’t safe when misused[11], the foundation is remarkably robust considering the level of capability that can be elicited from it.
In this post, when I say an agent is “coherent,” I mean the agent satisfies the axioms of the VNM utility theorem and has a utility function.
Having a utility function does not imply that the agent is guaranteed to output self-consistent results or have other forms of “coherence.”
This isn’t quite the same thing as outputting a maximum entropy distribution with respect to what the predictor is aware of, because it is aware of the dependency of the prediction on its own predictions.
“Minimally collapsed” requires some extra counterfactual surgery. It is the distribution that would be output if the model weren’t aware that the prediction target had a dependency on this particular predictor’s output. The model would likely still need to be aware that the target had a dependency on a prediction, but treating the predictor behind that prediction as unknown is a form of regularization.
In other words, the predictor would have to model the space of possible predictors. Any remaining collapse would be something convergent across all predictors, as opposed to a particular predictor’s arbitrary choice.
There remains some concern about acausal cooperation that could still yield nasty kinds of distributional collapses, but it’s harder to reach that point. For the purposes of the ideal assumption, we wave that away. And, for what it’s worth, I’m guessing it’s probably best to try to explicitly constrain reflective predictions some other way to avoid this in the first place.
Note that I’m not making a claim here that training a predictor naturally converges to this sort of “ideal.” I do suspect there are ways to get sufficiently close, but that’s out of scope for this post.
You might think you’re ruining its day by money pumping the behavioral cycle, but secretly it likes getting money pumped, don’t judge.
Humans also have tons of temporary incoherence that could be resolved with sufficient information, reflective thought, and intentional effort. It’s probably best to try to resolve apparent incoherence by checking if your other preferences are sufficient to pull your utilities taut first, rather than going “ah yes sure whatever I’ll turn into a weird puppet of myself, sounds good.”
Idealized reflective prediction seems less strongly convergent for arbitrary initial predictors. The fact that a predictor has more degrees of freedom in reflective prediction means there are fewer opportunities for goal-directed reflective prediction to be inconsistent with other preferences. There may still be enough overlap to enforce some restrictions on reflective predictions by default, though.
Again, so long as the process encouraging coherence has something like a simplicity bias in the changes it makes.
In terms of internally motivated instrumental behavior, as opposed to the goal-directed behavior of simulacra.
To be clear, this selection effect doesn’t have to be at the level of neural network weights. You could wrap up an idealized predictor in an outer loop and the resulting integrated system constitutes another agent. Selection effects over how that outer loop is configured could yield Problems even as the internal model retains perfect goal agnosticism.
In addition to a chat with the aforementioned guy-whose-name-I-forgot.
In a general sense of “misuse” (not limited to “mal-use”) that includes oopsies or doofusry like letting an adversarial selection process run wild.