Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it’s unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in its weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representation will be imperfect and can thus be improved upon at inference time by a deceptive agent (or an agent with some kind of additional pointer). This of course depends on how much inference-time compute and memory/context is available to the agent.
So I’m imagining the agent doing reasoning like:

Misaligned goal --> I should get high reward --> Behavior aligned with reward function
and then I’m hypothesizing that whatever the first misaligned goal is, it requires some amount of complexity to implement, and you could just get rid of it and make “I should get high reward” the terminal goal. (I could imagine this being false, though, depending on the details of how terminal and instrumental goals are implemented.)
I could also imagine something more like:
Misaligned goal --> I should behave in aligned ways --> Aligned behavior
and then the simplicity bias pushes towards alignment. But if there are outer alignment failures then this incurs some additional complexity compared with the first option.
Or a third, perhaps more realistic option is that the misaligned goal leads to two separate drives in the agent: “I should get high reward” and “I should behave in aligned ways”, and that the question of which ends up dominating when they clash will be determined by how the agent systematizes multiple goals into a single coherent strategy (I’ll have a post on that topic up soon).
Why would the agent reason like this?

Because of standard deceptive alignment reasons (e.g. “I should make sure gradient descent doesn’t change my goal; I should make sure humans continue to trust me”).
I think you don’t have to reason like that to avoid getting changed by SGD. Suppose I’m being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
Maybe this is compatible with what you had in mind! It’s just not something that I think of as “high reward.”
And maybe there’s some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust… but that feels quite contingent to me.
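To make the “treading water” point concrete, here is a minimal sketch, assuming a standard GAE-style advantage computation of the kind PPO uses; the dog-seeing numbers and horizon are made up for illustration and are not from any setup described above. The point it shows: behavior whose reward stream matches what the value head already predicts gets near-zero advantage.

```python
import numpy as np

# Minimal sketch of the "treading water" idea under PPO-style updates.
# All numbers are hypothetical; this illustrates the mechanism, not any
# particular training setup discussed above.

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    `values` has one extra entry: the critic's bootstrap value for the
    state after the last reward.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better did this step go than the critic expected?
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

gamma = 0.99
v = 10.0                                  # critic expects dog-sightings to continue at the usual rate
values = np.full(6, v)                    # 5 states plus a bootstrap value
r_treading = np.full(5, (1 - gamma) * v)  # reward stream consistent with that expectation
r_extra = r_treading + 0.5                # proactively seeking out extra dogs

print(gae_advantages(r_treading, values, gamma))  # ~0 everywhere
print(gae_advantages(r_extra, values, gamma))     # positive at every step
```

Since the PPO surrogate objective roughly scales each action’s log-probability gradient by its advantage, near-zero advantages mean near-zero updates to whatever computations produced those actions, which is one way of cashing out “behaving in conformance with the advantage function implied by my value head.”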
To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
I think this depends sensitively on whether the “actor” and the “critic” in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that “treading water” is in fact a negative-advantage action (unless there’s some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic’s responses will depend on whether its goals are indexical or not (if they are, they’re different from the actor’s goals; if not, they’re the same) and on how easily it can coordinate with the actor. Or they could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of output produced by a single coherent agent. But then the critic doesn’t need to produce a value function that’s consistent with historical events, because an actor and a critic that are working together could gradient-hack their way into all sorts of weird equilibria.
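Since the distinction between these three setups may not be familiar, here is a minimal sketch of what they could look like as PyTorch modules; the layer sizes and exact architectures are made up for illustration, and real systems would be far larger.

```python
import copy
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 16, 4, 64  # hypothetical sizes

def trunk():
    return nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.Tanh(),
                         nn.Linear(HIDDEN, HIDDEN), nn.Tanh())

# (a) Two separate models: independently initialized weights. Absent some
# coordination, the critic just reports its honest value estimates.
actor_a = nn.Sequential(trunk(), nn.Linear(HIDDEN, ACT_DIM))
critic_a = nn.Sequential(trunk(), nn.Linear(HIDDEN, 1))

# (b) Two copies of the same model: identical starting weights, then trained
# on different objectives (policy loss vs. value loss).
base = trunk()
actor_b = nn.Sequential(base, nn.Linear(HIDDEN, ACT_DIM))
critic_b = nn.Sequential(copy.deepcopy(base), nn.Linear(HIDDEN, 1))

# (c) Two heads sharing activations: one trunk, one forward pass, two outputs.
class SharedActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = trunk()
        self.policy_head = nn.Linear(HIDDEN, ACT_DIM)
        self.value_head = nn.Linear(HIDDEN, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.value_head(h)
```

In setup (c) the value estimates and the action choices are computed from the same activations, which is why it is natural to treat them as one agent; in setup (a) they are not, which is why the critic’s honest estimates are the default absent coordination.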
Misaligned goal --> I should get high reward --> Behavior aligned with reward function
The shortest description of this thought doesn’t include “I should get high reward” because that’s already implied by having a misaligned goal and planning with it.
In contrast, having only the goal “I should get high reward” may add description length, as Johannes said. If so, the misaligned goal could well be as simple as, or simpler than, the high-reward goal.
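To spell out the description-length comparison being made here (a rough sketch; treating a policy’s description length as an additive sum of a goal component and a shared planning component is itself a simplifying assumption): write $K(\cdot)$ for description length in the weights, $G_{\text{mis}}$ for the misaligned goal, and $R_{\text{ptr}}$ for a pointer to the outer reward function.

$$
K(\text{deceptive policy}) \approx K(G_{\text{mis}}) + K(\text{planner}),
\qquad
K(\text{reward-terminal policy}) \approx K(R_{\text{ptr}}) + K(\text{planner}).
$$

Since “I should get high reward” falls out of planning with $G_{\text{mis}}$ rather than being stored separately, the planner term is shared, and the comparison reduces to $K(G_{\text{mis}})$ versus $K(R_{\text{ptr}})$. The hypothesis behind the first reasoning chain is that deleting the misaligned goal saves complexity; the point here (and in the top comment) is that a sufficiently simple misaligned goal can make $K(G_{\text{mis}}) \le K(R_{\text{ptr}})$, in which case the simplicity argument does not favor reward as the terminal goal.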