Because of standard deceptive alignment reasons (e.g. “I should make sure gradient descent doesn’t change my goal; I should make sure humans continue to trust me”).
I think you don’t have to reason like that to avoid getting changed by SGD. Suppose I’m being updated by PPO, with reinforcement events around navigating to see dogs. To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
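To make the “treading water” picture concrete, here’s a minimal numerical sketch (toy numbers and a one-step advantage estimate, my own illustrative assumptions rather than anything from an actual run) of why behavior that matches the value head’s predictions leaves PPO with essentially nothing to update:

```python
import numpy as np

# Hypothetical numbers: if the actor's realized rewards match what the value
# head already predicts, the one-step advantages are zero, and PPO's clipped
# surrogate objective (which scales with the advantage) gives no update --
# the "treading water" case. If the agent slacks off entirely, advantages go
# negative and the policy does get pushed around.

gamma = 0.99

def advantages(rewards, values, next_values):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards + gamma * next_values - values

values      = np.array([1.0, 1.0, 1.0])   # value head's predictions
next_values = np.array([1.0, 1.0, 1.0])

# Case 1: seeing dogs about as often as the value head expects.
rewards_tread = values - gamma * next_values          # exactly as predicted
A_tread = advantages(rewards_tread, values, next_values)  # -> zeros

# Case 2: never seeing dogs at all.
rewards_slack = np.zeros(3)
A_slack = advantages(rewards_slack, values, next_values)  # -> negative

# PPO surrogate loss: -E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)].
ratio = 1.0  # on-policy, so pi_new / pi_old starts at 1
loss_tread = -np.mean(np.minimum(ratio * A_tread,
                                 np.clip(ratio, 0.8, 1.2) * A_tread))
loss_slack = -np.mean(np.minimum(ratio * A_slack,
                                 np.clip(ratio, 0.8, 1.2) * A_slack))

print("advantages, treading water:", A_tread)      # zeros
print("PPO loss, treading water:  ", loss_tread)   # zero -> no policy update
print("advantages, slacking off:  ", A_slack)      # negative
print("PPO loss, slacking off:    ", loss_slack)   # nonzero -> SGD moves the policy
```

The only point of the sketch is that PPO’s policy gradient scales with the advantage, so an agent whose realized rewards track the critic’s predictions receives only tiny updates, whether or not it is reward-seeking in any stronger sense.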
Maybe this is compatible with what you had in mind! It’s just not something that I think of as “high reward.”
And maybe there’s some self-fulfilling prophecy where we trust models which get high reward, and therefore they want to get high reward to earn our trust… but that feels quite contingent to me.
To preserve my current shards, I don’t need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means “treading water” and seeing dogs sometimes in situations similar to historical dog-seeing events.
I think this depends sensitively on whether the “actor” and the “critic” in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that “treading water” is in fact a negative-advantage action (unless there’s some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic’s responses will depend on whether its goals are indexical or not (if they are, they’re different from the actor’s goals; if not, they’re the same) and on how easily it can coordinate with the actor. Or they could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of outputs produced by a single coherent agent. But then the critic doesn’t need to produce a value function that’s consistent with historical events, because an actor and a critic that are working together could gradient hack their way into all sorts of weird equilibria.
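For concreteness, here’s a rough sketch (the architecture details are illustrative assumptions, not taken from any particular implementation) of that third configuration, where the “actor” and “critic” are two heads on a shared trunk:

```python
import torch
import torch.nn as nn

# Illustrative two-heads-sharing-activations setup: both the policy loss and
# the value loss backpropagate into the same trunk parameters, so neither
# head's behavior is trained independently of the other's signal.

class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Shared representation: updates driven by either head change these weights.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # "actor"
        self.value_head = nn.Linear(hidden, 1)           # "critic"

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

model = SharedActorCritic(obs_dim=8, n_actions=4)
logits, value = model(torch.randn(2, 8))
print(logits.shape, value.shape)  # torch.Size([2, 4]) torch.Size([2])
```

Because both losses flow into `trunk`, this is the case where treating the pair as one agent with two output types seems most natural; the first case above would instead be two fully separate networks with no shared parameters at all.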