This was a cool post; I found the core point interesting. It's very similar to gradient hacker design.
As a general approach to avoiding value drift, it does have a couple of very big issues (which I’m guessing TurnTrout already understands, but which I’ll point out for others). First very big issue: it requires that the agent basically decouple its cognition from reality whenever the relevant reward is applied. That’s only useful if the value-drift-inducing events occur only once in a while and are very predictable. If value drift occurs continuously from everyday interactions, or if it occurs unpredictably, then the strategy probably can’t be implemented without making the agent useless.
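To make the first issue concrete, here's a toy simulation (entirely my own sketch, not anything from the post; the drift model, the `run` function, and all the numbers are made up for illustration). The agent only avoids reward-induced drift on steps where it both predicts the reward event and "goes dark", so rare-and-predictable events cost it almost nothing, frequent-but-predictable events leave it useless much of the time, and unpredictable events let drift through regardless:

```python
# Toy sketch (my illustration, not TurnTrout's proposal): an agent whose
# "values" only drift when it is cognitively engaged during a reward event.
import random

def run(episodes, reward_prob, predictable):
    values = 1.0          # stand-in for the agent's current values
    useful_steps = 0      # steps where the agent actually engaged with the world
    for _ in range(episodes):
        reward_event = random.random() < reward_prob
        # The agent can only dodge drift if it can see the event coming.
        anticipates_event = reward_event and predictable
        engaged = not anticipates_event   # "going dark" = disengaging cognition
        if engaged:
            useful_steps += 1
            if reward_event:
                values += 0.1 * random.gauss(0, 1)   # reward-induced drift
    return values, useful_steps / episodes

random.seed(0)
for prob, pred in [(0.01, True), (0.5, True), (0.5, False)]:
    v, frac_useful = run(10_000, prob, pred)
    print(f"p(reward)={prob}, predictable={pred}: "
          f"drift={abs(v - 1.0):.2f}, fraction of time useful={frac_useful:.2f}")
```

In this toy model the rare-and-predictable case gets zero drift at almost no cost, the frequent-and-predictable case gets zero drift but is disengaged half the time, and the unpredictable case drifts freely.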
Second big issue: it only applies to reward-induced value drift within an RL system. That’s not the only setting in which value drift is an issue—for instance, MIRI’s work on value drift focused mainly on parent-child value drift in chains of successor AIs. Value drift induced by gradual ontology shifts is another example.
One interpretation of this phrase is that we want AI to generally avoid value drift: get good values into the AI, and then leave them alone. (This probably isn’t what you meant, but I’ll leave a comment for other readers!) For AI and for humans, value drift need not be bad. In the human case, going to anger management can be humanely good value drift. And human-aligned shards of a seed AI can deliberately steer into more situations where the AI gets rewarded while helping people, in order to reinforce their coalitional weight.
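Here's a minimal toy of that last dynamic (again my own illustration, with made-up numbers and a deliberately simplified credit-assignment rule, not a claim about how shard theory formalizes it): reward reinforces whichever shard was active, so a shard that steers into situations where helping people is also rewarded grows its own coalitional weight over time.

```python
# Toy illustration: shards as weighted "voters" whose weight grows
# when they are active during rewarded steps.
import random

random.seed(0)
weights = {"human_aligned": 1.0, "other": 1.0}

def sample_active_shard(weights):
    # Shards influence behavior in proportion to their coalitional weight.
    shards, ws = zip(*weights.items())
    return random.choices(shards, weights=ws)[0]

for step in range(1000):
    active = sample_active_shard(weights)
    # Toy assumption: the human-aligned shard deliberately steers into
    # situations where helping people is also rewarded, so it reliably
    # collects reward; the other shard only gets rewarded occasionally.
    reward = 1.0 if active == "human_aligned" else (1.0 if random.random() < 0.2 else 0.0)
    # Toy credit assignment: reward reinforces whichever shard was active.
    weights[active] += 0.05 * reward

print(weights)  # the human-aligned shard ends up with more coalitional weight
```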