Ah; this does seem to be an unfortunate confusion.
I didn’t intend for ‘utility’ and ‘reward’ to become terminology – that’s what ‘mesa-’ and ‘base’ objectives are for. I wasn’t aware of the terms being used in the technical sense you describe in your comment, and I wanted utility and reward as friendlier, more familiar words for this intuition-building post. I’m not currently inclined to rewrite the whole thing with different words because of this clash, but I could add a footnote to clear this up. If the utility/reward distinction in your sense becomes accepted terminology, I’ll think about rewriting this.
That said, the distinctions we’re drawing appear to be similar. In your terminology, a utility-maximising agent has an internal representation of a goal that it pursues, whereas a reward-maximising agent lacks a rich internal goal representation and instead has a kind of pointer to the external reward signal. To me this suggests that your utility/reward distinction tracks a very similar, if not identical, internal/external distinction to the one I want to track, just with a difference in emphasis. When either of us says ‘utility ≠ reward’, I think we mean the same distinction, but what we want to draw from it differs. Would you disagree?
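To check that I’m reading you right, here is a rough sketch of the two agent types as I understand them (purely illustrative; all the names are hypothetical):

```python
# Illustrative only: the utility maximiser carries its own goal representation and
# scores *imagined outcomes* with it; the reward maximiser only carries a pointer to
# the external reward channel and scores actions by what that channel will report.

def choose_as_utility_maximiser(actions, predict_outcome, internal_utility):
    # Needs a world model plus an internal goal representation.
    return max(actions, key=lambda a: internal_utility(predict_outcome(a)))

def choose_as_reward_maximiser(actions, expected_reward):
    # Needs only the (expected) output of the external reward signal.
    return max(actions, key=expected_reward)
```

The two rules only come apart when the predicted outcomes and the reward signal disagree, e.g. when the reward channel can be tampered with.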
To me, it seems like the two distinctions are different. There seem to be three levels to distinguish:
1. The reward (in the reinforcement learning sense) or the base objective (example: inclusive genetic fitness for humans).
2. A mechanism in the brain that dispenses pleasure or provides a proxy for the reward (example: pleasure in humans).
3. The actual goal/utility that the agent ends up pursuing (example: a reflective equilibrium for some human’s values, which might have nothing to do with pleasure or inclusive genetic fitness).
The base objective vs mesa-objective distinction seems to be about (1) vs a combination of (2) and (3). The reward maximizer vs utility maximizer distinction seems to be about (2) vs (3), or maybe (1) vs (3).
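To make the three levels concrete, here is a toy sketch (entirely hypothetical names, just for illustration) in which each level is a separate scoring function, so they can rank the same world differently:

```python
# Toy illustration of the three levels; none of these names come from the post.

def base_objective(world):
    # Level (1): the reward / base objective that the training process scores.
    return world["descendants"]

def pleasure_proxy(world):
    # Level (2): an internal proxy mechanism, e.g. pleasure.
    return world["pleasure"]

def mesa_objective(world):
    # Level (3): the goal the agent actually ends up pursuing.
    return world["values_satisfied"]

world = {"descendants": 0, "pleasure": 5, "values_satisfied": 2}
print(base_objective(world), pleasure_proxy(world), mesa_objective(world))  # -> 0 5 2
```

Because the three functions can disagree about the same world, the (1)-vs-(2)+(3) cut and the (2)-vs-(3) cut really are different distinctions.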
Depending on the agent under consideration, only some of these levels may be present:

- A “dumb” RL-trained agent that engages in reward gaming: only level (1), and there is no mesa-optimizer.
- A “dumb” RL-trained agent that engages in reward tampering: only level (1), and there is no mesa-optimizer.
- A paperclip maximizer built from scratch: only level (3), and there is no mesa-optimizer.
- A relatively “dumb” mesa-optimizer trained using RL might have just (1) (the base objective) and (2) (the mesa-objective). This kind of agent would be incentivized to tamper with its pleasure circuitry (in the sense of (2)), but wouldn’t be incentivized to tamper with its RL-reward circuitry. (Example: rats wirehead to give themselves MAX_PLEASURE, but don’t self-modify to delude themselves into thinking they have left many descendants.)
- If the training procedure somehow coughs up a mesa-optimizer that doesn’t have a “pleasure center” in its brain (I don’t know how this would happen, but it seems logically possible), there would just be (1) (the base objective) and (3) (the mesa-objective). This kind of agent wouldn’t try to tamper with its utility function (in the sense of (3)), nor would it try to tamper with its RL-reward/base-objective to delude itself into thinking it has high rewards. (A toy sketch contrasting these last two cases follows this list.)
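Here is the toy sketch referenced above (hypothetical names, purely for illustration): an agent whose objective is the level-(2) proxy prefers a world where that register is set to MAX_PLEASURE, while an agent whose objective is a level-(3) goal about the external world gains nothing from the same tampering.

```python
# Contrast of tampering incentives; everything here is a made-up toy.

MAX_PLEASURE = 10**9

def level2_score(world):
    return world["pleasure_register"]   # cares about the internal proxy itself

def level3_score(world):
    return world["paperclips"]          # cares about the external world, not the register

normal     = {"pleasure_register": 7,            "paperclips": 3}
wireheaded = {"pleasure_register": MAX_PLEASURE, "paperclips": 3}

print(level2_score(wireheaded) > level2_score(normal))  # True: wireheading looks great to (2)
print(level3_score(wireheaded) > level3_score(normal))  # False: wireheading buys (3) nothing
```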
ETA: Here is a table that shows these distinctions varying independently:
|  | Utility maximizer | Reward maximizer |
|---|---|---|
| Optimizes for base objective (i.e. mesa-optimizer absent) | Paperclip maximizer | “Dumb” RL-trained agent |
| Optimizes for mesa-objective (i.e. mesa-optimizer present) | Mesa-optimizer with no “pleasure center” (last example above) | Mesa-optimizer that wireheads its pleasure circuitry (the rat-like example above) |
The reward vs. utility distinction in the grandparent has existed for a while; see, for example, Learning What to Value.