Here’s my general view on this topic:
Agents are reinforced by some reward function.
They then get more likely to do stuff that the reward function rewards.
This process, iterated a bunch, produces agents that are ‘on-distribution optimal’.
In particular, in states that are ‘easily reached’ during training, the agent will do things that approximately maximize reward (see the first sketch at the end of this section).
Some states aren’t ‘easily reached’, e.g. states where there’s a valid Bitcoin blockchain of length 20,000,000 (current length as I write is 748,728), or states where you have messed around with your own internals while not intelligent enough to know how they work.
Other states are ‘easily reached’, e.g. states where you intervene on some cause-and-effect relationships in the ‘external world’ that don’t impinge on your general training scheme. For example, if you’re being reinforced for gaining people’s approval, lying to gain approval is easily reached.
Agents will probably have to be good at means-ends reasoning to approximately locally maximize a tricky reward function.
Agents’ goals may not generalize to states that are not easily reached.
Agents’ motivations will likely generalize to states that are easily reached.
Agents’ motivations will likely be pretty coherent in states that are easily reached.
When I talk about ‘the reward function’, I mean a mathematical function from (state, action, next state) tuples to the reals, implemented in a computer.
When I talk about ‘reward’, I mean values of this function, and sometimes by extension tuples that achieve high values of the function.
When other people talk about ‘reward’, I think they sometimes mean “the value contained in the antecedent-computation-reinforcer (ACR) register” and sometimes mean “the value of the mathematical object called ‘the reward function’”, and sometimes I can’t tell which they mean. This is bad, because in edge cases the two have pretty different properties: e.g. they disagree on how ‘valuable’ it is to permanently set the ACR register to contain MAX_INT (the second sketch at the end of this section shows them coming apart).
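
To make the training-loop picture above concrete, here is a minimal sketch of the kind of process I have in mind: a tabular softmax policy, a hand-written reward function of exactly the (state, action, next state) → real form described above, and a REINFORCE-style update. The environment, reward function, constants, and names are all made up for illustration, not anyone’s actual training setup; the point is just that only states the policy actually visits get their behavior shaped.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 3
logits = np.zeros((N_STATES, N_ACTIONS))   # tabular policy parameters

def reward_fn(state, action, next_state):
    # 'The reward function': a fixed map from (state, action, next state) to a real.
    return 1.0 if action == (state % N_ACTIONS) else 0.0

def step(state, action):
    # Toy dynamics: state 4 is never produced, so it is not 'easily reached'.
    return (state + action) % (N_STATES - 1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

LR = 0.5
for episode in range(2000):
    state = 0                              # every episode starts in the same state
    for t in range(3):
        probs = softmax(logits[state])
        action = rng.choice(N_ACTIONS, p=probs)
        next_state = step(state, action)
        r = reward_fn(state, action, next_state)
        # Reinforce: nudge up the log-probability of the taken action, scaled by reward.
        grad = -probs
        grad[action] += 1.0
        logits[state] += LR * r * grad
        state = next_state

# States that were visited a lot end up with near-reward-maximizing behavior;
# state 4 was never visited, so its action probabilities were never shaped at all.
print(softmax(logits[0]))   # sharply peaked on the rewarded action
print(softmax(logits[4]))   # still uniform
```

In this toy run the start state ends up with approximately reward-maximizing behavior, while the never-reached state keeps its untrained, uniform behavior: the reward function simply never got a chance to shape anything there, which is the sense in which goals need not generalize to states that are not easily reached.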
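
And here is an equally toy sketch of the two readings of ‘reward’ coming apart. All the names here (`acr_register`, `tamper`, the state labels) are hypothetical; the point is only that the number stored in the register and the value of the mathematical function can be made to disagree once the register itself gets edited.

```python
import sys

MAX_INT = sys.maxsize

def R(state, action, next_state):
    # 'The reward function': fixed math from (state, action, next state) to a real.
    return 1.0 if action == "be_approved_of" else 0.0

class TrainingProcess:
    def __init__(self):
        # The antecedent-computation-reinforcer (ACR) register: the number the
        # training process actually uses when reinforcing computations.
        self.acr_register = 0.0

    def observe(self, state, action, next_state):
        # Ordinarily the register just caches the reward function's output...
        self.acr_register = R(state, action, next_state)

    def tamper(self):
        # ...but if the register itself is overwritten, it holds MAX_INT even
        # though R assigns no value to the overwriting action.
        self.acr_register = MAX_INT

proc = TrainingProcess()
proc.observe("s0", "be_approved_of", "s1")
print(R("s0", "be_approved_of", "s1"), proc.acr_register)      # agree: 1.0 and 1.0

proc.tamper()
print(R("s1", "overwrite_register", "s2"), proc.acr_register)  # disagree: 0.0 vs MAX_INT
```

Under the register reading, the tampered situation is maximally ‘rewarding’; under the reward-function reading, it is worth nothing. That is the edge case where the two usages stop being interchangeable.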