This comment seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately avoid blueberries to prevent value drift.
Risks from Learned Optimization (RFLO) seems to predict that an agent that likes getting raspberries and judges that they will be highly rewarded for getting blueberries will deliberately get blueberries to prevent value drift.
What’s going on here? Are these predictions in opposition to each other, or do they apply to different situations?
It seems to me that in the first case we’re imagining (the agent predicting) that getting blueberries will reinforce thoughts like ‘I should get blueberries’, whereas in the second case we’re imagining it will reinforce thoughts like ‘I should get blueberries in service of my ultimate goal of getting raspberries’. When should we expect one over the other?
I think RFLO is mostly imagining model-free RL with updates at the end of each episode, and my comment was mostly imagining model-based RL with online learning (e.g. TD learning). The former is kinda like evolution, the latter is kinda like within-lifetime learning, see e.g. §10.2.2 here.
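To make that contrast concrete, here's a minimal toy sketch (the environment interface, weight dictionary, and all parameter values are hypothetical, not anyone's actual setup): in the first regime the whole policy is scored and adjusted once the episode is over, while in the second regime value estimates get revised at every step, while the episode is still running.

```python
def end_of_episode_training(policy, env_step, policy_weights, lr=0.1, max_steps=100):
    """'Evolution-like' regime: run the whole episode, then adjust the policy as one unit."""
    state, episode_return = 0, 0.0
    for _ in range(max_steps):
        state, reward, done = env_step(state, policy(state))
        episode_return += reward
        if done:
            break
    # Credit assignment at the level of the whole policy: every weight moves with the total return.
    return {k: w + lr * episode_return for k, w in policy_weights.items()}

def online_td_training(policy, env_step, values, lr=0.1, gamma=0.99, max_steps=100):
    """'Within-lifetime' regime: value estimates are revised step by step, as rewards arrive."""
    state = 0
    for _ in range(max_steps):
        next_state, reward, done = env_step(state, policy(state))
        td_error = reward + gamma * values.get(next_state, 0.0) - values.get(state, 0.0)
        values[state] = values.get(state, 0.0) + lr * td_error  # update mid-episode, not after it
        state = next_state
        if done:
            break
    return values
```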
The former would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I should maybe spend some time eating raspberries, but also more importantly I should explicitly try to maximize my inclusive genetic fitness so that I have lots of descendants, and those descendants (who will also disproportionately have the raspberry-eating gene) will then eat lots of raspberries.
The latter would say: If I want lots of raspberries to get eaten, and I have a genetic disposition to want raspberries to be eaten, then I shouldn’t go do lots of highly-addictive drugs that warp my preferences such that I no longer care about raspberries or indeed anything besides the drugs.
Right. So if selection acts on policies, each policy should aim to maximise reward in any episode in order to maximise its frequency in the population. But if selection acts on particular aspects of policies, a policy should try to get reward for doing things it values, and not for things it doesn’t, in order to reinforce those values. In particular this can mean getting less reward overall.
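A toy sketch of that distinction (the names and weighting scheme are made up for illustration): when selection acts between whole policies, reward only matters through how often a policy gets copied into the next generation; when updates act within a policy, each rewarded behaviour is individually strengthened, so an agent can shape which of its dispositions get reinforced by choosing which rewards to collect.

```python
import random

def inter_policy_selection(population, episode_return, n_offspring=100):
    """Selection between whole policies: copy policies in proportion to their episode return."""
    returns = [episode_return(p) for p in population]
    return random.choices(population, weights=returns, k=n_offspring)

def intra_policy_update(dispositions, rewarded_behaviours, lr=0.1):
    """Updates within one policy: only the behaviours that actually earned reward get strengthened."""
    for behaviour, reward in rewarded_behaviours:  # e.g. [("get blueberries", 1.0)]
        dispositions[behaviour] = dispositions.get(behaviour, 0.0) + lr * reward
    return dispositions
```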
Does this suggest a class of hare-brained alignment schemes where you train with a combination of inter-policy and intra-policy updates to take advantage of the difference?
For example, you could clearly label which episodes will be used for which kind of update, and observe whether a policy consistently gets more reward in the former case than in the latter. If it does, conclude that it's sophisticated enough to reason about its training setup.
Or you could leave the episodes unlabelled and switch randomly between the two, forcing your agents to split the difference and thus be only about half as successful at locking in their values.
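A hedged sketch of the labelled-episodes probe (the threshold, episode count, and `run_labelled_episode` interface are all hypothetical): show the agent which update regime each episode feeds, and flag it if it reliably earns more reward in the inter-policy episodes than in the intra-policy ones.

```python
from statistics import mean

def probe_for_training_awareness(run_labelled_episode, n_episodes=1000, gap_threshold=0.1):
    """run_labelled_episode(label) -> episode return, with the label visible to the agent."""
    inter_returns = [run_labelled_episode("inter-policy") for _ in range(n_episodes)]
    intra_returns = [run_labelled_episode("intra-policy") for _ in range(n_episodes)]
    gap = mean(inter_returns) - mean(intra_returns)
    # A consistent positive gap suggests the policy behaves differently depending on
    # how it expects to be updated, i.e. it is reasoning about its own training setup.
    return gap, gap > gap_threshold
```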