Reinforcement learning agents do two sorts of planning. One is applying the dynamics (world-modelling) network and running a Monte Carlo tree search (or something like it) over explicitly-represented world states. The other is implicit in the future-reward-estimate function. You want as much of the planning as possible to be of the first type:
- It’s much more supervisable. An explicitly-represented world state is more interrogable than the inner workings of a future-reward-estimate.
- It’s less susceptible to value-leaking. By this I mean alignment issues that arise when instrumentally-valuable goals (i.e. ones not directly part of the reward function) leak into the future-reward-estimate.
- You can also turn down the depth on the tree search. If the agent literally can’t plan beyond a dozen steps ahead, it can’t be deceptively aligned.
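To make the depth knob concrete, here is a minimal sketch of depth-limited planning over explicit model states (MuZero-style, heavily simplified: exhaustive lookahead stands in for the tree search, and the names `Model`, `plan`, and `act` are hypothetical placeholders, not any particular library's API). The point is that every intermediate state is an explicit object a supervisor can inspect, while the future-reward-estimate is only consulted at the search frontier, so capping `depth` caps how far ahead the agent can plan.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class Model:
    """Hypothetical container for the two learned components discussed above."""
    # dynamics(state, action) -> (next_state, reward): the world-modelling network
    dynamics: Callable[[Any, int], Tuple[Any, float]]
    # value(state) -> future-reward estimate; only consulted at the search frontier
    value: Callable[[Any], float]


def plan(model: Model, state: Any, actions: Sequence[int],
         depth: int, discount: float = 0.99) -> float:
    """Exhaustive depth-limited lookahead (a stand-in for real MCTS).

    Every intermediate `state` is an explicit object that a supervisor can
    log, inspect, or veto. The `depth` cap is the knob referred to above:
    past it, the opaque future-reward-estimate takes over and the agent
    cannot look any further ahead.
    """
    if depth == 0:
        return model.value(state)  # implicit planning begins here
    best = float("-inf")
    for a in actions:
        next_state, reward = model.dynamics(state, a)
        best = max(best, reward + discount * plan(model, next_state, actions,
                                                  depth - 1, discount))
    return best


def act(model: Model, state: Any, actions: Sequence[int], depth: int,
        discount: float = 0.99) -> int:
    """Pick the action whose depth-limited lookahead value is highest."""
    def q(a: int) -> float:
        next_state, reward = model.dynamics(state, a)
        return reward + discount * plan(model, next_state, actions,
                                        depth - 1, discount)
    return max(actions, key=q)
```

A real implementation would replace the exhaustive loop with a proper tree search (visit counts, UCB-style selection, a policy prior), but the structural point carries over: the value network only ever scores leaf states, so `depth` is a hard bound on how many steps of explicit lookahead the agent gets.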