Reinforcement learning agents do two sorts of planning. One is applying the dynamics (world-modelling) network and running a Monte Carlo tree search (or something like it) over explicitly-represented world states. The other is implicit in the future-reward-estimate function. You want as much of the planning as possible to be of the first type:
- It’s much more supervisable. An explicitly-represented world state is more interrogable than the inner workings of a future-reward-estimate.
- It’s less susceptible to value-leaking. By this I mean alignment issues that arise when instrumentally-valuable goals (i.e. ones not directly part of the reward function) leak into the future-reward-estimate.
- You can also turn down the depth on the tree search. If the agent literally can’t plan beyond a dozen steps ahead, it can’t be deceptively aligned.
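To make the depth knob concrete, here is a minimal sketch of depth-limited planning over explicit model states (MuZero-style, heavily simplified: exhaustive lookahead stands in for the tree search, and the names `Model`, `plan`, and `act` are hypothetical placeholders, not any particular library's API). The point is that every intermediate state is an explicit object a supervisor can inspect, while the future-reward-estimate is only consulted at the search frontier, so capping `depth` caps how far ahead the agent can plan.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class Model:
    """Hypothetical container for the two learned components discussed above."""
    # dynamics(state, action) -> (next_state, reward): the world-modelling network
    dynamics: Callable[[Any, int], Tuple[Any, float]]
    # value(state) -> future-reward estimate; only consulted at the search frontier
    value: Callable[[Any], float]


def plan(model: Model, state: Any, actions: Sequence[int],
         depth: int, discount: float = 0.99) -> float:
    """Exhaustive depth-limited lookahead (a stand-in for real MCTS).

    Every intermediate `state` is an explicit object that a supervisor can
    log, inspect, or veto. The `depth` cap is the knob referred to above:
    past it, the opaque future-reward-estimate takes over and the agent
    cannot look any further ahead.
    """
    if depth == 0:
        return model.value(state)  # implicit planning begins here
    best = float("-inf")
    for a in actions:
        next_state, reward = model.dynamics(state, a)
        best = max(best, reward + discount * plan(model, next_state, actions,
                                                  depth - 1, discount))
    return best


def act(model: Model, state: Any, actions: Sequence[int], depth: int,
        discount: float = 0.99) -> int:
    """Pick the action whose depth-limited lookahead value is highest."""
    def q(a: int) -> float:
        next_state, reward = model.dynamics(state, a)
        return reward + discount * plan(model, next_state, actions,
                                        depth - 1, discount)
    return max(actions, key=q)
```

A real implementation would replace the exhaustive loop with a proper tree search (visit counts, UCB-style selection, a policy prior), but the structural point carries over: the value network only ever scores leaf states, so `depth` is a hard bound on how many steps of explicit lookahead the agent gets.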