Good point! Thinking about it, it seems like an analogue of Good’s theorem will apply.
Here’s some consolation though. We’ll be able to notice if the agent is choosing stochastically at the very beginning of each episode and then choosing deterministically afterwards. That’s because we can tell whether an agent is choosing stochastically at a timestep by looking at its final-layer activations at that timestep (assuming the agent samples its action from a distribution determined by those activations, e.g. a softmax). If one final-layer neuron activates much more than all the others, the agent is choosing (near-)deterministically; otherwise, it’s choosing stochastically.
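Here’s a minimal sketch of that detector, assuming a PyTorch agent that samples actions from a softmax over its final-layer activations (the 0.95 threshold is an arbitrary illustrative choice, not something the argument pins down):

```python
import torch

def is_near_deterministic(final_layer_activations: torch.Tensor,
                          threshold: float = 0.95) -> bool:
    """Flag a timestep as (near-)deterministic choice.

    Assumes the agent samples its action from a softmax over the
    final-layer activations; if one action's probability dominates,
    the choice is (near-)deterministic.
    """
    probs = torch.softmax(final_layer_activations, dim=-1)
    return probs.max().item() > threshold
```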
Because we can easily notice this behaviour, plausibly we can find some way to train against it. Here’s a new idea to replace the λn reward function. Suppose that, at some timestep, the agent’s final-layer activations indicate a (near-)deterministic choice: say the neuron corresponding to ‘Yes’ activates much more strongly than the neuron corresponding to ‘No’.
At this timestep, we train the agent using supervised learning. The ground truth is a vector of final-layer activations in which the activation of the neuron corresponding to ‘Yes’ equals the activation of the neuron corresponding to ‘No’. By training on this target, we update the agent directly towards stochastic choice between ‘Yes’ and ‘No’ at this timestep.
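A rough sketch of what one such supervised update could look like. Everything here beyond the core idea is an assumption: the agent’s forward pass returning final-layer activations, the `yes_idx`/`no_idx` indices, the choice to build the target by setting the ‘Yes’ and ‘No’ entries to their mean, and the MSE loss are all illustrative, since the text doesn’t specify how the rest of the ground-truth vector is filled in:

```python
import torch

def equalising_update(agent: torch.nn.Module,
                      observation: torch.Tensor,
                      yes_idx: int,
                      no_idx: int,
                      lr: float = 1e-3) -> None:
    """One supervised step towards equal 'Yes'/'No' activations.

    The ground-truth target is the agent's own activation vector,
    with the 'Yes' and 'No' entries both replaced by their mean
    (a hypothetical choice; the source only requires they be equal).
    """
    optimiser = torch.optim.SGD(agent.parameters(), lr=lr)
    activations = agent(observation)  # final-layer activations
    target = activations.detach().clone()
    mean = (target[yes_idx] + target[no_idx]) / 2
    target[yes_idx] = mean
    target[no_idx] = mean
    loss = torch.nn.functional.mse_loss(activations, target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

Note that the target is detached from the agent’s computation graph, so the gradient pushes the live activations towards the equalised copy rather than pulling both towards each other.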