IDK if this is a crux for whether I think this is very relevant, from my perspective, but:
The training procedure you propose doesn’t seem to actually incentivize indifference. First, a toy model where I agree it does incentivize that:
On the first timestep, the agent gets a choice: pick a number k from 1 to N. The agent then has nothing at all to do for the first k steps, after which some game G starts. (Each play of G is i.i.d. and unrelated to k.)
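For concreteness, here's a minimal sketch of that toy environment in Python. The class name, the `game.play` interface for G, and the zero reward on idle steps are my own illustrative assumptions, not part of the original setup:

```python
class WaitThenPlayEnv:
    """Toy sketch: pick k on the first timestep, idle for k steps, then play G."""

    def __init__(self, N, game):
        self.N = N
        self.game = game  # each play of G is i.i.d. and unrelated to k
        self.k = None
        self.t = 0

    def step(self, action):
        self.t += 1
        if self.t == 1:
            # First timestep: the action is the chosen number k in 1..N.
            self.k = int(action)
            return 0.0
        if self.t <= self.k + 1:
            # Nothing at all to do for the first k steps.
            return 0.0
        # After the idle period, the game G starts (its payoff doesn't depend on k).
        return self.game.play(action)
```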
So this agent is indeed incentivized to pick k uniformly at random from 1 to N. Now consider:
The agent is in a rich world. There are many complex multi-step plans available, which incentivizes the agent to learn problem-solving. Each episode, at time N, the agent gets to choose: end now, or play 10 more steps.
Does this incentivize random choice at time N? No. It incentivizes the agent to randomly choose End or Continue at the very beginning of the episode, and then to carefully plan and execute behavior that achieves the most reward assuming a run of length N or N+10 respectively.
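Schematically, the worry is that training produces a policy with the following shape. Everything in this sketch (the class, the coin flip at reset, the planning stub) is my own illustration of the failure mode, not a claim about any actual trained agent:

```python
import random

class CommitThenPlanPolicy:
    """Flips a coin once at the start of the episode, then plans deterministically."""

    def __init__(self, N):
        self.N = N
        self.committed_horizon = None

    def reset(self):
        # The only stochastic choice: commit to a horizon of N or N+10 at t=0.
        self.committed_horizon = random.choice([self.N, self.N + 10])

    def act(self, t, observation):
        if t == self.N:
            # The "choice" at time N is already settled by the earlier coin flip.
            return "Continue" if self.committed_horizon > self.N else "End"
        # Every other action is the deterministic best action for the committed horizon.
        return self.best_action_for_horizon(observation, self.committed_horizon)

    def best_action_for_horizon(self, observation, horizon):
        # Placeholder for whatever careful planning the trained agent implements.
        raise NotImplementedError
```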
Wait, but isn’t this success? Didn’t we make the agent have no trajectory length preference?
No. Suppose:
Same as before, but now there’s a little guy standing by the End/Continue button. Sometimes he likes to press the button at random.
Do we kill the guy? Yes, we certainly do: he will mess up our careful plans.
Good point! Thinking about it, it seems like an analogue of Good’s theorem will apply.
Here’s some consolation though. We’ll be able to notice if the agent is choosing stochastically at the very beginning of each episode and then choosing deterministically afterwards. That’s because we can tell whether an agent is choosing stochastically at a timestep by looking at its final-layer activations at that timestep. If one final-layer neuron activates much more than all the other final-layer neurons, the agent is choosing (near-)deterministically; otherwise, the agent is choosing stochastically.
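As a rough sketch of this check (assuming, as my own simplification, that the final-layer activations are logits that get softmaxed into the action distribution, and with the 0.9 threshold chosen arbitrarily):

```python
import torch

def is_near_deterministic(final_layer_activations: torch.Tensor,
                          threshold: float = 0.9) -> torch.Tensor:
    """True where one action dominates the agent's choice distribution at this timestep."""
    # Treat the final-layer activations as logits over actions.
    probs = torch.softmax(final_layer_activations, dim=-1)
    # Near-deterministic iff the most-activated neuron takes almost all the probability mass.
    return probs.max(dim=-1).values > threshold
```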
Because we can easily notice this behaviour, plausibly we can find some way to train against it. Here’s a new idea to replace the λn reward function. Suppose the agent’s choice at some timestep is between ‘Yes’ and ‘No’:
At this timestep, we train the agent using supervised learning. Ground-truth is a vector of final-layer activations in which the activation of the neuron corresponding to ‘Yes’ equals the activation of the neuron corresponding to ‘No’. By doing this, we update the agent directly towards stochastic choice between ‘Yes’ and ‘No’ at this timestep.
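A minimal sketch of that supervised update in PyTorch; the MSE loss, the way the target is built from the current activations, and the `yes_idx`/`no_idx` arguments are my own guesses at how to operationalize ‘ground-truth with equal Yes/No activations’:

```python
import torch
import torch.nn.functional as F

def indifference_loss(final_layer_activations: torch.Tensor,
                      yes_idx: int, no_idx: int) -> torch.Tensor:
    """Supervised loss pushing the 'Yes' and 'No' final-layer activations towards equality."""
    # Ground truth: the current activations, but with the Yes and No entries
    # both replaced by their mean, so the target has equal Yes/No activation.
    target = final_layer_activations.detach().clone()
    mean = 0.5 * (target[..., yes_idx] + target[..., no_idx])
    target[..., yes_idx] = mean
    target[..., no_idx] = mean
    # Gradients flow only through the live activations, not the constructed target.
    return F.mse_loss(final_layer_activations, target)
```

Minimizing this loss leaves the other final-layer activations untouched in the target, so (in this sketch) behaviour at non-choice timesteps is not directly pushed around by the update.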