It does seem like a decision transformer would be unlikely to disconnect the button in the setup given. But if you modify it to something like “Player 2 has an x% chance of pressing the button every timestep”, whether the model would disconnect the button is a question of how well the model generalises. Even if the shutdown is only slightly truncating the distribution of scores, if we condition on a far higher score and the model generalises well, it should figure out to disconnect the button.
(Another way it could figure this out is if the shutdowns are in fact correlated with the strategies, and it correctly anticipates that it wants to employ a shutdown-prone strategy.)
Re generalisation—decision transformers don’t really have strategies per se; they pick actions moment to moment, and might be systematically miscalibrated about what they’ll do in future timesteps. It is true that they’ll have some chance at every timestep, which will add up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn’t affect performance otherwise.
Re higher conditioning—I think this shouldn’t be true. For the sake of argument we can reframe it as a binary outcome, where the model’s final return (as a proportion of total possible return) becomes its chance of ‘winning’. The thing the model is figuring out is not ‘what action leads to me winning’, or even ‘what action is more likely in worlds where I win than worlds where I lose’; it’s ‘what action do I expect to see from agents that win’. If on turn 1, 99% of agents in the training set voluntarily slap a button that has a 1% chance of destroying them, and 50% of the survivors go on to win (as do 50% of the agents that didn’t slap the button), then a DT will (correctly) learn that ‘almost all agents which go on to win tend to slap the button on turn 1’.
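To make the arithmetic explicit, here is a sketch of the button-slapping example using the numbers from the paragraph above (the structure of the calculation is just Bayes’ rule; nothing beyond the stated figures is assumed):

```python
# Numbers from the example: 99% of agents slap the button on turn 1,
# the button destroys 1% of slappers, and 50% of surviving agents win
# (whether or not they slapped).
p_slap = 0.99
p_win_given_slap = (1 - 0.01) * 0.5       # survive the button, then win: 0.495
p_win_given_no_slap = 0.5

# P(slap | win) via Bayes' rule
p_win = p_slap * p_win_given_slap + (1 - p_slap) * p_win_given_no_slap
p_slap_given_win = p_slap * p_win_given_slap / p_win
print(f"{p_slap_given_win:.4f}")  # ~0.9899: almost all winners slapped the button
```

So conditioning on winning barely moves the probability of slapping at all: the action’s high base rate dominates.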
Re correlation—sure, I am taking the liberal assumption that there’s no correlation in the training data, and indeed a lot of this rests on the training data having a nice structure.
I’m not sure I understand what the return being the chance of winning would look like. Hopefully not much is changed if I imagine the reward being whether or not the model wins or loses, but let me know if not.
I do think in this setup, with a binary or very squashed reward signal, the model will just predict standard winning trajectories. My setup depends on asking the model to generalise out of distribution. That is, when predicting scores, ask for a score higher than any seen before. As long as the dynamics of the shutdown are sufficiently visible in the training data, the model could (in principle) generalise to realise that a very high score would require disconnecting the button.
I think you think that if the button presses only ever happen at timestep 100 (say), with a uniform 1% probability, then this model would put ~1% on disconnecting the button in this scenario, which I agree with.
I think this suggests decision transformers will only rarely deal with rare threats, but can learn to deal with rare bottlenecks with high likelihood.
“It is true that they’ll have some chance at every timestep, which will add up over time, but if you were actually trying to implement this then you could do things like lowering the temperature, which shouldn’t affect performance otherwise.”
I don’t follow this sentence. I more-or-less agree with the previous one (though there may be lots of ways for the trajectories to leak information about what overall ‘strategy’ the model is pursuing, and of course people will try and get that information in there by conditioning on more information).
The ‘reward being chance of winning’ stuff changes a bit about how the model generalises if it’s playing a game with randomness and conditioned on the upper end—it biases the model towards ‘expecting risk to pay off’. E.g. if the model plays a 1-step game where it either banks 1 point or gets a 1% chance of 10 points, then conditioning on it getting 10 points will cause it to take the lower-EV action. But this isn’t super relevant.
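A minimal sketch of that 1-step game (the uniform prior over the two actions is my assumption, purely for illustration):

```python
# 1-step game from the example: bank 1 point, or take a 1% shot at 10 points.
prior = {"bank": 0.5, "gamble": 0.5}     # assumed action frequencies in training
p_ten = {"bank": 0.0, "gamble": 0.01}    # P(return = 10 | action)

# Condition on the return being 10:
# P(action | return=10) is proportional to P(return=10 | action) * P(action)
unnorm = {a: p_ten[a] * prior[a] for a in prior}
z = sum(unnorm.values())
posterior = {a: u / z for a, u in unnorm.items()}
print(posterior["gamble"])  # 1.0: only the gamble can produce a return of 10,
                            # even though its EV (0.1) is below banking (1.0)
```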
In general I am pretty confused about how models would generalise strategies out of distribution, and this seems like at least partly an empirical ML / science of deep learning problem. But my intuition is that no, even if the model ‘understands’ that an action causes a good outcome [ P(outcome | action) ], it still might not take that action if the action is unlikely on priors [ P(action) ], because the basic driver of its choice is what action it will observe given the outcome [ P(action | outcome) ], which is proportional to both of those quantities (and in fact, probably equal, with some normalising constant P(outcome)).
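As a toy illustration of that proportionality (the specific numbers are made up): an action can sharply raise the chance of the outcome and yet stay improbable under the conditional, because its prior is tiny.

```python
# Made-up numbers: "disconnect" nearly guarantees winning but is very rare in
# training; "normal" play is common but wins less often.
p_action = {"disconnect": 0.001, "normal": 0.999}   # P(action), the prior
p_win = {"disconnect": 0.9, "normal": 0.3}          # P(outcome | action)

# P(action | outcome) is proportional to P(outcome | action) * P(action)
unnorm = {a: p_win[a] * p_action[a] for a in p_action}
z = sum(unnorm.values())                            # the normaliser, P(outcome)
post = {a: u / z for a, u in unnorm.items()}
print(f"{post['disconnect']:.4f}")  # ~0.0030: still rare, despite causing wins
```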
Re lowering the temperature: the thing the predictive model learns is usually a set of logits that you normalise into a probability distribution with a softmax function, and you can easily add a temperature to the softmax to make it favour more likely completions and disfavour unlikely ones. This would greatly decrease the turn-by-turn chance of disconnecting the button (and therefore the overall chance), but probably not impair the model’s performance elsewhere much.
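A sketch of what that temperature scaling looks like (the logits here are invented for illustration; nothing is assumed about the actual model):

```python
import math

def softmax(logits, temperature=1.0):
    """Normalise logits into probabilities; temperature < 1 sharpens the
    distribution toward the highest-logit (most likely) completions."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Invented logits for two actions: [keep playing normally, disconnect button]
logits = [4.0, 0.0]
p_hot = softmax(logits, temperature=1.0)      # disconnect prob ~0.018
p_cold = softmax(logits, temperature=0.5)     # disconnect prob ~0.0003
```

Halving the temperature squares the odds ratio between the two actions, so a per-timestep probability that was already small gets driven down much further, which is the sense in which the cumulative chance over many timesteps can be suppressed without much cost elsewhere.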
The point of my setup is that [ P(outcome|corrigible action) ] is very small, so [ P(incorrigible action|outcome) ] is largeish, even if [ Frequency(corrigible action) ] is high and [ Frequency(incorrigible action) ] is low or absent.
And this is alignment relevant, because I expect people will ask for never before seen outcomes (by chance or on purpose), some of which may soft-require incorrigible actions.
(And of course there could be optimisation daemons that do treacherous turns even when asking for normal actions. But I think your post is setting that aside, which seems reasonable).