The ‘reward being chance of winning’ stuff changes a bit about how the model generalises if it’s playing a game with randomness and is conditioned on the upper end of outcomes: it biases the model towards ‘expecting risk to pay off’. E.g. if the model plays a 1-step game where it either banks 1 point or takes a 1% chance of 10 points, then conditioning on it getting 10 points will cause it to take the lower-EV action (the gamble, with EV 0.1 rather than 1). But this isn’t super relevant.
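To make the 1-step game concrete, here is a minimal sketch. The prior over actions is a made-up number (any nonzero prior on the gamble gives the same posterior), and the names `prior`, `score_dist`, `expected_value` and `posterior_over_actions` are just for illustration.

```python
from fractions import Fraction

# Hypothetical prior over actions: an assumption, not something from the setup.
prior = {"bank": Fraction(9, 10), "gamble": Fraction(1, 10)}

# P(score | action) for the 1-step game: bank 1 point, or a 1% chance of 10 points.
score_dist = {
    "bank": {1: Fraction(1)},
    "gamble": {10: Fraction(1, 100), 0: Fraction(99, 100)},
}

def expected_value(action):
    """Expected score of taking `action` once."""
    return sum(score * p for score, p in score_dist[action].items())

def posterior_over_actions(score):
    """P(action | score) via Bayes' rule over the two actions."""
    joint = {a: prior[a] * score_dist[a].get(score, Fraction(0)) for a in prior}
    total = sum(joint.values())
    return {a: p / total for a, p in joint.items()}

ev = {a: expected_value(a) for a in prior}
post = posterior_over_actions(10)

print("EV:", ev)                         # banking has the higher EV
print("P(action | score = 10):", post)   # all mass on the gamble
```

Banking never yields 10 points, so conditioning on a score of 10 puts all the posterior mass on the gamble even though its EV is ten times lower.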
In general I am pretty confused about how models would generalise strategies out of distribution, and this seems like at least partly an empirical ML / science of deep learning problem. But my intuition is that no: even if the model ‘understands’ that an action causes a good outcome [ P(outcome | action) ], it still might not take that action if the action is unlikely on priors [ P(action) ], because the basic driver of its choice is what action it would expect to observe given the outcome [ P(action | outcome) ], which is proportional to the product of those two quantities (and in fact equal to it once you divide by the normalising constant P(outcome)).
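Written out, the relationship between those three quantities is just Bayes' rule. The sum over a' in the second line assumes a discrete set of candidate actions, which is my simplification:

```latex
P(\text{action} \mid \text{outcome})
  = \frac{P(\text{outcome} \mid \text{action}) \, P(\text{action})}{P(\text{outcome})},
\qquad
P(\text{outcome}) = \sum_{a'} P(\text{outcome} \mid a') \, P(a')
```

So a very small P(action) can outweigh a large P(outcome | action), and vice versa.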
Re lowering the temperature: what the predictive model outputs is usually a vector of numbers (logits) that you normalise into a probability distribution with a softmax function, and you can easily tweak the softmax (divide the logits by a temperature below 1) to make it favour more likely completions and disfavour unlikely ones. This would greatly decrease the turn-by-turn chance of disconnecting the button (and therefore the overall chance), but probably not impair the model’s performance elsewhere much.
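A rough sketch of the temperature point, under loud assumptions: the two logits are made up, the episode length of 100 turns is made up, and turns are treated as independent, which a real model's turns would not be.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by `temperature` (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: index 0 is an ordinary action,
# index 1 is the rare "disconnect the button" action.
logits = [0.0, -4.0]
turns = 100  # hypothetical episode length

def chance_of_rare_action(temperature):
    """Per-turn and whole-episode chance of the rare action, assuming independent turns."""
    p_rare = softmax(logits, temperature)[1]
    return p_rare, 1 - (1 - p_rare) ** turns

p_t1, overall_t1 = chance_of_rare_action(1.0)
p_t05, overall_t05 = chance_of_rare_action(0.5)

print(f"T=1.0: per-turn {p_t1:.5f}, per-episode {overall_t1:.3f}")
print(f"T=0.5: per-turn {p_t05:.5f}, per-episode {overall_t05:.3f}")
```

With these toy numbers, halving the temperature takes the per-turn chance from roughly 1.8% to roughly 0.03%, and the per-episode chance from most episodes to a few percent of them, while the most likely action stays the most likely.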
The point of my setup is that [ P(outcome|corrigible action) ] is very small, so [ P(incorrigible action|outcome) ] is largeish, even if [ Frequency(corrigible action) ] is high and [ Frequency(incorrigible action) ] is low or absent.
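Here is a toy version of that with numbers. Every number is hypothetical and chosen only to show the shape of the argument, and it assumes the model's prior over actions roughly tracks their training frequency. (If the incorrigible action were literally absent, the frequency would be 0 and the model would need some nonzero generalised prior for this to go through.)

```python
# Hypothetical training frequencies of each kind of action.
frequency = {"corrigible": 0.999, "incorrigible": 0.001}

# Hypothetical P(outcome | action) for a never-before-seen outcome
# that is almost unreachable by corrigible means.
p_outcome_given = {"corrigible": 1e-6, "incorrigible": 0.5}

def p_action_given_outcome():
    """P(action | outcome) via Bayes' rule, using frequency as the prior."""
    joint = {a: frequency[a] * p_outcome_given[a] for a in frequency}
    p_outcome = sum(joint.values())
    return {a: joint[a] / p_outcome for a in joint}

posterior = p_action_given_outcome()
print(posterior)
```

Even though the incorrigible action is a thousand times rarer on priors, it ends up with nearly all of the posterior mass, because the corrigible action almost never produces the requested outcome.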
And this is alignment-relevant, because I expect people will ask for never-before-seen outcomes (by chance or on purpose), some of which may soft-require incorrigible actions.
(And of course there could be optimisation daemons that do treacherous turns even when you ask for normal actions. But I think your post is setting that aside, which seems reasonable.)