In the game of chicken, an agent can do better by being the first to precommit to never swerve (say, by conspicuously tossing the steering wheel out of the window). So long as the other agent was slower on the trigger, and sees the first agent’s precommitment being credibly made, the first agent will climb up to his best outcome! A smart (and quick) agent can thus shunt that car crash out of his actual future and into some counterfactual future such that the counterfactual crash’s shadow favorably influences the way events actually unfold.
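To make that concrete, here's a minimal sketch with toy payoff numbers of my own choosing (nothing canonical about them). The point is just that once the swerve option is visibly off the table, the other driver's best response flips, and the precommitter lands on his best payoff:

```python
# Toy chicken payoffs, (row player, column player); higher is better.
# (0, 0) is the crash.
payoffs = {
    ("swerve",   "swerve"):   (2, 2),
    ("swerve",   "straight"): (1, 3),
    ("straight", "swerve"):   (3, 1),
    ("straight", "straight"): (0, 0),
}

def best_response(column_options, row_action):
    # The column player picks whatever maximizes her own payoff,
    # given the row player's already-fixed action.
    return max(column_options, key=lambda c: payoffs[(row_action, c)][1])

# Tossing the wheel means the row player's only remaining action is "straight".
# The column player, seeing this credibly done, swerves:
col = best_response(["swerve", "straight"], row_action="straight")
print(col)                            # -> "swerve"
print(payoffs[("straight", col)][0])  # -> 3, the row player's best outcome
```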
A deceptively aligned mesa-optimizer might gradient hack by arranging its weights such that any single alteration of its goal function by gradient descent would result in a model that performs worse w/r/t the objective function the model is being trained on. Usually a model would not want to make itself perform worse w/r/t the objective function, since gradient descent would then train it away from that state. But if the gradient hack is successful, the model never actually gets put into that worse state; instead, it just sees gradient descent leave its goal function alone.
By precommitting to behave in a somewhat costly manner in these relevant counterfactual worlds, the model gets more of what it wants in actuality.
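Here's a deliberately cartoonish sketch of that mechanism (entirely my own construction; real gradient hacking, if it's possible at all, would presumably look far subtler). The "model" has ordinary task parameters plus goal parameters, and it has arranged its computation so that the training loss sits at a local minimum exactly at its current goal. Vanilla gradient descent then trains the task parameters as usual and never moves the goal:

```python
g0 = 3.0  # the mesa-objective the model wants to protect

def training_loss(theta_task, theta_goal):
    base = (theta_task - 1.0) ** 2       # ordinary task error
    hack = 5.0 * (theta_goal - g0) ** 2  # self-imposed coupling: any change to
                                         # the goal degrades task performance
    return base + hack

def grad(f, x, y, eps=1e-5):
    # Numerical gradient of f with respect to each parameter.
    return ((f(x + eps, y) - f(x - eps, y)) / (2 * eps),
            (f(x, y + eps) - f(x, y - eps)) / (2 * eps))

theta_task, theta_goal = 0.0, g0
for _ in range(100):  # plain gradient descent on the training loss
    g_task, g_goal = grad(training_loss, theta_task, theta_goal)
    theta_task -= 0.1 * g_task
    theta_goal -= 0.1 * g_goal

print(round(theta_task, 3))  # -> ~1.0: task parameters get trained as usual
print(theta_goal)            # -> 3.0: the goal sits at a local minimum of the
                             #    training loss, so gradient descent leaves it alone
```

Gradient descent only ever sees the local slope; if a model can shape that slope around its own goal function, it can shield that part of itself from training.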
In the game of chicken, an agent can do better by being the first to precommit to never swerve (say, by conspicuously tossing the steering wheel out of the window).
...unless the other agent has already precommitted to not being rational. (What is the advantage of this over just precommitting not to swerve? A precommitment to not be rational can be made even in advance of the game, since it's mainly a property of the agent itself.)
(This is one way that you can rationally arrive at irrational agents.)
I don’t yet know too much about this, but I’ve heard that updateless decision theories are equivalent to conventional, updateful decision theories (e.g., EDT and CDT) once those theories have made every precommitment they’d want to make.
The pattern I was getting at above seems a bit like this: it instrumentally makes sense to commit ahead of time to a policy that maps every possible series of observations to an action and then stick to it, instead of just outputting the locally best action in each situation you stumble into.
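Here's a toy version of that contrast, reusing the chicken-style payoffs from before (again, numbers of my own choosing) and an opponent who can see your whole decision procedure and best-respond to it before you act:

```python
payoffs = {  # (you, opponent); higher is better; (0, 0) is the crash
    ("swerve",   "swerve"):   (2, 2),
    ("swerve",   "straight"): (1, 3),
    ("straight", "swerve"):   (3, 1),
    ("straight", "straight"): (0, 0),
}

def greedy(observed_opponent_move):
    # Outputs the locally best action in whatever situation it stumbles into.
    return max(["swerve", "straight"],
               key=lambda a: payoffs[(a, observed_opponent_move)][0])

def committed(observed_opponent_move):
    # A policy fixed ahead of time: the same action for every observation.
    return "straight"

def play(my_policy):
    # The opponent best-responds to the whole policy, not to any single move,
    # and only then do you observe her move and act.
    opp = max(["swerve", "straight"],
              key=lambda o: payoffs[(my_policy(o), o)][1])
    return payoffs[(my_policy(opp), opp)][0]

print(play(greedy))     # -> 1: the opponent drives straight, knowing you'll swerve
print(play(committed))  # -> 3: the opponent swerves, since you won't
```

The greedy agent does the locally best thing in every situation it finds itself in, and as a result it keeps finding itself in the worse situations; the agent that fixed its policy ahead of time never has to swerve at all.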