Here we’re stressing the difference between the simulator’s action and the simulation’s (HCH or Evan in your example) action. Obviously, if the simulation is non-myopic, then the simulation’s action will depend on the long-term consequences of this action (for the goals of the simulation). But the simulator itself only cares about answering the question “what would the simulation do next?”. Once again, that might mean that the simulator will think about the long-term consequences of the simulation’s action on the simulation’s goals, but the simulator doesn’t have this goal: such reasoning is completely instrumental to its task of simulation. And more generally, the simulator isn’t choosing its next action to make its future actions easier to predict (like a predict-o-matic would do).
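To make the distinction concrete, here’s a minimal toy sketch (hypothetical names, not anyone’s actual proposal or training setup) contrasting the myopic simulator’s objective with a non-myopic agent’s objective:

```python
# Toy sketch of the simulator/agent distinction, not a real training setup.

def simulator_next_action(candidates, p_simulation_would_take):
    """Myopic simulator: output its best guess of what the simulation does next.

    p_simulation_would_take(a) is the simulator's estimate that the simulated
    process (e.g. HCH) would take action a. Computing that estimate may involve
    reasoning about the simulation's long-term goals, but such reasoning is
    purely instrumental to prediction: the simulator is not optimizing over
    the consequences of its own outputs.
    """
    return max(candidates, key=p_simulation_would_take)


def agent_next_action(candidates, expected_long_run_value):
    """Non-myopic agent (e.g. a predict-o-matic): pick the action that best
    serves its own long-run objective, such as making the future easier to
    predict. This is the pressure a myopic simulator doesn't have.
    """
    return max(candidates, key=expected_long_run_value)
```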
That might sound like nitpicking, but this means something important: the simulator itself has no reason to be deceptive. It might output actions (as its best guess of what the simulation would do) that are deceptive, but only if the simulation itself is deceptive.
What does that give us?
If we manage to point the simulator at a simulation that is non-deceptive yet powerful, the myopic simulator will not introduce deception into the mix. By contrast, doing IRL on the simulation and then optimizing for the resulting reward would probably lead to Goodharting and deception because of mesa-optimizers.
Here Evan would probably say that HCH sounds like the right non-deceptive simulation; I’m less convinced that HCH will not be deceptive.
An obvious question is: why not just do imitation learning? Well, I expect (and I believe Evan expects too) that simulation is strictly more powerful than imitation, because it can build models of non-observed or ideal processes that we point it at.
If instead of having a single simulation, we have a simulator that can deal with a range of simulations (how some researchers and I are currently thinking about GPT-3 and LMs), then myopia lets you use the simulator to detect deception in the simulations, switch between simulations, and test different options, in a way that a “deceptive agent acting like a simulator” would not allow (because it would tamper with your experiments).
A tangent, but there’s also an argument that I’m writing up about why we should not expect models which simulate many different processes to be agents — spoiler: agents are bad at being simulators.
Even in the worst-case scenario where we make a simulator that simulates a deceptive agent, the simulator has no incentive to hide its “internal thoughts” about how the simulation works. That’s a very small comfort, but it might make interpretability easier because there is no adversarial pressure against it.
Yeah, that’s a subtle point.