Suppose we amend ASP to require the agent to output a full simulation of the predictor before saying “one box” or “two boxes” (or else the agent gets no payoff at all). Would that defeat UDT variants that depend on stopping the agent before it overthinks the problem?
(Or instead of requiring the agent to output the simulation, we could use the entire simulation, in some canonical form, as a cryptographic key to unlock an encrypted description of the problem itself. Prior to decrypting the description, the agent doesn’t even know what the rules are; the agent is told in advance only that decryption will reveal the rules.)
For the simulation-output variant of ASP, let’s say the agent’s possible actions/outputs consist of all possible simulations S_i (up to some specified length), concatenated with “one box” or “two boxes”. To prove that any given action has utility greater than zero, the agent must prove that the associated simulation of the predictor is correct. Where does your algorithm have an opportunity to commit to one-boxing before completing the simulation, if it’s not yet aware that any of its available actions has nonzero utility? (Or would that commitment require a further modification to the algorithm?)
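To make the action space concrete, here is a rough Python sketch of the simulation-output variant. Everything in it is an assumption for illustration: simulations are modeled as short strings over a toy alphabet, the "correct" simulation is a fixed string `true_trace`, and the dollar amounts are the usual Newcomb figures, which the thread itself does not specify. The names `actions` and `payoff` are hypothetical.

```python
from itertools import product

CHOICES = ("one box", "two boxes")

def actions(max_len, alphabet="01"):
    """Enumerate every (simulation, choice) pair with len(simulation) <= max_len.

    Simulations are stand-in strings; a real one would be a full execution
    trace of the predictor.
    """
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            for choice in CHOICES:
                yield ("".join(chars), choice)

def payoff(action, true_trace, predicted_choice):
    """Payoff of an action; zero unless the submitted simulation is correct.

    The 1,000,000 / 1,000 amounts are assumed (standard Newcomb values).
    """
    simulation, choice = action
    if simulation != true_trace:
        return 0  # wrong or missing simulation: no payoff at all
    box_b = 1_000_000 if predicted_choice == "one box" else 0
    box_a = 1_000 if choice == "two boxes" else 0
    return box_b + box_a
```

Under this toy model, every action whose simulation differs from `true_trace` is worth exactly zero, so the agent cannot show that any action has positive utility until it has identified the correct trace. That is the difficulty the question is pointing at.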
For the simulation-as-key variant of ASP, what principle would instruct a (modified) UDT algorithm to redact some of the inferences it has already derived?
Simulation-output: It would require a modification to the algorithm. I don’t find this particularly alarming, though, since the algorithm was intended as a minimally complex solution that behaves correctly for good reasons, not as a final, fully general version. To handle this, the agent would have to first (or at least early enough for the predictor to simulate it) look for ways to partition its output into pieces and consider choosing each piece separately. There would have to be some heuristic for deciding which partitionings of the output to consider and how much computational power to devote to each of them; the partitioning actually chosen would then be the one with the highest expected utility. Come to think of it, this might be trickier than I was thinking, because you would run into self-trust issues if you need to prove that you will output the correct simulation of the predictor. This could be fixed by delegating the task of fully simulating the predictor to an easier-to-model subroutine, though that would require further modification to the algorithm.
Simulation-as-key: I don’t have a good answer to that.
In the first problem, the agent could commit to one-boxing (through the mechanism I described in the link) and then finish simulating the predictor afterwards. Then the predictor would still be able to simulate the agent until it commits to one-boxing, and then prove that the agent will one-box no matter what it computes after that.
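A minimal sketch of the commit-then-simulate idea, under heavy simplifying assumptions: the agent is a Python generator that emits events one step at a time, the predictor can only afford `budget` steps of the agent, and seeing a `"commit"` event stands in for the predictor proving that the agent will one-box regardless of what it computes afterwards. The names `agent` and `predictor`, the event format, and returning `None` when no commitment is found are all hypothetical choices, not part of the mechanism referred to above.

```python
def agent(simulation_steps):
    """Toy agent: commit to one-boxing first, then grind through the simulation."""
    yield ("commit", "one box")      # cheap, happens before any heavy work
    trace = []
    for step in range(simulation_steps):
        trace.append(step)           # stand-in for one step of simulating the predictor
        yield ("work", step)
    # Final output: the completed simulation concatenated with the committed choice.
    yield ("output", (tuple(trace), "one box"))

def predictor(agent_factory, budget):
    """Run the agent for at most `budget` events and return its commitment, if seen.

    Trusting the commit event models the assumed proof that later computation
    cannot change the choice. Returns None if no commitment appears in time.
    """
    run = agent_factory()
    for _, (kind, value) in zip(range(budget), run):
        if kind == "commit":
            return value
    return None
```

For example, `predictor(lambda: agent(10**6), budget=5)` returns `"one box"` even though the agent still has a million steps of simulation ahead of it, because the commitment falls inside the predictor's budget.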
The second version of the problem seems more likely to cause problems, but it might work for the agent to refrain from using the information it pre-computed when modeling the predictor (even though it has to use that information to understand the problem). If the predictor is capable of verifying or assuming that the agent will correctly simulate it, it could skip the impossible step of fully simulating the agent fully simulating it, and just simulate the agent on the decrypted problem.