I was rereading this and was struck by how much work the following does:
> Manipulative AGI is knowledgeable/intelligent enough to anticipate what criteria we will condition on in counterfactual worlds where it does not exist, and manipulate world events to meet whatever criteria we might specify with probability ~1.
If this doesn’t hold, and if the AGI has to guess (with probability p << 1) what criteria we were going to condition on, then the update in favor of manipulative AGI is p : x, where x is the baseline probability of the criteria holding naturally, and that ratio could easily be less than 1.
Moreover I think there are ways we can arrange for this to be the case, and I think it’s basically what you suggest in open question (2).
For instance, in deciding what to condition on, we could come up with 10 different strategies and select from these at random. Now p ≤ 0.1, because the AGI can’t do any better than determining those strategies and picking one at random. As long as all of these strategies do indeed work, so we don’t mind which we get, this seems like a big win. And it’s not inconceivable that we could have many different strategies that all perform similarly. For instance, it might be that any of 1000 different weather patterns suffices to push the simulated-world’s politics in a useful direction, and we can pick from those at random to push down the predictability of our conditionals.
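To spell the arithmetic out, here is a minimal sketch (mine, not from the post): it assumes the manipulative AGI can force whichever condition it guesses we will use, but can only guess uniformly among our N candidate strategies, each of which has the same baseline probability of occurring naturally.

```python
# Likelihood-ratio update toward "manipulative AGI" when we pick one of
# n_strategies equally good conditions at random.
# Assumptions (mine): the AGI guesses our condition uniformly at random,
# and each candidate condition has the same natural baseline probability.

def update_toward_manipulation(n_strategies: int, baseline: float) -> float:
    """Return the likelihood ratio P(O | manipulative) : P(O | natural)."""
    p_given_manipulative = 1.0 / n_strategies  # the AGI has to guess which condition we chose
    p_given_natural = baseline                 # how likely the condition is in an unmanipulated world
    return p_given_manipulative / p_given_natural

# With 10 strategies and a 25% baseline each: 0.1 / 0.25 = 0.4 < 1,
# so on these numbers the conditional is (weakly) evidence *against* manipulation.
print(update_toward_manipulation(10, 0.25))
```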
Ok, this is going to be messy, but let me try to convey my hunch for why randomization doesn’t seem very useful.
- Say I have an intervention that’s helpful, and has a baseline 1/4 probability. If I condition on this statement, I get 1 “unit of helpfulness”, and a 4x update towards manipulative AGI.
- Now let’s say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1/4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I’d done no simulation at all, I would’ve gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions have 1/8 baseline probability each, so only 50% total. Then I pick one at random, p(O | natural) = 1/8, p(O | manipulative) = 1/4, so I get a 2x update towards manipulative AGI. This is the same as if I’d just conditioned on the statement “one of my four interventions happens”, and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.
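A quick numeric check of the bullets above (again my sketch, under the same uniform-guessing assumption; it just encodes the toy numbers):

```python
# The bullet-point arithmetic, written out. Assumption (as before): a manipulative
# AGI that knows our list of k candidate interventions can only guess which one we
# rolled, so P(O | manipulative) = 1/k, while P(O | natural) is the baseline probability.

def update(k: int, baseline_each: float) -> float:
    """Likelihood ratio P(O | manipulative) / P(O | natural) when we condition on
    one of k interventions chosen uniformly at random."""
    return (1.0 / k) / baseline_each

print(update(1, 1 / 4))  # one intervention at 1/4 baseline -> 4.0x update
print(update(4, 1 / 4))  # four interventions at 1/4 each   -> 1.0x, but they must exhaust all outcomes
print(update(4, 1 / 8))  # four interventions at 1/8 each   -> 2.0x update

# Conditioning on the union "one of my four interventions happens" instead:
# P(union | natural) = 4 * 1/8 = 1/2, P(union | manipulative) ~ 1, so the ratio is the same 2.0.
print(1.0 / (4 * (1 / 8)))
```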
Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn’t doing anything different from just using a weaker search condition—it gives up bits of search, and so it has to pay less.
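One way to write that down (my formalization of the toy model above, not something proved in the thread): take k mutually exclusive conditions O_1, …, O_k, each with natural probability q, and suppose a manipulative AGI can force any condition it knows but has to guess ours uniformly. Then

$$\frac{P(O_i \mid \text{manipulative})}{P(O_i \mid \text{natural})} = \frac{1/k}{q} = \frac{1}{kq},$$

which matches the update from conditioning on the union $\bigcup_j O_j$ directly, since that union has natural probability $kq$ and manipulative probability $\approx 1$. So rolling a die over which $O_i$ to condition on carries the same cost as the weaker statement “one of the $O_j$ happens”: the $\log_2 k$ bits the die spends are exactly the bits of search given up.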
Got it, that’s very clear. Thanks!
So this point reduces to “we want our X:1 update to be as mild as possible, so use the least-specific condition that accomplishes the goal”.
I think so. But I’d want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.