Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic and seems valuable.
I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:
I don’t see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I take the reasoning to be: “lethal IC plans are more likely to succeed, therefore a given lethal plan has more minor (equally or only slightly more complex) variations that succeed than a given non-lethal plan does, therefore lethal plans are overrepresented in the space of successful plans”. But this doesn’t seem to go through a priori. You get the “way more successful variations” phenomenon whenever the outcome is overdetermined by a plan, but that doesn’t automatically make the plan more likely under a simplicity prior unless its extra complexity is small enough not to outweigh the count advantage. And a fully fleshed-out plan that goes all-in on IC and takes over the world might easily be substantially more complex than a plan that just pursues the goal directly, in which case why assume the IC plans dominate?
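To make the tradeoff I have in mind concrete (toy numbers, and a crude model of my own in which “plans” are programs under a simplicity prior; none of this is from the post): if the IC-flavoured family has $N_{\text{IC}}$ successful variants of description length roughly $K_{\text{IC}}$ bits, and the benign family has $N_{\text{benign}}$ successful variants at roughly $K_{\text{benign}}$ bits, then conditioning on success gives a posterior mass ratio of about

$$\frac{P(\text{IC family} \mid \text{success})}{P(\text{benign family} \mid \text{success})} \approx \frac{N_{\text{IC}}\, 2^{-K_{\text{IC}}}}{N_{\text{benign}}\, 2^{-K_{\text{benign}}}}.$$

So even if the IC family has a million times as many successful variants, 30 extra bits of complexity (a factor of $2^{-30} \approx 10^{-9}$) leaves the benign family with roughly a thousand times more posterior mass. The count advantage only wins if the complexity gap is small.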
I also don’t think weighting by plan complexity alone prioritises IC/lethal plans, unless you additionally weight by something like “probability of plan success relative to a prior”; in that case, sure, your distribution will upweight plans that just take over everything. But even then, simpler non-lethal plans might be likely enough to succeed that they still come out ahead. It feels like the implicit assumption is that the AI will be trying to maximise the probability of WBE, but why would it do that? That assumption seems to be where most of the danger is coming from. If it instead does something more like “search through plans and pick the first one that seems good enough”, then whether it selects a dangerous plan is a purely empirical question about its own inductive biases, and it seems odd to be so confident a priori about the danger here.
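To pin down what I mean by that last contrast, here’s a toy sketch (entirely made-up plan descriptions, numbers, and thresholds; just an illustration of maximising vs. satisficing, not anything from the post):

```python
# Toy sketch: the dangerous outcome comes from maximising estimated success
# probability, not from plan search per se. All plans/numbers are invented.

# Hypothetical plans: (description, complexity in "bits", estimated P(success), lethal?)
PLANS = [
    ("straightforward WBE research programme",        20, 0.65, False),
    ("WBE research plus modest resource acquisition", 25, 0.80, False),
    ("covert takeover, then run WBE unimpeded",       45, 0.99, True),
]

def maximiser(plans):
    """Pick whichever plan has the highest estimated success probability,
    ignoring complexity entirely; this is where the IC plan wins."""
    return max(plans, key=lambda p: p[2])

def satisficer(plans, bar=0.6):
    """Search plans in order of increasing complexity (a stand-in for the
    system's inductive bias) and return the first 'good enough' one."""
    for plan in sorted(plans, key=lambda p: p[1]):
        if plan[2] >= bar:
            return plan
    return None

print("maximiser picks: ", maximiser(PLANS)[0])   # -> the takeover plan
print("satisficer picks:", satisficer(PLANS)[0])  # -> the simple WBE plan
```

Which plan the satisficer lands on is entirely a function of the order it searches in, i.e. its inductive biases, which is why that question looks empirical to me rather than settled a priori.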