I mostly agree with you here, so I’m not the best candidate to write a convincing story that routes around these problems. I think there is a narrow glimmer of hope shining through a sliver of the space here, though. Start with the premise that you have a powerful plan-making AGI, and set it to making plans for pivotal acts, or at least plans for maneuvering closer to being able to carry one out. Insist that the plans be human-readable and intended to be carried out by human actors. Have your powerful AGI, or other narrow AGIs, audit each plan as a whole and in parts. You can add small amounts of noise to the plan (e.g. swapping words for synonyms) to disrupt steganography, and you can test the auditing process itself on plans not written by the AGI.
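To make the noise idea concrete, here is a minimal sketch of the synonym-swapping step, assuming a simple word-level perturbation over the plan text. The synonym table, swap probability, and `add_noise` function are illustrative placeholders of my own, not anything specified in the comment above.

```python
import random

# Illustrative sketch: randomly swap words for synonyms so that any message
# hidden in exact word choice is disrupted before the plan is audited.
# The synonym table and swap probability are placeholders, not a real defense.
SYNONYMS = {
    "purchase": ["buy", "acquire"],
    "assist": ["help", "aid"],
    "begin": ["start", "commence"],
    "quickly": ["rapidly", "swiftly"],
}

def add_noise(plan_text: str, swap_prob: float = 0.3, seed: int | None = None) -> str:
    """Return a copy of the plan with some words replaced by synonyms."""
    rng = random.Random(seed)
    noised = []
    for word in plan_text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < swap_prob:
            noised.append(rng.choice(SYNONYMS[key]))
        else:
            noised.append(word)
    return " ".join(noised)

if __name__ == "__main__":
    plan = "begin by asking volunteers to assist and quickly purchase supplies"
    print(add_noise(plan, seed=0))
```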
You can wipe the AGI’s memory of the plan and perturb parts of the AGI with noise, so that the auditor is a similar (rather than identical) agent. You can fine-tune versions of the AGI on different datasets with different emphases, again to get slightly different agents (a rough sketch of this cross-auditing idea follows below). You can test the whole setup on smaller, safer plans, and then on small, legible steps towards maneuvering into position for a pivotal act (e.g. gaining money, or getting a chosen candidate elected).
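Here is a minimal sketch of what cross-checking a plan against several such variants might look like, assuming each variant can be wrapped as a function that scores a plan or plan fragment for risk. The `Auditor` type, `cross_audit`, `audit_in_parts`, and the threshold are hypothetical names for illustration, not a real interface to any model.

```python
from typing import Callable, Sequence

# Each "auditor" stands in for a slightly different AGI variant
# (memory-wiped, noised, or fine-tuned on a different dataset).
# It maps a plan (or plan fragment) to a risk score in [0, 1].
Auditor = Callable[[str], float]

def cross_audit(plan: str, auditors: Sequence[Auditor], risk_threshold: float = 0.1) -> bool:
    """Accept the plan only if every independent variant rates it low-risk."""
    return all(audit(plan) <= risk_threshold for audit in auditors)

def audit_in_parts(plan_steps: Sequence[str], auditors: Sequence[Auditor]) -> list[int]:
    """Return indices of steps that any auditor flags, for human review."""
    return [i for i, step in enumerate(plan_steps) if not cross_audit(step, auditors)]

if __name__ == "__main__":
    # Dummy auditors standing in for real model variants.
    auditors = [
        lambda s: 0.02,
        lambda s: 0.05,
        lambda s: 0.9 if "acquire root access" in s else 0.01,
    ]
    steps = ["rent office space", "hire a lawyer", "acquire root access to the registrar"]
    print(audit_in_parts(steps, auditors))  # -> [2]
```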
Hopefully, by taking a number of human-legible, human-enacted steps towards getting ready for a pivotal act, you can reduce the complexity of the plan needed to go the rest of the way. That in turn lets you use a less powerful AGI to make the plan, and the less superhumanly intelligent the AGI is, the better your odds of avoiding a failure of control or an undetected deceptive turn.
There are downsides to this approach. One big one is that it would probably take much more time to enact, and might have lower odds of success. Another is that there are a lot of potential infohazard risks to tiptoe around when trying to get human-readable plans out of an otherwise boxed AGI (extracting a plan is an opening in the box, though hopefully one that can be managed in a controlled, safe way).