I have my own idea of “one simple trick”. Might as well solicit opinions here: consider the possibility that developing tools for providing useful summaries of an agent’s thinking could substantially lower the risk associated with AGI. If there’s an effective way to place trip-wires in the agent’s mind to inform us of its thought-crimes, we would know to ignore its protestations and could just pull the plug on it. Or better yet, have the plug be automatically pulled by the monitoring process. Perhaps an agent would be able to meditate its way around our safeguards, so as to shield itself from Jedi mind-tricks?
The concepts you propose are actually discussed at length in Nick Bostrom’s recent work, Superintelligence, although tripwires are only one of many security measures a safe design might employ. The book is a good introductory text on the various technical problems superintelligences (including AGI) carry with them, and it also gives an overview of the global political issues involved.
I really recommend it, if you’re sufficiently interested.