One possibility is a sort of proof by induction, where you start with code which has been inspected by humans, then that code inspects further code, etc.
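To make the shape of that induction concrete, here is a minimal sketch (all names are hypothetical, purely for illustration): the base case is code a human has inspected directly, and the inductive step is code inspected by something already trusted.

```python
def compute_trusted(human_inspected, inspects):
    """Return everything reachable through the trust relation.

    human_inspected: set of artifact names inspected directly by humans.
    inspects: dict mapping an inspector to the artifacts it inspects.
    """
    trusted = set(human_inspected)   # base case: human-inspected code
    frontier = list(trusted)
    while frontier:                  # inductive step, applied repeatedly
        inspector = frontier.pop()
        for artifact in inspects.get(inspector, []):
            if artifact not in trusted:
                trusted.add(artifact)
                frontier.append(artifact)
    return trusted


if __name__ == "__main__":
    # Humans inspect a verifier; the verifier inspects two tools;
    # one of those tools inspects a larger system.
    chain = {
        "verifier_v0": ["tool_a", "tool_b"],
        "tool_a": ["big_system"],
    }
    print(compute_trusted({"verifier_v0"}, chain))
    # trusted set contains verifier_v0, tool_a, tool_b, big_system
```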
Daemons and mindcrime seem most worrisome for superhuman systems, but a human-level system is plausibly sufficient to comprehend human values (and thus do useful inspections). For daemons, I think you might even be able to formalize the idea without leaning hard on any specific utility function. The best approach might involve utility uncertainty on the part of the AI which narrows over time: you gradually bootstrap your way to an understanding of human values, and along the way you avoid computational hazards according to your current best guesses about those values.
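A minimal sketch of that “utility uncertainty that narrows over time” idea, under my own assumptions (the hypotheses, hazard numbers, and thresholds below are all hypothetical): the agent keeps a posterior over candidate utility functions, updates it from feedback, and vetoes computations that look hazardous under its current best guess.

```python
import random

# Candidate hypotheses about what humans value, with prior weights.
hypotheses = {
    "values_privacy": 0.5,   # e.g. simulating detailed minds is a hazard
    "values_speed": 0.5,     # e.g. hazards are acceptable if useful
}

def update(posterior, hypothesis_favored, strength=2.0):
    """Narrow the posterior toward the hypothesis favored by new feedback."""
    new = {h: w * (strength if h == hypothesis_favored else 1.0)
           for h, w in posterior.items()}
    total = sum(new.values())
    return {h: w / total for h, w in new.items()}

def expected_hazard(posterior, action):
    """Hazard of an action, averaged over the current utility hypotheses."""
    # Hypothetical per-hypothesis hazard estimates for each action.
    hazard = {"simulate_minds": {"values_privacy": 0.9, "values_speed": 0.1},
              "formal_proof":   {"values_privacy": 0.0, "values_speed": 0.0}}
    return sum(w * hazard[action][h] for h, w in posterior.items())

posterior = dict(hypotheses)
for round_ in range(3):
    # Feedback (random here, standing in for human input) narrows uncertainty.
    posterior = update(posterior, random.choice(list(posterior)))
    for action in ("simulate_minds", "formal_proof"):
        allowed = expected_hazard(posterior, action) < 0.5
        print(round_, action, "allowed" if allowed else "vetoed")
```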
People already choose not to think about particular topics on the basis of information hazards and internal suffering. Sometimes these judgments are made as an interrupt, partway through thinking about a topic; others are outside-view judgments (“thinking about topic X always makes me feel depressed”).
Can you personally (under your own power) and confidently prove that a particular tool will only recursively trust safe and reliable tools, with this tree of trust reaching far enough to cover a superhuman AI?
On the other hand, you can “follow” the tree for a distance. You can prove a calculator trustworthy and use it in your subsequent proofs, for instance. This might make the task more feasible.
I don’t think proofs are the right tool here. Proof by induction was meant as an analogy.