Nice graphic!
What stops e.g. “QACI(expensive_computation())” from being an optimization process which ends up trying to “hack its way out” into the real QACI?
nothing, fundamentally; the user has to be careful about what computation they invoke.
That… seems like a big part of what having “solved alignment” would mean, given that you have AGI-level optimization aimed (indirectly, via a counterfactual) at evaluating this (IIUC).
one solution to this problem is to simply never use that capability (running expensive computations) at all, or to not use it until the iterated counterfactual researchers have developed proofs that any expensive computation they run is safe, or until they have very slowly and carefully built dath-ilan-style corrigible aligned AGI. a rough sketch of what that default-refuse policy could look like is below.
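as a minimal illustration of that policy, here's a sketch of a gate that refuses to run any expensive computation unless a machine-checkable safety proof has been verified first. all names here (`SafetyProof`, `verify_safety_proof`, `qaci_invoke`) are hypothetical, not part of any actual QACI implementation:

```python
# A minimal sketch of the "never run unproven computations" policy described
# above. Everything here is illustrative, assuming some future proof system
# exists; no such checker is implemented today.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SafetyProof:
    """A machine-checkable certificate that a computation is safe to run."""
    claim: str
    certificate: bytes


def verify_safety_proof(computation: Callable[[], object],
                        proof: Optional[SafetyProof]) -> bool:
    """Stub: a real checker would verify the certificate against the
    computation's code. Conservatively rejects everything for now."""
    return False


def qaci_invoke(computation: Callable[[], object],
                proof: Optional[SafetyProof] = None) -> object:
    """Run `computation` only if its safety proof checks out; otherwise
    refuse, implementing the 'simply never use that capability' default."""
    if not verify_safety_proof(computation, proof):
        raise PermissionError("no verified safety proof; refusing to run")
    return computation()
```

since `verify_safety_proof` rejects everything, this defaults to the first option above (never running expensive computations at all); swapping in a real proof checker would implement the second.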