You postulate that a “sufficiently powerful optimizer” would find policies that set the registers directly (wirehead), but I’m not sure that is guaranteed. Such policies could well be non-computable for non-trivial reward circuits. One reason is that the optimizer is embedded and thus has to model itself to find the optimal policy, and this recursive self-modeling could be non-computable or amount to an instance of the halting problem (except in cases with a solvable fixed point). At the very least, this possibility has to be ruled out explicitly.
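To make the self-modeling worry a bit more concrete, here is a toy sketch in Python. It does not prove non-computability; it only illustrates the regress and the fixed-point exception. Everything in it is a hypothetical construction of mine: `bounded_choice` stands in for an embedded optimizer that predicts its own choice by running itself under a compute budget, and `contrarian` and `constant` are made-up "world" responses that map the predicted action to the best actual action.

```python
def bounded_choice(best_response, budget, default=0):
    """Embedded optimizer: to choose, it first predicts its own choice by
    running itself with one unit less budget, then best-responds to that."""
    if budget == 0:
        return default  # out of compute: fall back to an arbitrary guess
    predicted_own_choice = bounded_choice(best_response, budget - 1, default)
    return best_response(predicted_own_choice)


def fixed_points(best_response, actions):
    """Solvable case: actions that are their own best response, found by
    direct search rather than by open-ended self-simulation."""
    return [a for a in actions if best_response(a) == a]


def contrarian(predicted):
    # Hypothetical world that rewards doing the opposite of the prediction.
    return 1 - predicted


def constant(predicted):
    # Hypothetical world where action 1 is best whatever was predicted.
    return 1


# With the contrarian response the answer flips with the budget and never
# settles, and there is no fixed point to find.
flipping = [bounded_choice(contrarian, b) for b in range(6)]
no_fixed_point = fixed_points(contrarian, [0, 1])

# With the constant response the regress settles once budget >= 1, and the
# fixed point can also be found directly.
settled = [bounded_choice(constant, b) for b in range(1, 6)]
one_fixed_point = fixed_points(constant, [0, 1])
```

In the contrarian case more compute never helps, because the answer depends only on the parity of the budget. Whether the wireheading policy for a non-trivial reward circuit looks more like the contrarian case or the constant case is exactly what I think has to be argued.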