I think the general principle you’re taking advantage of is:
Do a small amount of very predictable reasoning in PA+Sound(PA). Then use PA for the rest of the reasoning you have to do. When reasoning about other instances of agents similar to you, simulate their use of Sound(PA), and trust the reasoning that they did while confining themselves to PA.
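For concreteness, here is one standard way to write these principles down, reading Sound(PA) as the uniform reflection schema over PA (the notation is mine, not from the post):

```latex
% Sound(PA) as the uniform reflection schema over PA: one axiom per formula phi.
\[
  \mathrm{Sound(PA)} \;=\; \bigl\{\, \forall x \,\bigl( \mathrm{Prov}_{\mathrm{PA}}(\ulcorner \varphi(\dot{x}) \urcorner) \rightarrow \varphi(x) \bigr) \;:\; \varphi \text{ an arithmetic formula} \,\bigr\}
\]
% Con(PA) is a single sentence, equivalent over PA to the reflection instance
% for phi = (0 = 1).
\[
  \mathrm{Con(PA)} \;=\; \lnot\, \mathrm{Prov}_{\mathrm{PA}}(\ulcorner 0 = 1 \urcorner)
\]
```

So PA+Con(PA) amounts to keeping a single instance of the schema, which is why PA+Sound(PA) can do strictly more.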
In your example, PA+Con(PA) sufficed, but PA+Sound(PA) is more flexible in general, in ways that might be important. It also seems to solve the closely related problem of how to trust statements that you know were proven by an agent that reasons approximately the same way you do, for instance statements proven by your past self. If you proved X in the past and want to establish that this makes X true, you fully simulate everything your past self could have done in PA+Sound(PA)-mode before it confined itself to PA, and then you can just trust that the reasoning it did afterwards in PA-mode was correct, so you don't have to redo that part.
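Spelled out, the step where Sound(PA) does the work might look like this. It's only a sketch, with Prov_PA the standard provability predicate, and the appeal to Sound(PA) in step 3 is exactly the kind of predictable use the principle above budgets for:

```latex
% Trusting a past PA-mode proof of X, after re-simulating the past self's
% PA+Sound(PA)-mode prefix.
\begin{align*}
  &1.\ \text{past self produced a PA-proof of } X
      && \text{(read off from the simulated transcript)}\\
  &2.\ \mathrm{Prov}_{\mathrm{PA}}(\ulcorner X \urcorner)
      && \text{(formalizing 1; proof-checking is verifiable within PA)}\\
  &3.\ \mathrm{Prov}_{\mathrm{PA}}(\ulcorner X \urcorner) \rightarrow X
      && \text{(instance of Sound(PA))}\\
  &4.\ X
      && \text{(modus ponens from 2 and 3)}
\end{align*}
```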
This might also allow an agent to make improvements to its own source code and still trust its future self, provided that the modified version only uses Sound(PA) in ways the current agent can predict and simulate. On the other hand, that condition might be a serious limitation on recursive self-improvement, since the successor agent would itself need to use Sound(PA) in order to pick its own successor, and it can't do so in ways that its predecessor can't predict.
Perhaps it's worse than that, and attempting to do anything nontrivial with this trick leads to a combinatorial explosion, with every instance of the agent trying to simulate every other instance's uses of Sound(PA). But I'm cautiously optimistic that it isn't quite that bad: simulating an agent that invokes Sound(PA) does not require you to invoke Sound(PA) yourself, so the simulations themselves can be run in PA-mode, and only the decision to run them needs to be made in PA+Sound(PA)-mode.
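As a rough illustration of why that division of labor stays cheap, here is a hypothetical sketch of the two-mode structure in code. The names, the Transcript decomposition, and the explicit budget of endorsed Sound(PA) uses are all my own scaffolding, not anything from the post:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Transcript:
    """What another instance of the agent did, as recorded by simulating it."""
    sound_pa_steps: List[str]  # conclusions it drew by appealing to Sound(PA)
    pa_conclusion: str         # statement it then proved purely within PA

def simulate(other_policy: Callable[[], Transcript]) -> Transcript:
    # Ordinary computation: run the other instance's code and record what it
    # did. Nothing here endorses its Sound(PA) appeals, so this can run in
    # PA-mode.
    return other_policy()

def trust(transcript: Transcript, endorsed_sound_pa_uses: List[str]) -> Optional[str]:
    # PA+Sound(PA)-mode: check that every Sound(PA) appeal the other instance
    # made is one we already, predictably, decided to endorse ourselves...
    if any(step not in endorsed_sound_pa_uses for step in transcript.sound_pa_steps):
        return None
    # ...then spend one reflection step ("it has a PA-proof of X, so X"),
    # after which its conclusion can be used in plain PA-mode.
    return transcript.pa_conclusion

# Tiny usage example with a made-up past self.
past_self = lambda: Transcript(sound_pa_steps=["Prov_PA('Y') -> Y"], pa_conclusion="X")
print(trust(simulate(past_self), endorsed_sound_pa_uses=["Prov_PA('Y') -> Y"]))  # prints X
```

The only point of the sketch is that simulate is plain execution, while trust is where the agent's own, pre-committed uses of Sound(PA) get spent.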
I think the right direction is to think about what we want from self-modification informally, and refine that into a toy problem that the agent in the post can’t solve. It’s confusing in a philosophical kind of way. I don’t have the right skill for it, but I’ve been trying to mine Eliezer’s Arbital writings.