The way I would approach this problem (after not much thought): Come up with a concrete system architecture A of a maximizing computer program that has an explicit utility function and is known to behave optimally. E.g. maybe it plays tic-tac-toe or four-in-a-row optimally.
Now mutate the source code of A slightly, such that it is no longer optimal, to get a system B. The objective is not modified. B still basically “wants” to be A, in the sense that if it is a general enough optimizer and has access to self-modification facilities, it would try to turn itself into A, because A is better at optimizing the objective.
I predict that by creating a setup where the delta between B and A is small, you can create a tractable problem without sidestepping the core bottlenecks, i.e. solving “correct self-modification” for a small delta between A and B seems like it requires solving some hard part of the problem. Once you have solved it, increase the delta and solve it again.
I'm unsure about the exact setup for giving the systems the ability to self-modify. I intuit that one can construct a toy setup that generates good insight, such that B doesn’t actually need to be that powerful or that general an optimizer.
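To make the A/B idea above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: I swap tic-tac-toe for a tiny subtraction game (take 1 to 3 stones, whoever takes the last stone wins) to keep it short, and the names `policy_a`, `policy_b`, `score`, and `self_modify` are hypothetical. It also cheats in a way a real setup could not: B scores candidates against the known-optimal A, and "self-modification" is just picking from a fixed list of candidate programs.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # stones a player may remove per turn (assumed toy game)


def utility(winner: int, player: int) -> int:
    """Explicit objective: +1 if `player` took the last stone, else -1."""
    return 1 if winner == player else -1


@lru_cache(maxsize=None)
def value(pile: int) -> int:
    """Value for the player to move under optimal play: +1 win, -1 loss."""
    if pile == 0:
        return -1  # the previous player took the last stone
    return max(-value(pile - m) for m in MOVES if m <= pile)


def policy_a(pile: int) -> int:
    """System A: picks a move that maximizes the objective (optimal)."""
    return max((m for m in MOVES if m <= pile), key=lambda m: -value(pile - m))


def policy_b(pile: int) -> int:
    """System B: A with a small mutation: it never considers removing 3."""
    return max((m for m in MOVES[:2] if m <= pile), key=lambda m: -value(pile - m))


def play(first, second, pile: int) -> int:
    """Play one game; return the index (0 or 1) of the player who wins."""
    players = (first, second)
    turn = 0
    while True:
        pile -= players[turn](pile)
        if pile == 0:
            return turn
        turn = 1 - turn


def score(policy, piles=range(1, 13)) -> int:
    """Total utility of `policy` moving first against A, over several piles."""
    return sum(utility(play(policy, policy_a, p), 0) for p in piles)


def self_modify(current, candidates):
    """Keep `current` unless some candidate scores strictly higher."""
    return max([current, *candidates], key=score)


if __name__ == "__main__":
    print("score A:", score(policy_a), "score B:", score(policy_b))
    print("B becomes A:", self_modify(policy_b, [policy_a]) is policy_a)
```

The objective (`utility`) is identical for A and B; only the move search differs. Increasing the delta would mean making the mutation harder to spot or giving B a larger, less convenient space of candidate modifications than a two-element list.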