Actually, our current concept of UDT should handle this problem automatically, at least in theory. I’ll try to explain how it works.
First, assume that the world is a computer program with known source code W. (The general case is a prior distribution over possible world-programs; the solution generalizes easily to that case.) Further imagine that the agent is also a computer program that knows its own source code A. The way the agent works is by investigating the logical consequences of its decisions; that is, it tries to find plausible mathematical statements of the form “A() == a logically implies W() == w” for different values of a and w.
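Here is a minimal sketch of that decision rule, under toy assumptions. None of these names come from any real library: `udt_decision`, `provable`, `ACTIONS`, `OUTCOMES`, and `utility` are illustrative placeholders, and `provable(a, w)` stands in for a bounded proof search that checks whether “A() == a logically implies W() == w” has a short proof from the source code of A and W.

```python
# A minimal sketch, not a real implementation.  All names are hypothetical
# placeholders; `provable(a, w)` stands in for a bounded proof search over
# statements of the form "A() == a logically implies W() == w".

def udt_decision(ACTIONS, OUTCOMES, provable, utility):
    """Return the action whose provable consequence has the highest utility."""
    best_action, best_utility = None, float("-inf")
    for a in ACTIONS:
        for w in OUTCOMES:
            # Look for a statement of the form "A() == a implies W() == w".
            if provable(a, w) and utility(w) > best_utility:
                best_action, best_utility = a, utility(w)
    return best_action
```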
One way of finding such statements is by inspecting the source code of W and noticing that there’s a copy of A (or its logical equivalent) embedded somewhere within it, and the return value of that embedded copy can be used to compute the return value of W itself. Note that this happens “implicitly”: we don’t need to tell the agent “where” it is within the world; it just needs to search for mathematical statements of the specified form. Also note that if W contains multiple logically equivalent copies of A (e.g. if they’re playing a symmetric PD, or someone somewhere is running a predictor simulation of A, etc.), then the approach handles that automatically too.
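To make the “embedded copy” point concrete, here’s a toy symmetric one-shot Prisoner’s Dilemma where W’s source contains two calls to A, re-using the `udt_decision` sketch above. The hard part, the proof search over W’s source, is faked here with a hand-written table of implications a bounded prover could notice (both embedded calls are the same program, so both players necessarily make the same move). Everything below is illustrative, not the actual construction.

```python
# Toy symmetric Prisoner's Dilemma: W runs the agent's source code twice.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def W():
    return PAYOFF[(A(), A())]          # two logically equivalent copies of A

# Implications a bounded prover could find by inspecting W's source:
# "A() == 'C' implies W() == 3" and "A() == 'D' implies W() == 1".
PROVABLE = {("C", 3), ("D", 1)}

def A():
    return udt_decision(
        ACTIONS=["C", "D"],
        OUTCOMES=[0, 1, 3, 5],
        provable=lambda a, w: (a, w) in PROVABLE,
        utility=lambda w: w,           # the agent just wants the payoff
    )

print(A())    # "C" -- cooperating is the provably better choice
print(W())    # 3
```

Note that nothing in `udt_decision` is told where A sits inside W; the cooperation falls out purely from which implications turn out to be provable.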
See Wei Dai’s original UDT post for another explanation. I’ve made many posts along these lines too; for example, this one describes “embodied” thingies that can dismantle their own hardware for spare parts and still achieve their values.
It sounds pretty wild. Do you think it would help any with the wirehead problem?
Yeah, it solves it.