This statement bugs me because I don’t see that any solution to that problem has been developed within LW’s avant-garde decision theories. In fact, they often introduce self-referential statements, so if anything the problem should be more pressing for them. The “AIXI-like agent” simply doesn’t have a self-representation; but an LWDT (LessWrong decision theory) agent, insofar as its decision theory revolves around self-referential propositions, does need a capacity for self-representation, and yet I don’t remember this problem being discussed much. It’s as if the problem has been overlooked because so much of the theory is discussed in natural language, and so the capacity for self-referential semantics that natural-language statements provide has been taken for granted.
There is actually a decades-old body of work in computer science on self-representation, for example under the name of “computational reflection”; it was the subject of the thesis of Pattie Maes at MIT. But there’s no magic bootstrap here, whereby ordinary reference turns into self-reference. It’s just a matter of coding up, by hand, structures that happen to represent aspects of themselves, and then giving them causal connections so that those representations co-vary appropriately with the things they represent. This would appear to be an adequate solution to the problem, and I don’t see that any alternative solutions have been invented locally.
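To make this concrete, here is a minimal sketch, in Python, of the kind of hand-coded self-representation I mean; the class and names are made up for illustration, not taken from Maes’s work or any actual system.

    # A hand-coded self-representation: the object keeps an explicit model of
    # part of its own state, and every state change also updates that model,
    # so the representation co-varies with the thing it represents. There is
    # no magic bootstrap; the correspondence is maintained by ordinary plumbing.

    class ReflectiveCounter:
        def __init__(self):
            self.count = 0                    # the actual state
            self.self_model = {"count": 0}    # the self-representation

        def increment(self):
            self.count += 1
            self.self_model["count"] = self.count  # keep the model in sync by hand

        def report(self):
            # Behaviour can be conditioned on the self-model rather than on
            # privileged access to the "real" state.
            return "I believe my count is %d" % self.self_model["count"]

    c = ReflectiveCounter()
    c.increment()
    print(c.report())  # I believe my count is 1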
A UDT agent does not need to be explicitly told how to represent itself, besides knowing its own code.
This is because a UDT agent does not make an attempt to explicitly control a particular instantiation of itself. A UDT agent looks at the world and tries to control the entire thing by controlling logical facts about its own behavior. If the world happens to contain patterns similar to the agent, then the agent will recognize that controlling its own output will control the behavior of those patterns. The agent could also infer that by destroying those patterns, it will lose its ability to control the world.
I think this is a nice idea, and it does deal with (at least this particular) problem with self-representation.
The bigger issue is that (as far as I know) no one has yet found a precisely specified version of ADT/UDT/TDT which satisfies our intuitions about what TDT should do.
Right.
To be more precise about the current state of the art: we don’t know any algorithm that can maximize its utility in a UDT-ish setting, but we do know algorithms that can hit a specified utility value, or ensure that utility is no less than some predefined value, if the underlying decision problem allows that at all. (Paul, I think you have already figured this out independently, right?)
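To illustrate the distinction with a toy of my own (not any of the actual constructions): in a world simple enough that the consequences of each possible output can be checked by direct evaluation, an agent can guarantee a predefined utility floor by returning the first output it can verify meets it, without ever finding the true maximum.

    # Toy illustration of "ensure utility is no less than some predefined value".
    # In the real setting the check below would be a search for a proof that
    # "A() == a implies U() >= target"; here the world is transparent enough
    # to evaluate directly. The action names and payoffs are made up.

    def utility_given_output(a):
        # Hypothetical stand-in for the utility of the world given A() == a.
        payoffs = {"defect": 1, "cooperate": 3, "wirehead": 0}
        return payoffs[a]

    def satisficing_agent(possible_outputs, target):
        for a in possible_outputs:
            if utility_given_output(a) >= target:
                return a
        return None  # the decision problem may not allow the target at all

    print(satisficing_agent(["defect", "cooperate", "wirehead"], target=3))  # cooperate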
That sounds pretty wild. Do you think it would help any with the wirehead problem?
Yes, I think it resolves it completely and this is part of what makes it interesting.
An ADT agent cares about some utility function which is independent of its experiences; for example, the number of paperclips that actually exist (viewing the universe as a mathematically well-defined, but uncertain, object).
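The type-level difference can be made explicit with a toy sketch (mine, not ADT itself): one utility function scores the agent’s observation stream, which a wirehead can inflate, while the other scores the modeled world, which sensor-tampering does not change.

    # Experience-based utility: rewards what the agent sees.
    def experience_utility(observations):
        return sum(1 for obs in observations if obs == "image_of_paperclip")

    # ADT-style utility: rewards what the (uncertain but well-defined) world contains.
    def world_utility(world_state):
        return world_state["paperclip_count"]

    # Tampering with the sensors inflates the first but not the second.
    tampered_observations = ["image_of_paperclip"] * 1000
    actual_world = {"paperclip_count": 2}
    print(experience_utility(tampered_observations))  # 1000
    print(world_utility(actual_world))                # 2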
If so, and if it could be made at all practical, I think that would be a major breakthrough. The current stories about wirehead-avoidance are not terribly convincing, IMO. Which is not to say that there’s not a solution—just that we do not yet really know how to implement one.
That is kind of impossible, though. All our knowledge of the world necessarily comes to us through our senses.
I had a brief look at it again. It seems very expensive. When making a decision, it is painful to start by integrating over all possible copies of agents who might be “like you”. In short, it doesn’t look remotely like what is most likely to come first.
Update 2011-06-28. OK, I finally figured out what you were talking about above—and it turns out that I don’t agree with it at all. The “LessWrong”-style decision theories that I am aware of so far don’t have any impact on the wirehead problem at all—as far as I can see.
Yes, but an agent can understand that its fixed utility function, which refers to the state of the entire universe, is not maximized by allowing itself to be deceived.
Well, possibly. I certainly have an idea about what “the state of the universe” refers to aside from my sensory perceptions of it. What we need math for is to see whether it is possible to build an agent whose belief that it is maximising such a quantity survives extensive self-knowledge about its own operation. Without supporting math, we don’t have much more than a story.
Well, I am an example of an agent who does not want to wirehead, for the reasons explained in the posts I linked to. I have some self-knowledge about my own operation, though not nearly as much as I would like (I don’t know how to program a computer to be me), but I doubt that more self-knowledge, barring valley effects, would do anything other than increase my ability to avoid wireheading.
Actually, our current concept of UDT should handle this problem automatically, at least in theory. I’ll try to explain how it works.
First, assume that the world is a computer program with known source code W. (The general case is a prior distribution over possible world-programs; the solution generalizes to that case easily.) Further imagine that the agent is also a computer program that knows its own source code A. The agent works by investigating the logical consequences of its decisions; that is, it tries to find plausible mathematical statements of the form “A() == a logically implies W() == w” for different values of a and w.
One way of finding such statements is by inspecting the source code of W and noticing that there’s a copy of A (or its logical equivalent) embedded somewhere within it, and that the return value of that embedded copy can be used to compute the return value of W itself. Note that this happens “implicitly”: we don’t need to tell the agent “where” it is within the world; it just needs to search for mathematical statements of the specified form. Also note that if W contains multiple logically equivalent copies of A (e.g. if they’re playing a symmetric PD, or someone somewhere is running a predictor simulation of A, etc.), the approach handles that automatically too.
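Here is a toy rendering of that search, in Python. The genuinely hard step, proving “A() == a logically implies W() == w” about opaque source code that happens to contain copies of A, is replaced by directly evaluating the world with the agent’s output fixed to a; the world below embeds the agent’s decision twice (a symmetric PD against an exact copy), so fixing the output fixes both players at once. The payoffs and names are invented for the sketch.

    # Toy rendering of the search for statements "A() == a implies W() == w".
    # The hard part (proving facts about opaque code containing copies of A)
    # is replaced by evaluating the world with the agent's output fixed to a.

    def world(agent_output):
        # The world embeds two logically equivalent copies of the agent, so
        # both "players" in this symmetric PD necessarily return agent_output;
        # only the diagonal of the payoff matrix is reachable.
        me, copy = agent_output, agent_output
        return 3 if (me, copy) == ("C", "C") else 1

    def udt_agent(possible_outputs):
        # For each candidate output a, derive the consequence w = W() given
        # A() == a, then return the output with the best consequence.
        consequences = {a: world(a) for a in possible_outputs}
        return max(consequences, key=consequences.get)

    print(udt_agent(["C", "D"]))  # C: the agent "cooperates" with its own copy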
See Wei Dai’s original UDT post for another explanation. I’ve made many posts along these lines too; for example, this one describes “embodied” thingies that can dismantle their own hardware for spare parts and still achieve their values.
It sounds pretty wild. Do you think it would help any with the wirehead problem?
Yeah, it solves it.
It’s probably not even our problem. ISTM that we could easily get to beyond-human level using agents that have walled-off brains and can’t self-modify or hack into themselves.
You can normally stop such an agent from bashing its own brains in with a bit of operant conditioning.