As far as I can see, UDT doesn’t have this problem. This post might be relevant. Or am I missing something?
Eh? That post seems to talk about an inherently Cartesian representation in which reality comes pre-sliced into an environment containing a program with hard-bounded inputs and outputs.
Not sure I agree. Let me try to spell it out.
Here’s what I always assumed a UDT agent to be doing: on one hand, it has a prior over world programs—let’s say without loss of generality that there’s just one world program, e.g. a cellular automaton containing a computer containing an operating system running the agent program. It’s not pre-sliced, the agent doesn’t know what transistors are, etc. On the other hand, the agent knows a quined description of itself, as an agent program with inputs and outputs, in whatever high-level language you like.
Then the agent tries to prove theorems of the form, “if the agent program implements such-and-such mapping from inputs to outputs, then the world program behaves a certain way”. To prove a theorem of that form, the agent might notice that some things happening within the world program can be interpreted as “inputs” and “outputs”, and the logical dependence between these is provably the same as the mapping implemented by the agent program. (So the agent will end up finding transistors within the world program, so to speak, while trying to prove theorems of the above form.) Note that the agent might discover multiple copies of itself within the world and set up coordination based on different inputs that they receive, like in Wei’s post.
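(For concreteness, here is a minimal toy sketch of that proof-search loop. It is not anything anyone has implemented; the `provable` argument stands in for an actual proof search over the world program, which is of course the hard part, and all the other names are hypothetical.)

```python
from itertools import product

def udt_decision(agent_inputs, outputs, possible_outcomes, utility, provable):
    """Pick the input->output mapping with the highest provably-implied utility."""
    best_policy, best_value = None, float("-inf")
    # Enumerate candidate policies: every mapping from inputs to outputs.
    for assignment in product(outputs, repeat=len(agent_inputs)):
        policy = dict(zip(agent_inputs, assignment))
        # Search for a theorem of the form "if the agent program implements
        # `policy`, then the world program ends up in `outcome`".
        for outcome in possible_outcomes:
            if provable(policy, outcome) and utility(outcome) > best_value:
                best_policy, best_value = policy, utility(outcome)
    return best_policy
```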
This approach also has a big problem, which is in a sense the opposite of the problem described in RobbBB’s post. Namely, it requires us to describe our utility function at the base level of reality, but that’s difficult because we don’t know how paperclips are represented at the base level of reality! We only know how we perceive paperclips. Solving that problem seems to require some flavor of Paul’s “indirect normativity”, but that’s broken and might be unfixable as I’ve discussed with you before.
Namely, it requires us to describe our utility function at the base level of reality, but that’s difficult because we don’t know how paperclips are represented at the base level of reality! We only know how we perceive paperclips.
In principle you could have a paperclip perception module which counts paperclips, define utility in terms of its output, and include huge penalties for world states where the paperclip perception module has been functionally altered (or, more precisely, for world states where you can’t prove that the paperclip perception module hasn’t been functionally altered).
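(A toy rendering of this proposal, with every name hypothetical: utility is whatever the paperclip perception module reports, except that any world state where the module’s integrity can’t be proven gets a huge penalty instead.)

```python
TAMPER_PENALTY = 10**9

def utility(world_state, count_paperclips, provably_intact):
    """count_paperclips: the perception module's output for this world state.
    provably_intact: whether we can prove the module was not functionally altered."""
    if not provably_intact(world_state):
        # Penalize states where tampering cannot be ruled out, not merely
        # states where tampering is proven -- matching the parenthetical above.
        return -TAMPER_PENALTY
    return count_paperclips(world_state)
```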
Note that a utility function in UDT is supposed to be a mathematical expression in closed form, with no free variables pointing to “perception”. So applying your idea to UDT would require a mathematical model of how agents get their perceptions, e.g. “my perceptions are generated by the universal distribution” like in UDASSA. Such a model would have to address all the usual anthropic questions, like what happens to subjective probabilities if the perception module gets copied conditionally on winning the lottery, etc. And even if we found the right model, I wouldn’t build an AI based on that idea, because it might try to hijack the inputs of the perception module instead of doing useful work.
I’d be really interested in a UDT-like agent with a utility function over perceptions instead of a closed-form mathematical expression, though. Nesov called that hypothetical thing “UDT-AIXI” and we spent some time trying to find a good definition, but unsuccessfully. Do you know how to define such a thing?
I’d be really interested in a UDT-like agent with a utility function over perceptions instead of a closed-form mathematical expression, though. Nesov called that hypothetical thing “UDT-AIXI” and we spent some time trying to find a good definition, but unsuccessfully. Do you know how to define such a thing?
My model of naturalized induction allows it: http://lesswrong.com/lw/jq9/intelligence_metrics_with_naturalized_induction/
Solving that problem seems to require some flavor of Paul’s “indirect normativity”, but that’s broken and might be unfixable as I’ve discussed with you before.
Do you have a link to this discussion?
Yes, see this post. Most of the discussion happened in PM exchanges, but I think you can still get the idea. Feel free to PM me for explanations if you want.
On the other hand, the agent knows a quined description of itself
I can’t remember if there was a specific problem that motivated this stipulation, so: is this necessary? E.g. humans do not have an exact representation of themselves (they have an exact existence, they ‘just are’, but only an inexact and partially false mental model), yet they can still sort of locate instantiations of themselves within models if you point them to roughly the right bit of the model. It feels like maybe a sufficiently advanced general intelligence should be able to recognise itself given some incomplete knowledge of itself, at least once it is looking in the right place, i.e. has located the thing it does not yet know is itself.
I guess the quining stipulation for UDT/the source-code stipulation for FAI is not meant to say that it is strictly necessary, but rather to guarantee that the intelligence is self-aware enough for us to make interesting statements about it? I.e. it’s not an axiom per se, but a tentatively stipulated property that makes it easier to talk about things, without any claim that it is actually necessary?
On the other hand, a proof of the necessity of something as complete as source code would not surprise me.
...it requires us to describe our utility function at the base level of reality...
(1) Fix the set of possible worlds (i.e. models) that have positive credence[1]. Specify VNM preferences over all the events/outcomes within those models that we wish to consider. Then, assuming the model we’re actually in[2] was among the hypotheses we fixed, we have a utility function over the actual model.
(2) Fix the set of models M that have positive credence, and choose any[3] language L that is powerful enough to describe them. Let F be a function which takes m in M and phi in the set of formulas of L as inputs, and outputs the interpretation of phi in m. E.g. if m is a Newtonian model, F(m,‘Me’) will return some configuration of billiard balls, whereas if m is a quantum model, F(m,‘Me’) will return some region of a universal wavefunction.
Now exhaustively specify preferences in the language, e.g. ‘Vanilla > Chocolate’. Then for each model (each possible base level of reality), the interpretation function F allows that model’s preferences to be generated. This saves up-front work: we can avoid figuring out F(m, phi) until we actually come across a particular m, rather than ‘computing’ every F(m, phi) in advance as (1) requires. It is also probably more realistic, in that it is closer to the way humans (and possibly AGIs) would figure out preferences over things in a model: have a vague idea of the concepts in the language, then cash them out as and when necessary. (A toy sketch follows the footnotes.)
[1] I say credence rather than measure, for I remain unconvinced that measure is a general enough concept for an FAI, or even humans, to make decisions with.
[2] A dubious and probably confused way to think about it, since a key point is that we’re not ‘actually in’ any one model; somewhat more accurately, we are every matching instantiation, and perhaps more accurately still, we’re the superposition of all those matching instantiations.
[3] It’s not trivial that we get the same results from any choice of language, i.e. that we can indeed choose arbitrarily.
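(For concreteness, here is a toy sketch of how (2) could be cashed out; all the names are hypothetical, and F is left as a stub since defining it is exactly the hard part.)

```python
# Preferences are written once in the shared language L, and the interpretation
# function F is only evaluated lazily, when we actually encounter a particular
# model m -- the up-front-work advantage over proposal (1).

PREFERENCES_IN_L = [("Vanilla", "Chocolate")]   # read each pair as: left > right

def F(m, phi):
    """Interpretation of formula phi of L inside model m.
    A Newtonian m might return a configuration of billiard balls,
    a quantum m a region of the universal wavefunction."""
    return m.interpret(phi)   # assumed to come bundled with the model

def preferences_in_model(m):
    """Cash out that model's concrete preference pairs on demand."""
    return [(F(m, better), F(m, worse)) for better, worse in PREFERENCES_IN_L]
```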
maybe a sufficiently advanced general intelligence should be able to recognise itself given some incomplete knowledge of itself
Yeah, that would be good, but we have no idea how to do it mathematically yet. Humans have intuitions that approximate that, but evolution probably didn’t give us a mechanism that’s correct in general, so we’ll have to come up with the right math ourselves.
Let F be a function which takes m in M and phi in the set of formulas of L as inputs, and outputs the interpretation of phi in m
The big problem is defining F. If you squint the right way, you can view Paul’s idea as a way of asking the AI to figure out F.
I endorse cousin_it’s explanation of how UDT is supposed to work, without using explicit phenomenological bridging rules or hypotheses. Is there anything about it that you don’t understand or disagree with?