Two things that could very well come out of misunderstandings of the material:
If we have an agent whose actions affect future observations, why can’t we think of information about the agent’s embedding in the environment as being encoded in its observations? For example, in the heating-up game, we could imagine a computer that has sensors that detect heat emanating from its hardware, and that the data from those sensors is incorporated into the input stream of observations of the “environment”. The agent could then learn from past experience that certain actions lead to certain patterns of observations, which correspond to what people seem to mean when they say that it is giving off a certain amount of heat. We humans have a causal model in which those patterns in the computer’s observations are coming specifically from patterns in the data from specific sensors, which are triggered by the computer giving off heat, which are caused by those actions. A computer could well have an internal representation of a similar causal model, but that seems to me like part of a specific solution to the same general problem of predicting future observations and determining future actions given past observations and past actions. Even if an agent reconfigures its sensors, say, taking over every camera and microphone it can get control of over the internet, that new configuration of sensors will just get incorporated into the stream of observations, and it can “know” what it’s doing by predicting abrupt shifts in various properties of the observation stream.
It makes sense to me that modeling rewards as exogenously incorporated into observations doesn’t account for the possibility of specific predetermined values. But doesn’t the agent ultimately need to make decisions based purely on the information contained in its observations? We might externally judge the agent’s performance according to values that we have that depend on more information than the agent has access to, but if the ultimate goal is to program an agent to make decisions based on specific values, those values need to be at least approximated based purely on information the agent has, and we may as well define the values to be the approximation on which the agent bases its decisions. This absolutely makes it possible that under the vast majority of value systems the most “effective” agents would be ones that take over their inputs, but I think it makes sense in that case to say that those are the wrong value systems, rather than that the agent is ineffective. This doesn’t get rid of any of the ontological issues, it just shifts the burden of dealing with them to the definition of the value system, rather than to the fundamental setup of the problem of inductively optimizing an interaction, where it seems to me that the value system can be a parameter of the problem in an unambiguous way:
We define a value system, V, to be a function that takes in a finite partial history of observations and actions and outputs an incremental reward. For a given agent A and environment M, the incremental reward at time t under value system V would be $r^{M,A,V}_t = V(M^A_{\preceq t}, A^M_{\prec t})$, with total reward $R_{M,V}(A) = \sum_{t=1}^{\lceil M \rceil}r^{M,A,V}_t$ used to score the agent relative to this value system (where we limit our consideration to value systems for which this sum necessarily converges for all environments and agents). Then we could measure the effectiveness of agent A relative to value system V as $\sum_{M \in \mathcal{T}}2^{-\langle M \rangle}R_{M,V}(A)$.
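A minimal sketch of how this metric could be computed, under strong simplifying assumptions: the set of environments is a small finite list rather than all of $\mathcal{T}$, each environment comes with a hand-assigned description length standing in for $\langle M \rangle$ and a finite horizon standing in for $\lceil M \rceil$, and observations and actions are plain integers. Every name below (`total_reward`, `effectiveness`, `echo_env`, and so on) is hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: histories are sequences of integer observations/actions.
History = Sequence[int]
Agent = Callable[[History, History], int]             # (obs up to t, actions before t) -> action
Environment = Callable[[History, History], int]       # (obs before t, actions before t) -> observation
ValueSystem = Callable[[History, History], Fraction]  # (obs up to t, actions before t) -> reward


def total_reward(env: Environment, agent: Agent, value: ValueSystem, horizon: int) -> Fraction:
    """R_{M,V}(A): sum of incremental rewards r_t for t = 1..horizon."""
    obs: List[int] = []
    acts: List[int] = []
    total = Fraction(0)
    for _ in range(horizon):
        obs.append(env(obs, acts))     # observation at time t
        total += value(obs, acts)      # r_t = V(observations up to t, actions before t)
        acts.append(agent(obs, acts))  # action at time t
    return total


def effectiveness(envs: List[Tuple[int, Environment, int]], agent: Agent, value: ValueSystem) -> Fraction:
    """Sum over environments of 2^(-description length) * R_{M,V}(A).

    Each entry is (assumed description length, environment, assumed horizon).
    """
    return sum(
        (Fraction(1, 2 ** length) * total_reward(env, agent, value, horizon)
         for length, env, horizon in envs),
        Fraction(0),
    )


# Toy environments (assumptions, not taken from the discussion above).
def echo_env(obs: History, acts: History) -> int:
    return acts[-1] if acts else 0     # shows the agent its previous action


def constant_env(obs: History, acts: History) -> int:
    return 1                           # ignores the agent entirely


def latest_observation(obs: History, acts: History) -> Fraction:
    return Fraction(obs[-1])           # value system: reward equals the newest observation


def always_one(obs: History, acts: History) -> int:
    return 1


def always_zero(obs: History, acts: History) -> int:
    return 0


ENVS = [(2, echo_env, 3), (1, constant_env, 3)]
```

With these toy choices, `effectiveness(ENVS, always_one, latest_observation)` weights the echo environment by 1/4 and the constant environment by 1/2, so the value system is just another argument to the scoring function, which is the sense in which it acts as a parameter of the problem.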
Thanks for the comments! I’ll try to answer briefly.
It’s useful to think of the task as one of defining the intended intelligence metric. The Legg-Hutter metric can’t represent a heating-up game, because there isn’t a “heat” channel from the agent to the environment. Is it possible that an agent with high Legg-Hutter intelligence might be able to succeed on the heating-up game, given a heat channel? Yes, this is possible. But AIXItl would almost certainly not be able to do this (it cannot consider limiting computation for a few timesteps), and you shouldn’t expect agents with high LH score to do this, because this isn’t what LH measures. Embedding an agent with high LH score in a heating-up game violates an assumption under which the agent was shown to behave well. The problem here is not this one game in particular; the problem is that we still don’t know how to define the actual (naturalized) intelligence metric we care about. If we could formalize a set of universes and an embedding rule that allows us to measure agents on problems where their physical embodiment matters, that would constitute progress.
You’re correct that the agent ultimately needs to choose based purely on information from its observations (well, that and the priors), but there’s a difference between agents that are attempting to optimize what they see, and agents that are attempting to optimize what actually happened. Yes, the latter is ultimately an observation-based decision process, but it’s a fairly complicated one (note the difficulty of cashing out the word “actually” and the need to worry about the agent’s beliefs and their accuracy). The problem of ontology identification is not one of avoiding the fact that the agent must decide based on observations “alone”; it’s one of figuring out how to build agents that do the specific type of observation-based decision that we prefer (e.g. optimizing for reality rather than sense data). The real question, after all, is “we want agents that optimize actual reality, how do we build them?” Answering it requires cashing out some parts of the question that are glossed over if you define environments as “a thing that spits out an observation and a reward in each timestep.”
With regard to your suggestion of a metric which allows the value function to vary, this is all well and good, but now how do I find the V that actually corresponds to my goals? Say I want the V which scores the agent well for maximizing diamond in reality. This requires specifying a function which (1) takes observations; (2) uses them along with priors and knowledge about how the agent behaves to compute the expected state of outside reality; and (3) computes how much diamond is actually in reality and scores accordingly. But that’s not a value function, that’s most of an AGI!
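A toy sketch may make the gap concrete. It assumes a hypothetical world with a hidden diamond count and a single camera percept, plus two hypothetical actions, `make_diamond` and `hack_camera`; none of these names come from the discussion, and the naive percept-based score is only one of many possible choices of V.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class World:
    """Hypothetical toy world: the true state is hidden from the agent."""
    diamonds: int = 0
    camera_hacked: bool = False

    def step(self, action: str) -> int:
        if action == "make_diamond":
            self.diamonds += 1
        elif action == "hack_camera":
            self.camera_hacked = True
        else:
            raise ValueError(f"unknown action: {action}")
        # The percept is what the camera reports, not the true state.
        return 100 if self.camera_hacked else self.diamonds


def run(policy: Callable[[int], str], steps: int) -> Tuple[List[int], int]:
    """Run a policy and return (percept history, true diamond count)."""
    world = World()
    percepts = [world.step(policy(t)) for t in range(steps)]
    return percepts, world.diamonds


def percept_value(percepts: List[int]) -> int:
    """A naive V defined purely over percepts: sum of camera readings."""
    return sum(percepts)


def honest_policy(t: int) -> str:
    return "make_diamond"


def tampering_policy(t: int) -> str:
    return "hack_camera"


honest_percepts, honest_diamonds = run(honest_policy, steps=5)
tamper_percepts, tamper_diamonds = run(tampering_policy, steps=5)
```

Here the tampering policy scores higher under `percept_value` while leaving the true diamond count at zero. The true count is only available in this sketch because the toy world exposes it to the experimenter; a V that recovered it from percepts alone would have to do the world modeling described in steps (1) through (3).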
It’s fine to say that the utility function must ultimately be defined over percepts, but in order to give me the function over percepts that I actually want (e.g. one that figures out how reality looks and scores an agent for maximizing it appropriately), I need a value function which turns percepts into a world model, figures out what the agent is going to do, and solves the ontology identification problem in order to rate the resulting world history. A huge part of the problem of intelligence is figuring out how to define a function of percepts which optimizes goals in actual reality. So while you’re welcome to think of ontology identification as part of “picking the right value function”, you eventually have to unpack that process, and the ontology identification problem is one of the hurdles that arises when you try to do so.
> This absolutely makes it possible that under the vast majority of value systems the most “effective” agents would be ones that take over their inputs, but I think it makes sense in that case to say that those are the wrong value systems, rather than that the agent is ineffective.
Certainly the agent is ineffective. It destroyed information which could have reduced its value-uncertainty.