Formalizing Two Problems of Realistic World Models
I’m pleased to announce a new paper from MIRI: Formalizing Two Problems of Realistic World Models.
Abstract:
An intelligent agent embedded within the real world must reason about an environment which is larger than the agent, and learn how to achieve goals in that environment. We discuss attempts to formalize two problems: one of induction, where an agent must use sensory data to infer a universe which embeds (and computes) the agent, and one of interaction, where an agent must learn to achieve complex goals in the universe. We review related problems formalized by Solomonoff and Hutter, and explore challenges that arise when attempting to formalize analogous problems in a setting where the agent is embedded within the environment.
This is the fifth of six papers discussing active research topics that we’ve been looking into at MIRI. It discusses a few difficulties that arise when attempting to formalize problems of induction and evaluation in settings where an agent is attempting to learn about (and act upon) a universe from within. These problems have been much discussed on LessWrong; for further reading, see the links below. This paper is intended to better introduce the topic, and motivate it as relevant to FAI research.
The (rather short) introduction to the paper is reproduced below.
Two things that could very well come out of misunderstandings of the material:
If we have an agent whose actions affect future observations, why can’t we think of information about the agent’s embedding in the environment as being encoded in its observations? For example, in the heating-up game, we could imagine a computer that has sensors that detect heat emanating from its hardware, and that the data from those sensors is incorporated into the input stream of observations of the “environment”. The agent could then learn from past experience that certain actions lead to certain patterns of observations, which correspond to what people seem to mean when they say that it is giving off a certain amount of heat. We humans have a causal model in which those patterns in the computer’s observations are coming specifically from patterns in the data from specific sensors, which are triggered by the computer giving off heat, which are caused by those actions. A computer could well have an internal representation of a similar causal model, but that seems to me like part of a specific solution to the same general problem of predicting future observations and determining future actions given past observations and past actions. Even if an agent reconfigures its sensors, say, taking over every camera and microphone it can get control of over the internet, that new configuration of sensors will just get incorporated into the stream of observations, and it can “know” what it’s doing by predicting abrupt shifts in various properties of the observation stream.
It makes sense to me that modeling rewards as exogenously incorporated into observations doesn’t account for the possibility of specific predetermined values. But doesn’t the agent ultimately need to make decisions based purely on the information contained in its observations? We might externally judge the agent’s performance according to values that we have that depend on more information than the agent has access to, but if the ultimate goal is to program an agent to make decisions based on specific values, those values need to be at least approximated based purely on information the agent has, and we may as well define the values to be the approximation on which the agent bases its decisions. This absolutely makes it possible that under the vast majority of value systems the most “effective” agents would be ones that take over their inputs, but I think it makes sense in that case to say that those are the wrong value systems, rather than that the agent is ineffective. This doesn’t get rid of any of the ontological issues, it just shifts the burden of dealing with them to the definition of the value system, rather than to the fundamental setup of the problem of inductively optimizing an interaction, where it seems to me that the value system can be a parameter of the problem in an unambiguous way:
We define a value system, V, to be a function that takes in a finite partial history of observations and actions and outputs an incremental reward. For a given agent A and environment M, the incremental reward under value system V would be $r^{M,A,V}_t = V(M^A_{\preceq t}, A^M_{\prec t})$, with total reward $R_{M,V}(A) = \sum_{t=1}^{\lceil M \rceil} r^{M,A,V}_t$ used to score the agent relative to this value system (where we limit our consideration to value systems where this necessarily converges for all environments and agents). Then we could measure the effectiveness of agent A relative to value system V as $\sum_{M \in \mathcal{T}} 2^{-\langle M \rangle} R_{M,V}(A)$.
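To make the proposed metric concrete, here is a minimal sketch over a tiny, hand-picked set of environments. Everything named here is an assumption made for illustration: the `Env` record, its `length` field (standing in for the description length of M) and `horizon` field (standing in for the number of timesteps), and the two toy environments. The real sum ranges over all Turing machines and is not computable; this only shows the shape of the bookkeeping.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List, Sequence

Obs = int
Act = int


@dataclass
class Env:
    """Hypothetical toy environment (not from the paper)."""
    step: Callable[[Sequence[Act]], Obs]  # past actions -> next observation
    length: int   # stand-in for the description length of M, in bits
    horizon: int  # stand-in for the number of timesteps M runs for


Agent = Callable[[Sequence[Obs], Sequence[Act]], Act]
Value = Callable[[Sequence[Obs], Sequence[Act]], Fraction]


def total_reward(env: Env, agent: Agent, value: Value) -> Fraction:
    """Sum the incremental rewards r_t over the environment's horizon."""
    obs: List[Obs] = []
    acts: List[Act] = []
    total = Fraction(0)
    for _ in range(env.horizon):
        obs.append(env.step(acts))     # observation at t depends on actions before t
        total += value(obs, acts)      # r_t = V(observations up to t, actions before t)
        acts.append(agent(obs, acts))  # the agent then picks its action at t
    return total


def effectiveness(envs: Sequence[Env], agent: Agent, value: Value) -> Fraction:
    """Weight each environment's total reward by 2^(-length) and add them up."""
    return sum(
        (Fraction(1, 2 ** e.length) * total_reward(e, agent, value) for e in envs),
        Fraction(0),
    )


# Two assumed toy environments: one echoes the previous action, one always emits 1.
ECHO = Env(step=lambda acts: acts[-1] if acts else 0, length=1, horizon=3)
CONST = Env(step=lambda acts: 1, length=2, horizon=2)
ENVS = [ECHO, CONST]


def latest_obs(obs: Sequence[Obs], acts: Sequence[Act]) -> Fraction:
    """An assumed value system: the reward is simply the latest observation."""
    return Fraction(obs[-1])


def always_one(obs: Sequence[Obs], acts: Sequence[Act]) -> Act:
    return 1


def always_zero(obs: Sequence[Obs], acts: Sequence[Act]) -> Act:
    return 0


print(effectiveness(ENVS, always_one, latest_obs))   # 3/2
print(effectiveness(ENVS, always_zero, latest_obs))  # 1/2
```

Note that `latest_obs` rewards the agent purely for what it sees, which is exactly the kind of value system under which taking over one's own inputs would score well.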
Thanks for the comments! I’ll try to answer briefly.
It’s useful to think of the task as one of defining the intended intelligence metric. The Legg-Hutter metric can’t represent a heating-up game, because there isn’t a “heat” channel from the agent to the environment. Is it possible that an agent with high Legg-Hutter intelligence might be able to succeed on the heating-up game, given a heat channel? Yes, this is possible. But AIXItl would almost certainly not be able to do this (it cannot consider limiting computation for a few timesteps), and you shouldn’t expect agents with high LH score to do this, because this isn’t what LH measures. Embedding an agent with high LH score in a heating-up game violates an assumption under which the agent was shown to behave well. The problem here is not this one game in particular; the problem is that we still don’t know how to define the actual (naturalized) intelligence metric we care about. If we could formalize a set of universes and an embedding rule that allows us to measure agents on problems where their physical embodiment matters, that would constitute progress.
You’re correct that the agent ultimately needs to choose based purely on information from its observations (well, that and the priors), but there’s a difference between agents that are attempting to optimize what they see, and agents that are attempting to optimize what actually happened. Yes, the latter is ultimately an observation-based decision process, but it’s a fairly complicated one (note the difficulty of cashing out the word “actually” and the need to worry about the agent’s beliefs and their accuracy). The problem of ontology identification is not one of avoiding the fact that the agent must decide based on observations “alone”, it’s one of figuring out how to build agents that do the specific type of observation-based decision that we prefer (e.g. optimizing for reality rather than sense data). The real question, after all, is “we want agents that optimize actual reality, how do we build them?”—this requires cashing out some parts of the question that are glossed over if you define environments as “a thing that spits out an observation and a reward in each timestep.”
With regard to your suggestion of a metric which allows the value function to vary, this is all well and good, but now how do I find the V that actually corresponds to my goals? Say I want the V which scores the agent well for maximizing diamond in reality. This requires specifying a function which (1) takes observations; (2) uses them along with priors and knowledge about how the agent behaves to compute the expected state of outside reality; and (3) computes how much diamond is actually in reality and scores accordingly. But that’s not a value function, that’s most of an AGI!
It’s fine to say that the utility function must ultimately be defined over percepts, but in order to give me the function over percepts that I actually want (e.g. one that figures out how reality looks and scores an agent for maximizing it appropriately), I need a value function which turns percepts into a world model, figures out what the agent is going to do, and solves the ontology identification problem in order to rate the resulting world history. A huge part of the problem of intelligence is figuring out how to define a function of percepts which optimizes goals in actual reality—so while you’re welcome to think of ontology identification as part of “picking the right value function”, you eventually have to unpack that process, and the ontology identification problem is one of the hurdles that arises when you try to do so.
I hope that helps!
Certainly the agent is ineffective. It destroyed information which could have reduced its value-uncertainty.
Suppose that instead of having well-defined actions AIXI only has access to observables and its reward function. It might seem hopeless, but consider the subset of environments containing an implementation of a UTM which is evaluating T_A, a Turing machine implementing action-less AIXI, in which the implementation of the UTM has side effects in the next turn of the environment. This embeds AIXI-actions as side effects of an actual implementation of AIXI running as T_A on a UTM in the set of environments with observables matching those that the abstract AIXI-like algorithm observes. To maximize the reward function the agent must recognize its implementation embedded in the UTM in M and predict the consequences of the side effects of various choices it could make, substituting its own output for the output of the UTM in the next turn of the simulated environment (to avoid recursively simulating itself), choosing the side effects to maximize rewards.
In this context counterfactuals are simulations of the next turns of M resulting from the possible side effects of its current UTM. To be useful there must be a relation from computation-suffixes of T_A to the potential side effects of the UTM. In other words the agent must be able to cause a specific side effect as a result of state-transitions or tape operations performed causally-after it has determined which side effect will maximize rewards. This could be as straightforward as the UTM using the most-recently written bits on the tape to choose a side effect.
In the heating-up game, the UTM running T_A must be physically implemented as something that has a side effect corresponding to temperature, which causally affects the box of rewards, and all these causes must be predictable from the observables accessible to T_A in the UTM. Similarly, if there is an anvil suspended above a physical implementation of a UTM running T_A, the agent can avoid an inability to increase rewards in the future of environments in which the anvil is caused to drop (or escape from the UTM before dropping it).
This reduces the naturalized induction problem to the tiling/consistent reflection problem; the agent must choose which agent it wants to be in the next turn(s) through side effects that can change its future implementation.
Thanks for the comments, this is an interesting line of reasoning :-)
An AIXI that takes no actions is just a Solomonoff inductor, and this might give you some intuition for why, if you embed AIXI-without-actions into a UTM with side effects, you won’t end up with anything approaching good behavior. In each turn, it will run all environment hypotheses and update its distribution to be consistent with observation—and then do nothing. It won’t be able to “learn” how to manage the “side effects”; AIXI is simply not an algorithm that attempts to do any such thing.
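As a rough illustration of "update and then do nothing", here is a toy stand-in for an inductor. It is only a sketch under strong assumptions: the real Solomonoff prior ranges over all programs and is uncomputable, whereas this uses three hand-picked hypotheses with made-up description lengths (`HYPOTHESES` and all names below are hypothetical).

```python
from fractions import Fraction
from typing import Callable, Dict, Sequence, Tuple

# Assumed toy hypothesis class: name -> (description length in bits, predictor).
# Each predictor maps a timestep index to the bit it expects to observe.
HYPOTHESES: Dict[str, Tuple[int, Callable[[int], int]]] = {
    "all_zeros": (1, lambda t: 0),
    "all_ones": (2, lambda t: 1),
    "alternating": (3, lambda t: t % 2),
}


def prior() -> Dict[str, Fraction]:
    """Weight each hypothesis by 2^(-length), then normalize."""
    weights = {name: Fraction(1, 2 ** length) for name, (length, _) in HYPOTHESES.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}


def update(posterior: Dict[str, Fraction], t: int, bit: int) -> Dict[str, Fraction]:
    """Drop hypotheses that mispredict the observed bit and renormalize the rest."""
    kept = {name: w for name, w in posterior.items() if HYPOTHESES[name][1](t) == bit}
    z = sum(kept.values())
    if z == 0:
        raise ValueError("no hypothesis in the toy class is consistent with the data")
    return {name: w / z for name, w in kept.items()}


def run_inductor(observations: Sequence[int]) -> Dict[str, Fraction]:
    """Condition on each observation in turn. No action is ever chosen."""
    posterior = prior()
    for t, bit in enumerate(observations):
        posterior = update(posterior, t, bit)
        # Nothing else happens here: there is no step that evaluates or
        # selects among possible side effects of the computation itself.
    return posterior


print(run_inductor([0]))        # all_zeros: 4/5, alternating: 1/5
print(run_inductor([0, 1, 0]))  # alternating: 1
```

The loop body contains prediction and conditioning only; any "management" of side effects would have to be added as a separate decision step that this algorithm simply does not have.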
You’re correct, though, that examining a setup where a UTM has side effects (and picking an algorithm to run on that UTM) is indeed a way to examine the “naturalized” problems. In fact, this idea is very similar to Orseau and Ring’s space-time embedded intelligence formalism. The big question here (which we must answer in order to be able to talk about which algorithms “perform well”) is what distribution over environments an agent will be rated against.
I’m not exactly sure how you would formalize this. Say you have a machine M implemented by a UTM which has side effects on the environment. M is doing internal predictions but has no outputs. There’s this thing you could do which is predict what would happen from running M (given your uncertainty about how the side effects work), but that’s not a counterfactual, that’s a prediction: constructing a counterfactual would require considering different possible computations that M could execute. (There are easy ways to cash out this sort of counterfactual using CDT or EDT, but you run into the usual logical counterfactual problems if you try to construct these sorts of counterfactuals using UDT, as far as I can tell.)
Yeah, once you figure out which distribution over environments to score against and how to formalize your counterfactuals, the problem reduces to “pick the action with the best future side effects”, which throws you directly up against the Vingean reflection problem in any environment where your capabilities include building something smarter than you :-)