Why This Post Is Interesting
This post takes a previously very conceptually difficult alignment problem and shows that we can model it in a straightforward and fairly general way, just using good ol’ Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it’s clear what the problem is, it’s clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.
Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.
Warning: Inductive Gap
This post builds on top of two important pieces for modelling embedded agents which don’t have their own posts (to my knowledge). The pieces are:
Lazy world models
Lazy utility functions (or value functions more generally)
In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.
Lazy World Models
One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They’re embedded in the world, so the world must be at least as big as the agent plus whatever environment lies outside the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up.
That sounds tricky at first, but if you’ve done some functional programming before, then data structures like this are actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily—i.e. only query for list elements which we actually need, and never actually iterate over the whole thing.
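As a concrete illustration (a minimal Python sketch of my own, not from the original post): a generator represents the infinite list of natural numbers in constant memory, and we only ever compute the elements a query actually touches.

```python
from itertools import islice

def naturals():
    """The 'infinite list' 0, 1, 2, ...; nothing is computed until requested."""
    n = 0
    while True:
        yield n
        n += 1

# Constant memory for the whole structure; we only materialize what we query.
first_five = list(islice(naturals(), 5))  # [0, 1, 2, 3, 4]
```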
In the same way, we can represent a large world (potentially even an infinite world) using a smaller amount of memory. We specify the model via a generator, and then evaluate queries against the model lazily. If we’re thinking in terms of probabilistic models, then our generator could be e.g. a function in a probabilistic programming language, or (equivalently but through a more mathematical lens) a probabilistic causal model leveraging recursion. The generator compactly specifies a model containing many random variables (potentially even infinitely many), but we never actually run inference on the full infinite set of variables. Instead, we use lazy algorithms which only reason about the variables necessary for particular queries.
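To make the lazy-world-model idea concrete, here is a toy sketch (my own, with made-up dynamics, and a single lazily-sampled world rather than full inference): the model defines infinitely many random variables $X_0, X_1, X_2, \dots$, but a query about $X_3$ only forces evaluation of the handful of variables it depends on.

```python
import random
from functools import lru_cache

# Toy generator for an infinite family of random variables X_0, X_1, X_2, ...
# where each X_i depends only on X_{i-1}. Memoization makes repeated queries
# consistent: each variable is sampled once, lazily, the first time it's needed.
@lru_cache(maxsize=None)
def X(i: int) -> float:
    if i == 0:
        return random.gauss(0.0, 1.0)
    return 0.9 * X(i - 1) + random.gauss(0.0, 0.1)

# A query about X_3 only touches X_0 through X_3; the rest of the
# (infinite) model is never evaluated.
x3 = X(3)
```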
Once we know to look for it, it’s clear that humans use some kind of lazy world models in our own reasoning. We never directly estimate the state of the entire world. Rather, when we have a question, we think about whatever “variables” are relevant to that question. We perform inference using whatever “generator” we already have stored in our heads, and we avoid recursively unpacking any variables which aren’t relevant to the question at hand.
Lazy Utility/Values
Building on the notion of lazy world models: it’s not very helpful to have a lazy world model if we need to evaluate the whole data structure in order to make a decision. Fortunately, even if our utility/values depend on lots of things, we don’t actually need to evaluate utility/values in order to make a decision. We just need to compare the utility/value across different possible choices.
In practice, most decisions we make don’t impact most of the world in significant, predictable ways. (More precisely: the impact of most of our decisions on most of the world is wiped out by noise.) So, rather than fully estimating utility/value, we just calculate how each choice changes total utility/value, based only on the variables significantly and predictably influenced by the decision.
A simple example (from here): if we have a utility function $\sum_i f(X_i)$, and we’re making a decision which only affects $X_3$, then we don’t need to estimate the sum at all; we only need to estimate $f(X_3)$ for each option.
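In code, that comparison might look like the following rough sketch (the options, distributions, and $f$ are all invented purely to illustrate the point): to rank options we only need estimates of $\mathbb{E}[f(X_3)]$, because every other term in the sum is unchanged by the decision and cancels out of the comparison.

```python
import random

def f(x):
    # A made-up term of the utility sum.
    return -(x - 1.0) ** 2

def expected_f_X3(option, n=10_000):
    # Only X_3's distribution depends on the choice; everything else is untouched.
    shift = {"A": 0.0, "B": 0.5}[option]
    samples = (random.gauss(shift, 1.0) for _ in range(n))
    return sum(f(x) for x in samples) / n

# Ranking the options never requires estimating the full sum over all X_i.
best = max(["A", "B"], key=expected_f_X3)
```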
Again, once we know to look for it, it’s clear that humans do something like this. Most of my actions do not affect a random person in Mumbai (and to the extent there is an effect, it’s drowned out by noise). Even though I value the happiness of that random person in Mumbai, I never need to think about them, because my actions don’t significantly impact them in any way I can predict. I never actually try to estimate “how good the whole world is” according to my own values.
Where This Post Came From
In the second half of 2020, I was thinking about existing real-world analogues/instances of various parts of the AI alignment problem and embedded agency, in hopes of finding a case where someone already had a useful frame or even solution which could be translated over to AI. “Theory of the firm” (a subfield of economics) was one promising area. From Wikipedia:
In simplified terms, the theory of the firm aims to answer these questions:
Existence. Why do firms emerge? Why are not all transactions in the economy mediated over the market?
Boundaries. Why is the boundary between firms and the market located exactly there with relation to size and output variety? Which transactions are performed internally and which are negotiated on the market?
Organization. Why are firms structured in such a specific way, for example as to hierarchy or decentralization? What is the interplay of formal and informal relationships?
Heterogeneity of firm actions/performances. What drives different actions and performances of firms?
Evidence. What tests are there for respective theories of the firm?
To the extent that we can think of companies as embedded agents, these mirror a lot of the general questions of embedded agency. Also, alignment of incentives is a major focus in the literature on the topic.
Most of the existing literature I read was not very useful in its own right. But I generally tried to abstract out the most central ideas and bottlenecks, and generalize them enough to apply to more general problems. The most important insight to come out of this process was: sometimes we cannot tell what happened, even in hindsight. This is a major problem for incentives: for instance, if we can’t tell even in hindsight who made a mistake, then we don’t know where to assign credit/blame. (This idea became the post When Hindsight Isn’t 20/20: Incentive Design With Imperfect Credit Allocation.)
Similarly, this is a major problem for bets: we can’t bet on something if we cannot tell what the outcome was, even in hindsight.
Following that thread further: sometimes we cannot tell how good an outcome was, even in hindsight. For instance, we could imagine paying someone to etch our names on a plaque on a spacecraft and then launch it on a trajectory out of the solar system. In this case, we would presumably care a lot that our names were actually etched on the plaque; we would be quite unhappy if it turned out that our names were left off. Yet if someone took off the plaque at the last minute, or left our names off of it, we might never find out. In other words, we might not ever know, even in hindsight, whether our values were actually satisfied.
There’s a sense in which this is obvious mathematically from Bayesian expected utility maximization. The “expected” part of “expected utility” sure does suggest that we don’t know the actual utility. Usually we think of utility as something we will know later, but really there’s no reason to assume that. The math does not say we need to be able to figure out utility in hindsight. The inputs to utility are random variables in our world model, and we may not ever know the values of those random variables.
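Spelling that out (notation mine, not from the original post): the decision rule only ever involves an expectation under the agent’s own model,

$$a^* \;=\; \operatorname*{arg\,max}_a \; \mathbb{E}\big[\, u(X_1, X_2, \dots) \;\big|\; a,\ \text{observations}\,\big],$$

and nothing in this formula requires that the $X_i$ ever become observable, even after the decision is made.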
Once I started actually paying attention to the idea that the inputs to the utility function are random variables in the agent’s world model, and that we may never know the values of those variables, the next step followed naturally. Of course those variables may not correspond to anything observable in the physical world, even in principle. Of course they could be latent variables. Then the connection to the Pointers Problem became clear.
It seems like “generators” should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.
First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a “city” abstraction, let’s call it $P(\Lambda)$, which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically; let’s call their sum $F$. Then my probability distribution over Berlin’s structure is just $P(X_{\text{Berlin}}) = P(\Lambda \mid F)$.
Alternatively, suppose I want to model the low-level dynamics of some object I have an abstract representation for; in this case, suppose it’s the business scene of Berlin. I condition my abstraction of a business, $P(B)$, on everything I know about Berlin to get $P(B \mid X_{\text{Berlin}})$, then sample from the resulting distribution several times until I get a “representative set”. Then I model its behavior directly.
This doesn’t seem quite right, though.