The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables
An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly—even though I’d prefer the one without the genocide. Even if the AI does happen to show me some mass graves (probably secondhand, e.g. in pictures of news broadcasts), and I rank them low, it may just learn that I prefer my genocides under-the-radar.
The obvious point of such an example is that an AI should optimize for the real-world things I value, not just my estimates of those things. I don’t just want to think my values are satisfied, I want them to actually be satisfied. Unfortunately, this poses a conceptual difficulty: what if I value the happiness of ghosts? I don’t just want to think ghosts are happy, I want ghosts to actually be happy. What, then, should the AI do if there are no ghosts?
Human “values” are defined within the context of humans’ world-models, and don’t necessarily make any sense at all outside of the model (i.e. in the real world). Trying to talk about my values “actually being satisfied” is a type error.
Some points to emphasize here:
My values are not just a function of my sense data, they are a function of the state of the whole world, including parts I can’t see—e.g. I value the happiness of people I will never meet.
I cannot actually figure out or process the state of the whole world
… therefore, my values are a function of things I do not know and will not ever know—e.g. whether someone I will never encounter is happy right now
This isn’t just a limited processing problem; I do not have enough data to figure out all these things I value, even in principle.
This isn’t just a problem of not enough data, it’s a problem of what kind of data. My values depend on what’s going on “inside” of things which look the same—e.g. whether a smiling face is actually a rictus grin
This isn’t just a problem of needing sufficiently low-level data. The things I care about are still ultimately high-level things, like humans or trees or cars. While the things I value are in principle a function of low-level world state, I don’t directly care about molecules.
Some of the things I value may not actually exist—I may simply be wrong about which high-level things inhabit our world.
I care about the actual state of things in the world, not my own estimate of the state—i.e. if the AI tricks me into thinking things are great (whether intentional trickery or not), that does not make things great.
These features make it rather difficult to “point” to values—it’s not just hard to formally specify values, it’s hard to even give a way to learn values. It’s hard to say what it is we’re supposed to be learning at all. What, exactly, are the inputs to my value-function? It seems like:
Inputs to values are not complete low-level world states (since people had values before we knew what quantum fields were, and still have values despite not knowing the full state of the world), but…
I value the actual state of the world rather than my own estimate of the world-state (i.e. I want other people to actually be happy, not just look-to-me like they’re happy).
How can both of those intuitions seem true simultaneously? How can the inputs to my values-function be the actual state of the world, but also high-level objects which may not even exist? What things in the low-level physical world are those “high-level objects” pointing to?
If I want to talk about “actually satisfying my values” separate from my own estimate of my values, then I need some way to say what the values-relevant pieces of my world model are “pointing to” in the real world.
I think this problem—the “pointers to values” problem, and the “pointers” problem more generally—is the primary conceptual barrier to alignment right now. This includes alignment of both “principled” and “prosaic” AI. The one major exception is pure human-mimicking AI, which suffers from a mostly-unrelated set of problems (largely stemming from the shortcomings of humans, especially groups of humans).
I have yet to see this problem explained, by itself, in a way that I’m satisfied by. I’m stealing the name from some of Abram’s posts, and I think he’s pointing to the same thing I am, but I’m not 100% sure.
The goal of this post is to demonstrate what the problem looks like for a (relatively) simple Bayesian-utility-maximizing agent, and what challenges it leads to. This has the drawback of defining things only within one particular model, but the advantage of showing how a bunch of nominally-different failure modes all follow from the same root problem: utility is a function of latent variables. We’ll look at some specific alignment strategies, and see how and why they fail in this simple model.
One thing I hope people will take away from this: it’s not the “values” part that’s conceptually difficult, it’s the “pointers” part.
The Setup
We have a Bayesian expected-utility-maximizing agent, as a theoretical stand-in for a human. The agent’s world-model is a causal DAG over variables X, and it chooses actions A to maximize E[u(X)|do(A)] - i.e. it’s using standard causal decision theory. We will assume the agent has a full-blown Cartesian boundary, so we don’t need to worry about embeddedness and all that. In short, this is a textbook-standard causal-reasoning agent.
One catch: the agent’s world-model uses the sorts of tricks in Writing Causal Models Like We Write Programs, so the world-model can represent a very large world without ever explicitly evaluating probabilities of every variable in the world-model. Submodels are expanded lazily when they’re needed. You can still conceptually think of this as a standard causal DAG, it’s just that the model is lazily evaluated.
In particular, thinking of this agent as a human, this means that our human can value the happiness of someone they’ve never met, never thought about, and don’t know exists. The utility u can be a function of variables which the agent will never compute, because the agent never needs to fully compute u in order to maximize it—it just needs to know how u changes as a function of the variables influenced by its actions.
Key assumption: most of the variables in the agent’s world-model are not observables. Drawing the analogy to humans: most of the things in our world-models are not raw photon counts in our eyes or raw vibration frequencies/intensities in our ears. Our world-models include things like trees and rocks and cars, objects whose existence and properties are inferred from the raw sense data. Even lower-level objects, like atoms and molecules, are latent variables; the raw data from our eyes and ears does not include the exact positions of atoms in a tree. The raw sense data itself is not sufficient to fully determine the values of the latent variables, in general; even a perfect Bayesian reasoner cannot deduce the true position of every atom in a tree from a video feed.
Now, the basic problem: our agent’s utility function is mostly a function of latent variables. Human values are mostly a function of rocks and trees and cars and other humans and the like, not the raw photon counts hitting our eyeballs. Human values are over inferred variables, not over sense data.
Furthermore, human values are over the “true” values of the latents, not our estimates—e.g. I want other people to actually be happy, not just to look-to-me like they’re happy. Ultimately, E[u(X)] is the agent’s estimate of its own utility (thus the expectation), and the agent may not ever know the “true” value of its own utility—i.e. I may prefer that someone who went missing ten years ago lives out a happy life, but I may never find out whether that happened. On the other hand, it’s not clear that there’s a meaningful sense in which any “true” utility-value exists at all, since the agent’s latents may not correspond to anything physical—e.g. a human may value the happiness of ghosts, which is tricky if ghosts don’t exist in the real world.
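To make this concrete, here is a minimal toy sketch in Python (all variable names and numbers are made up for illustration, not taken from the post): utility depends on a latent “friend is happy” variable, but the agent only ever sees a noisy observable, and can only act on its posterior estimate.

```python
# Toy model: the latent "happy" causes the observable "smiles" (noisily).
# Utility is a function of the LATENT; the agent only ever computes an estimate.
P_HAPPY = 0.7                              # prior P(happy)
P_SMILE = {True: 0.9, False: 0.4}          # P(smiles | happy)

def posterior_happy(smiles: bool) -> float:
    """Bayesian update P(happy | smiles) from the observable alone."""
    like_h = P_SMILE[True] if smiles else 1 - P_SMILE[True]
    like_n = P_SMILE[False] if smiles else 1 - P_SMILE[False]
    return like_h * P_HAPPY / (like_h * P_HAPPY + like_n * (1 - P_HAPPY))

def utility(happy: bool, gave_gift: bool) -> float:
    # u(X): depends on the latent, minus a small cost for the action
    return (10.0 if happy else 0.0) - (1.0 if gave_gift else 0.0)

def expected_utility(gave_gift: bool, smiles: bool) -> float:
    # E[u(X) | do(A), observations] -- the only thing the agent can compute
    p = posterior_happy(smiles)
    if gave_gift:                          # hypothetical assumption: the gift helps a bit
        p = min(1.0, p + 0.15)
    return p * utility(True, gave_gift) + (1 - p) * utility(False, gave_gift)

best_action = max([False, True], key=lambda a: expected_utility(a, smiles=True))
print("give gift?", best_action)           # chosen from the *estimate*, never the true u
```

The agent never learns whether the friend was actually happy; the “true” u(X) is never computed at any point in the decision.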
On top of all that, some of those variables are implicit in the model’s lazy data structure and the agent will never think about them at all. I can value the happiness of people I do not know and will never encounter or even think about.
So, if an AI is to help optimize for u(X), then it’s optimizing for something which is a function of latent variables in the agent’s model. Those latent variables:
May not correspond to any particular variables in the AI’s world-model and/or the physical world
May not be estimated by the agent at all (because lazy evaluation)
May not be determined by the agent’s observed data
… and of course the agent’s model might just not be very good, in terms of predictive power.
As usual, neither we (the system’s designers) nor the AI will have direct access to the model; we/it will only see the agent’s behavior (i.e. input/output) and possibly a low-level system in which the agent is embedded. The agent itself may have some introspective access, but not full or perfectly reliable introspection.
Despite all that, we want to optimize for the agent’s utility, not just the agent’s estimate of its utility. Otherwise we run into wireheading-like problems, problems with the agent’s world model having poor predictive power, etc. But the agent’s utility is a function of latents which may not be well-defined at all outside the context of the agent’s estimator (a.k.a. world-model). How can we optimize for the agent’s “true” utility, not just an estimate, when the agent’s utility function is defined as a function of latents which may not correspond to anything outside of the agent’s estimator?
The Pointers Problem
We can now define the pointers problem—not only “pointers to values”, but the problem of pointers more generally. The problem: what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model? And what does that “correspondence” even mean—how do we turn it into an objective for the AI, or some other concrete thing outside the agent’s own head?
Why call this the “pointers” problem? Well, let’s take the agent’s perspective, and think about what its algorithm feels like from the inside. From inside the agent’s mind, it doesn’t feel like those latent variables are latent variables in a model. It feels like those latent variables are real things out in the world which the agent can learn about. The latent variables feel like “pointers” to real-world objects and their properties. But what are the referents of these pointers? What are the real-world things (if any) to which they’re pointing? That’s the pointers problem.
Is it even solvable? Definitely not always—there probably is no real-world referent for e.g. the human concept of a ghost. Similarly, I can have a concept of a perpetual motion machine, despite the likely-impossibility of any such thing existing. Between abstraction and lazy evaluation, latent variables in an agent’s world-model may not correspond to anything in the world.
That said, it sure seems like at least some latent variables do correspond to structures in the world. The concept of “tree” points to a pattern which occurs in many places on Earth. Even an alien or AI with radically different world-model could recognize that repeating pattern, realize that examining one tree probably yields information about other trees, etc. The pattern has predictive power, and predictive power is not just a figment of the agent’s world-model.
So we’d like to know both (a) when a latent variable corresponds to something in the world (or another world model) at all, and (b) what it corresponds to. We’d like to solve this in a way which (probably among other use-cases) lets the AI treat the things-corresponding-to-latents as the inputs to the utility function it’s supposed to learn and optimize.
To the extent that human values are a function of latent variables in humans’ world-models, this seems like a necessary step not only for an AI to learn human values, but even just to define what it means for an AI to learn human values. What does it mean to “learn” a function of some other agent’s latent variables, without necessarily adopting that agent’s world-model? If the AI doesn’t have some notion of what the other agent’s latent variables even “are”, then it’s not meaningful to learn a function of those variables. It would be like an AI “learning” to imitate grep, but without having any access to string or text data, and without the AI itself having any interface which would accept strings or text.
Pointer-Related Maladies
Let’s look at some example symptoms which can arise from failure to solve specific aspects of the pointers problem.
Genocide Under-The-Radar
Let’s go back to the opening example: an AI shows us pictures from different possible worlds and asks us to rank them. The AI doesn’t really understand yet what things we care about, so it doesn’t intentionally draw our attention to certain things a human might consider relevant—like mass graves. Maybe we see a few mass-grave pictures from some possible worlds (probably in pictures from news sources, since that’s how such information mostly spreads), and we rank those low, but there are many other worlds where we just don’t notice the problem from the pictures the AI shows us. In the end, the AI decides that we mostly care about avoiding worlds where mass graves appear in the news—i.e. we prefer that mass killings stay under the radar.
How does this failure fit in our utility-function-of-latents picture?
This is mainly a failure to distinguish between the agent’s estimate E[u] of its own utility, and the “real” value u of the agent’s utility (insofar as such a thing exists). The AI optimizes for our estimate, but does not give us enough data to very accurately estimate our utility in each world—indeed, it’s unlikely that a human could even handle that much information. So, it ends up optimizing for factors which bias our estimate—e.g. the availability of information about bad things.
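A toy simulation of this failure mode (purely illustrative; the worlds, numbers, and sampling scheme are invented): the AI picks whichever world the human’s snapshot-based estimate ranks higher, and the world where the bad event is better hidden wins despite having lower true utility.

```python
import random
random.seed(0)

# Two hypothetical worlds: each has a true utility and a pool of "pictures".
# World B contains the bad event, but it almost never shows up in pictures.
worlds = {
    "A": {"true_u": 5.0,  "pictures": [1.0] * 95 + [-10.0] * 5},
    "B": {"true_u": -5.0, "pictures": [1.0] * 99 + [-10.0] * 1},
}

def human_estimate(world, n_shown=10):
    """The human's estimate of u, formed from a handful of AI-chosen snapshots."""
    return sum(random.sample(world["pictures"], n_shown)) / n_shown

def avg_estimate(world, trials=2000):
    # average over many showings, so the comparison isn't just sampling noise
    return sum(human_estimate(world) for _ in range(trials)) / trials

# The AI optimizes the human's *estimate*, not the true utility.
choice = max(worlds, key=lambda w: avg_estimate(worlds[w]))
print(choice, {w: worlds[w]["true_u"] for w in worlds})   # picks "B" despite true_u = -5
```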
Note that this intuitive explanation assumes a solution to the pointers problem: it only makes sense to the extent that there’s a “real” value of u from which the “estimate” E[u] can diverge.
Not-So-Easy Wireheading Problems
The under-the-radar genocide problem looks roughly like a typical wireheading problem, so we should try a roughly-typical wireheading solution: rather than the AI showing world-pictures, it should just tell us what actions it could take, and ask us to rank actions directly.
If we were ideal Bayesian reasoners with accurate world models and infinite compute, and knew exactly where the AI’s actions fit in our world model, then this might work. Unfortunately, the failure of any of those assumptions breaks the approach:
We don’t have the processing power to predict all the impacts of the AI’s actions
Our world models may not be accurate enough to correctly predict the impact of the AI’s actions, even if we had enough processing power
The AI’s actions may not even fit neatly into our world model—e.g. even the idea of genetic engineering might not fit the world-model of premodern human thinkers
Mathematically, we’re trying to optimize E[u(X)|do(AI action)], i.e. optimize expected utility given the AI’s actions. Note that this is necessarily an expectation under the human’s model, since that’s the only context in which u is well-defined. In order for that to work out well, we need to be able to fully evaluate that estimate (sufficient processing power), we need the estimate to be accurate (sufficient predictive power), and we need the AI’s action to be defined within the model in the first place.
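As a trivial illustration of that last requirement (hypothetical variables and values, not from the post): an expected-utility query under the human’s model is simply undefined for interventions on variables the model doesn’t contain.

```python
# Expected utility under the *human's* model, for interventions the model knows about.
# (Hypothetical variables and made-up values, purely to illustrate the point.)
HUMAN_MODEL_EU = {"do(eat_well)": 3.0, "do(exercise)": 5.0}

def expected_utility_given(intervention: str) -> float:
    if intervention not in HUMAN_MODEL_EU:
        raise KeyError(f"{intervention!r} is not defined in the human's world-model")
    return HUMAN_MODEL_EU[intervention]

print(expected_utility_given("do(exercise)"))       # fine: 5.0
try:
    print(expected_utility_given("do(gene_edit)"))  # the AI's action isn't in the model at all
except KeyError as e:
    print("undefined:", e)
```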
The question of whether our world-models are sufficiently accurate is particularly hairy here, since accuracy is usually only defined in terms of how well we estimate our sense-data. But the accuracy we care about here is how well we “estimate” the values of the latent variables X and the utility u. What does that even mean, when the latent variables may not correspond to anything in the world?
People I Will Never Meet
“Human values cannot be determined from human behavior” seems almost old-hat at this point, but it’s worth taking a moment to highlight just how underdetermined values are from behavior. It’s not just that humans have biases of one kind or another, or that revealed preferences diverge from stated preferences. Even in our perfect Bayesian utility-maximizer, utility is severely underdetermined from behavior, because the agent does not have perfect estimates of its latent variables. Behavior depends only on the agent’s estimate, so it cannot account for “error” in the agent’s estimates of latent variable values, nor can it tell us about how the agent values variables which are not coupled to its own choices.
The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us about how much I value such peoples’ happiness. And yet, I do value it.
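A minimal sketch of this underdetermination (hypothetical utility functions, not the post’s formalism): two agents who weigh “a stranger I’ll never affect” completely differently still behave identically, because behavior only depends on the variables their actions influence.

```python
# Two utility functions over (friend_happy, stranger_happy). They agree on the
# variable my actions influence and disagree only on the one they never touch.
def u_cares(friend: float, stranger: float) -> float:
    return friend + stranger

def u_indifferent(friend: float, stranger: float) -> float:
    return friend            # places zero value on the stranger

def best_action(u) -> str:
    p_stranger_happy = 0.5   # fixed belief; no action of mine changes it
    def expected_u(action: str) -> float:
        friend = 1.0 if action == "be_kind" else 0.0
        return (p_stranger_happy * u(friend, 1.0)
                + (1 - p_stranger_happy) * u(friend, 0.0))
    return max(["be_kind", "be_rude"], key=expected_u)

print(best_action(u_cares), best_action(u_indifferent))   # identical behavior either way
```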
“Misspecified” Models
In Latent Variables and Model Misspecification, jsteinhardt talks about “misspecification” of latent variables in the AI’s model. His argument is that things like the “value function” are latent variables in the AI’s world-model, and are therefore potentially very sensitive to misspecification of the AI’s model.
In fact, I think the problem is more severe than that.
The value function’s inputs are latent variables in the human’s model, and are therefore sensitive to misspecification in the human’s model. If the human’s model does not match reality well, then their latent variables will be something wonky and not correspond to anything in the world. And AI designers do not get to pick the human’s model. These wonky variables, not corresponding to anything in the world, are a baked-in part of the problem, unavoidable even in principle. Even if the AI’s world model were “perfectly specified”, it would either be a bad representation of the world (in which case predictive power becomes an issue) or a bad representation of the human’s model (in which case those wonky latents aren’t defined).
The AI can’t model the world well with the human’s model, but the latents on which human values depend aren’t well-defined outside the human’s model. Rock and a hard place.
Takeaway
Within the context of a Bayesian utility-maximizer (representing a human), utility/values are a function of latent variables in the agent’s model. That’s a problem, because those latent variables do not necessarily correspond to anything in the environment, and even when they do, we don’t have a good way to say what they correspond to.
So, an AI trying to help the agent is stuck: if the AI uses the human’s world-model, then it may just be wrong outright (in predictive terms). But if the AI doesn’t use the human’s world-model, then the latents on which the utility function depends may not be defined at all.
Thus, the pointers problem, in the Bayesian context: figure out which things in the world (if any) correspond to the latent variables in a model. What do latent variables in my model “point to” in the real world?
This post states a subproblem of AI alignment which the author calls “the pointers problem”. The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user’s model to the AI’s model? This is very similar to the “ontological crisis” problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI.
The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn’t do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers.
The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn’t have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior.
(What follows is no longer a “review” per se, inasmuch as a summary of my own thoughts on the topic.)
Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables.
Fix A a set of actions and O a set of observations. We start with an ontological model which is a crisp infra-POMDP. That is, there is a set of states S_ont, an initial state s_0^ont ∈ S_ont, a transition infra-kernel T_ont : S_ont × A → □(S_ont × O) and a reward function r : S_ont → ℝ. Here, □X stands for closed convex sets of probability distributions on X. In other words, this is a POMDP with an underspecified transition kernel.
We then build a prior which consists of refinements of the ontological model. That is, each hypothesis in the prior is an infra-POMDP with state space S, initial state s_0 ∈ S, transition infra-kernel T : S × A → □(S × O) and an interpretation mapping ι : S → S_ont which is a morphism of infra-POMDPs (i.e. ι(s_0) = s_0^ont and the obvious diagram of transition infra-kernels commutes). The reward function on S is just the composition r∘ι. Notice that while the ontological model must be an infra-POMDP to get a non-degenerate learning agent (moreover, it can be desirable to make it non-dogmatic about observables in some formal sense), the hypotheses in the prior can also be ordinary (Bayesian) POMDPs.
Given such a prior plus a time discount function, we can consider the corresponding infra-Bayesian agent (or even just a Bayesian agent, if we chose all hypotheses to be Bayesian). Such an agent optimizes rewards which depend on latent variables, even though it does not know the correct world-model in advance. It does fit the world to the immutable ontological model (which is necessary to make sense of the latent variables to which the reward function refers), but the ontological model has enough freedom to accommodate many possible worlds.
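A rough type-level sketch of the construction above (a schematic rendering only, not part of the formalism; the credal sets □X are represented abstractly as finite sets of distributions):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Generic, Mapping, TypeVar

SOnt = TypeVar("SOnt")   # ontological states S_ont
S = TypeVar("S")         # hypothesis states S
A = TypeVar("A")         # actions
O = TypeVar("O")         # observations

# □X sketched as a finite set of distributions over X (each a mapping X -> prob).
CredalSet = FrozenSet[Mapping]

@dataclass
class OntologicalModel(Generic[SOnt, A, O]):
    init: SOnt                                    # s_0^ont
    transition: Callable[[SOnt, A], CredalSet]    # T_ont : S_ont x A -> □(S_ont x O)
    reward: Callable[[SOnt], float]               # r : S_ont -> R

@dataclass
class Hypothesis(Generic[S, SOnt, A, O]):
    init: S                                       # s_0
    transition: Callable[[S, A], CredalSet]       # T : S x A -> □(S x O)
    interpret: Callable[[S], SOnt]                # ι : S -> S_ont (a morphism of infra-POMDPs)

    def reward(self, ont: OntologicalModel, s: S) -> float:
        return ont.reward(self.interpret(s))      # reward on S is the composition r ∘ ι
```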
The next question is then how we would transfer such a utility function from the user to the AI. Here, as noted by Demski, we want the AI to use not just the user’s utility function but also the user’s prior, because we want running such an AI to be rational from the subjective perspective of the user. This creates a puzzle: if the AI is using the same prior, and the user behaves nearly-optimally for their own prior (since otherwise how would we even infer the utility function and prior), how can the AI outperform the user?
The answer, I think, is via the AI having different action/observation channels from the user. At first glance this might seem unsatisfactory: we expect the AI to be “smarter”, not just to have better peripherals. However, using Turing RL we can represent the former as a special case of the latter. Specifically, part of the additional peripherals is access to a programmable computer, which effectively gives the AI a richer hypothesis space than the user.
The formalism I outlined here leaves many questions, for example what kind of learning guarantees to expect in the face of possible ambiguities between observationally indistinguishable hypotheses[1]. Nevertheless, I think it creates a convenient framework for studying the question raised in the post. A potential different approach is using infra-Bayesian physicalism, which also describes agents with utility functions that depend on latent variables. However, it is unclear whether it’s reasonable to apply the latter to humans.
[1] See also my article “RL with imperceptible rewards”.
Why This Post Is Interesting
This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol’ Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it’s clear what the problem is, it’s clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.
Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.
Warning: Inductive Gap
This post builds on top of two important pieces for modelling embedded agents which don’t have their own posts (to my knowledge). The pieces are:
Lazy world models
Lazy utility functions (or value functions more generally)
In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.
Lazy World Models
One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They’re embedded in the world, therefore the world must be as big as the entire agent plus whatever environment the world includes outside of the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up.
That sounds tricky at first, but if you’ve done some functional programming before, then data structures like this are actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily—i.e. only query for list elements which we actually need, and never actually iterate over the whole thing.
In the same way, we can represent a large world (potentially even an infinite world) using a smaller amount of memory. We specify the model via a generator, and then evaluate queries against the model lazily. If we’re thinking in terms of probabilistic models, then our generator could be e.g. a function in a probabilistic programming language, or (equivalently but through a more mathematical lens) a probabilistic causal model leveraging recursion. The generator compactly specifies a model containing many random variables (potentially even infinitely many), but we never actually run inference on the full infinite set of variables. Instead, we use lazy algorithms which only reason about the variables necessary for particular queries.
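A minimal sketch of the idea in Python (toy dynamics, purely illustrative): the generator compactly specifies an unbounded family of variables, and we only ever materialize the ones a query needs.

```python
import itertools
import random

def world_model(seed=0):
    """Lazily generate an unbounded chain of variables X_0, X_1, ... where each
    X_i depends causally on X_{i-1}. The full model is never built in memory."""
    rng = random.Random(seed)
    x = 0.0
    for i in itertools.count():
        x = x + rng.gauss(0, 1)      # X_i = X_{i-1} + noise
        yield (f"X{i}", x)

# Query only the variables we actually need; the rest stay unevaluated.
first_ten = dict(itertools.islice(world_model(), 10))
print(first_ten["X3"])
```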
Once we know to look for it, it’s clear that humans use some kind of lazy world models in our own reasoning. We never directly estimate the state of the entire world. Rather, when we have a question, we think about whatever “variables” are relevant to that question. We perform inference using whatever “generator” we already have stored in our heads, and we avoid recursively unpacking any variables which aren’t relevant to the question at hand.
Lazy Utility/Values
Building on the notion of lazy world models: it’s not very helpful to have a lazy world model if we need to evaluate the whole data structure in order to make a decision. Fortunately, even if our utility/values depend on lots of things, we don’t actually need to evaluate utility/values in order to make a decision. We just need to compare the utility/value across different possible choices.
In practice, most decisions we make don’t impact most of the world in significant predictable ways. (More precisely: the impact of most of our decisions on most of the world is wiped out by noise.) So, rather than fully estimating utility/value we just calculate how each choice changes total utility/value, based only on the variables significantly and predictably influenced by the decision.
A simple example (from here): if we have a utility function ∑_i f(X_i), and we’re making a decision which only affects X_3, then we don’t need to estimate the sum at all; we only need to estimate f(X_3) for each option.
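In code (a toy f and toy options, just to show the shape of the shortcut):

```python
# Utility u(X) = sum_i f(X_i) over a huge (conceptually unbounded) set of variables.
# A decision that only changes X_3 can be made by comparing f(X_3) across options;
# the rest of the sum is a constant and never needs to be evaluated.
def f(x):
    return -(x - 1.0) ** 2        # toy per-variable term

options_for_x3 = [0.2, 0.7, 1.4]  # candidate values our decision could give X_3

best = max(options_for_x3, key=f) # compare only the affected term
print(best)                       # 0.7 -- closest to the optimum of f
```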
Again, once we know to look for it, it’s clear that humans do something like this. Most of my actions do not affect a random person in Mumbai (and to the extent there is an effect, it’s drowned out by noise). Even though I value the happiness of that random person in Mumbai, I never need to think about them, because my actions don’t significantly impact them in any way I can predict. I never actually try to estimate “how good the whole world is” according to my own values.
Where This Post Came From
In the second half of 2020, I was thinking about existing real-world analogues/instances of various parts of the AI alignment problem and embedded agency, in hopes of finding a case where someone already had a useful frame or even solution which could be translated over to AI. “Theory of the firm” (a subfield of economics) was one promising area: it asks why firms exist at all, where their boundaries lie, and how they are organized internally.
To the extent that we can think of companies as embedded agents, these questions mirror a lot of the general questions of embedded agency. Also, alignment of incentives is a major focus in the literature on the topic.
Most of the existing literature I read was not very useful in its own right. But I generally tried to abstract out the most central ideas and bottlenecks, and generalize them enough to apply to more general problems. The most important insight to come out of this process was: sometimes we cannot tell what happened, even in hindsight. This is a major problem for incentives: for instance, if we can’t tell even in hindsight who made a mistake, then we don’t know where to assign credit/blame. (This idea became the post When Hindsight Isn’t 20/20: Incentive Design With Imperfect Credit Allocation.)
Similarly, this is a major problem for bets: we can’t bet on something if we cannot tell what the outcome was, even in hindsight.
Following that thread further: sometimes we cannot tell how good an outcome was, even in hindsight. For instance, we could imagine paying someone to etch our names on a plaque on a spacecraft and then launch it on a trajectory out of the solar system. In this case, we would presumably care a lot that our names were actually etched on the plaque; we would be quite unhappy if it turned out that our names were left off. Yet if someone took off the plaque at the last minute, or left our names off of it, we might never find out. In other words, we might not ever know, even in hindsight, whether our values were actually satisfied.
There’s a sense in which this is obvious mathematically from Bayesian expected utility maximization. The “expected” part of “expected utility” sure does suggest that we don’t know the actual utility. Usually we think of utility as something we will know later, but really there’s no reason to assume that. The math does not say we need to be able to figure out utility in hindsight. The inputs to utility are random variables in our world model, and we may not ever know the values of those random variables.
Once I started actually paying attention to the idea that the inputs to the utility function are random variables in the agent’s world model, and that we may never know the values of those variables, the next step followed naturally. Of course those variables may not correspond to anything observable in the physical world, even in principle. Of course they could be latent variables. Then the connection to the Pointers Problem became clear.
It seems like “generators” should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.
First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a “city” abstraction, let’s call it P(Λ), which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically, let’s call their sum F. Then my probability distribution over Berlin’s structure is just P(X_Berlin) = P(Λ | F).
Alternatively, suppose I want to model the low-level dynamics of some object I have an abstract representation for. In this case, suppose it’s the business scene of Berlin. I condition my abstraction of a business P(B) on everything I know about Berlin, P(B | X_Berlin), then sample from the resulting distribution several times until I get a “representative set”. Then I model its behavior directly.
This doesn’t seem quite right, though.