Seeking Collaborators
I’ve been accepted as a mentor for the next AI Safety Camp. You can apply to work with me on the tiling problem. The goal will be to develop reflectively consistent UDT-inspired decision theories, and try to prove tiling theorems for them.
The deadline for applicants is November 17.
The program will run from January 11 to April 27. It asks for a 10 hour/week commitment.
I am not being funded for this.[1] You can support my work on Patreon.
My project description follows:
Summary
The Tiling Agents problem (aka reflective consistency) consists of analysing when one agent (the “predecessor”) will choose to deliberately modify another agent (the “successor”). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties “tiles” if those properties, when present in both predecessor and successor, guarantee that any self-modifications will avoid changing those properties.
You can think of this as the question of when agents will preserve certain desirable properties (such as safety-relevant properties) when given the opportunity to self-modify. Another way to think about it is the slightly broader question: when can one intelligence trust another? The bottleneck for avoiding harmful self-modifications is self-trust; so getting tiling results is mainly a matter of finding conditions for trust.
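As a toy illustration of that condition (a sketch I've invented for this summary, not a formalism from the tiling literature; all names are made up), we can model a "property" as a predicate on agents, with the predecessor adopting a candidate successor only if the property verifiably transfers:

```python
# Toy tiling sketch: a "property" is a predicate on agents, and a
# predecessor only adopts a successor it can verify (here: by testing)
# satisfies the same safety-relevant property.

def safe(agent):
    """Illustrative safety property: the agent never outputs 'defect'."""
    return all(agent(s) != "defect" for s in range(100))

def cautious_agent(situation):
    return "cooperate"

def make_successor():
    """A candidate self-modification (invented for illustration)."""
    return lambda situation: "cooperate" if situation % 2 == 0 else "wait"

def self_modify(predecessor, candidate, prop):
    # Tiling condition: adopt the candidate only if the property holds
    # of both predecessor and successor; otherwise keep the old agent.
    if prop(predecessor) and prop(candidate):
        return candidate
    return predecessor

successor = self_modify(cautious_agent, make_successor(), safe)
assert safe(successor)
```

The hard part, of course, is replacing the brute-force check with *reasoning*: a real successor's behavior cannot be exhaustively tested, which is exactly where Vingean Reflection (below) enters.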
The search for tiling results has four main motivations:
AI-AI tiling, for the purpose of finding conditions under which AI systems will want to preserve safety-relevant properties.
Human-AI tiling, for the purpose of understanding when we can justifiably trust AI systems.
AI-Human tiling, as a model of corrigibility. (Will AIs choose to interfere with human decisions?)
Tiling as a consistency constraint on decision theories, for the purpose of studying rationality.
The first three application areas have a large overlap, and all four motivations seem important.
The non-summary
Motivation
In the big picture, tiling seems like perhaps the single most well-motivated approach to theoretical AI safety: it allows us to directly formally address the question of how humans can justifiably trust AIs. However, extremely few people are directly working on this approach. Indeed, out of all the people who have worked on tiling in the past, the only person I’m currently aware of who continues to work on this is myself.
I think part of this is about timelines. Tiling results remain very theoretical. I am optimistic that significant improvements to tiling results are possible with some focused attention, but I think there is a long way to go before tiling results will be realistic enough to offer concrete advice which helps with the real problems we face. (At least, a long way to go to yield “non-obvious” advice.)
However, it still seems to me like more people should be working on this approach overall. There are many unpublished results and ideas, so my aim with this project is to get some people up to speed and in a position to make progress.
Tiling Overview
The basic idea of tiling is that an agent architecture is “no good” in some sense if we can show that an agent designed according to that architecture would self-modify to something which does not match the architecture, given the choice. This is a highly fruitful sort of coherence requirement to impose, in the sense that it rules out a lot of proposals, including most existing decision theories.
One motivation for this criterion is for building robust AI systems: if an AI system would remove its safety constraints given the chance, then we are effectively in a fight with it (we have to avoid giving it a chance). Hence, even if we don’t plan to allow the AI to self-modify, it seems wise to build safety precautions in a way which we can show to be self-preserving. This perspective on tiling focuses on AI-AI tiling.
A broader motivation is to study the structure of trust. AI alignment is, from this perspective, the study of how to build trustworthy systems. Tiling is the study of how one mindlike thing can trust another mindlike thing. If we can make tiling theorems sufficiently realistic, then we can derive principles of trust which can provide guidance about how to create trustworthy systems in the real world. This perspective focuses on human-AI tiling.
A third motivation imposes tiling as a constraint on decision theories, as a way of trying to understand rationality better, in the hopes of subsequently using the resulting decision theory to guide our decision-making (eg, with respect to AI risk). Tiling appears to be a rather severe requirement for decision theories, so (provided one is convinced that tiling is an important consistency requirement) it weeds out a lot of bad answers that might otherwise seem good. Novel decision-theoretic insights might point to crucial considerations which would otherwise have been missed.
Reflective Oracles
The first big obstacle to a theorem showing when Bayesian-style agents can be reflectively stable was the difficulty of Bayesian agents representing themselves at all. Tiling was first studied in the setting of formal logic, because logic has thoroughly researched tools for self-reference. Probability theory lacked such tools.
MIRI research carried out by Jessica Taylor, Paul Christiano, and Benja Fallenstein gave rise to the necessary tools to represent Bayesian agents capable of having hypotheses about themselves and other agents similar to themselves: Reflective Oracles (and very similarly, Paul’s reflective probability distribution).
However, these conceptual tools solve the problem by specifying a computational complexity class in which a sort of “prophecy” is possible: computations can look ahead to the output of any computation, including themselves. We don’t live in a universe in which this flavor of computation appears feasible.
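A minimal sketch of why this flavor of "prophecy" is so demanding: any *deterministic* oracle can be diagonalized against by a program that asks the oracle about itself and then does the opposite (this is the standard diagonalization construction, and part of why Reflective Oracles answer probabilistic queries rather than deterministic ones; the code names are invented):

```python
# Diagonalization against a deterministic "prophecy" oracle: a program
# that queries the oracle about its own output, then contradicts it.

def make_diagonal(oracle):
    """Build a program that does the opposite of whatever the oracle predicts."""
    def diagonal():
        prediction = oracle(diagonal)  # oracle's claimed output of this program
        return 1 - prediction          # deliberately contradict the prediction
    return diagonal

# Whatever fixed answer the oracle gives, it is wrong about this program:
for guess in (0, 1):
    oracle = lambda prog, g=guess: g
    diag = make_diagonal(oracle)
    assert diag() != oracle(diag)
```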
As a consequence, tiling results based on Reflective Oracles would not solve the “Vingean Reflection” problem: they do not show us how it is possible to trust another agent without reasoning about that agent’s future actions in detail. It is easy to trust someone whose every action you can anticipate with precision; unfortunately, that level of foresight is not at all realistic.
Vingean Reflection is a crucial desideratum to keep in mind when looking for tiling results.
Logical Uncertainty
We call the more realistic domain, where an agent does not have the ability to run arbitrary computations to completion, “logical uncertainty” (although in hindsight, “computational uncertainty” would have been a clearer name).
MIRI’s Logical Induction (primarily due to Scott Garrabrant) gives us a mathematical framework which slightly generalizes Bayesianism to address logical uncertainty in a more suitable way. This gives rise to a “bounded rationality” perspective, where agents are not perfectly rational, but avoid a given class of easily recognizable inconsistencies.
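To gesture at the market metaphor behind this (a drastic simplification of the actual formalism; the prices, update rule, and convergence rate here are all invented for illustration): sentences have prices in [0, 1], and a "trader" profits from any easily recognizable inconsistency, pushing prices toward coherence:

```python
# Toy market sketch: the prices of "A" and "not A" should sum to 1.
# A trader arbitrages the gap, and repeated trading drives it to zero.

prices = {"A": 0.7, "not A": 0.5}  # inconsistent: prices sum to 1.2

def arbitrage_step(prices, rate=0.5):
    gap = prices["A"] + prices["not A"] - 1.0
    # The trader sells the overpriced pair; both prices adjust downward.
    prices["A"] -= rate * gap / 2
    prices["not A"] -= rate * gap / 2
    return prices

for _ in range(50):
    arbitrage_step(prices)

# The inconsistency has been traded away:
assert abs(prices["A"] + prices["not A"] - 1.0) < 1e-6
```

The real framework generalizes this enormously (polynomial-time traders, all sentences of a logical language, a limit construction), but the bounded-rationality flavor is the same: not perfect coherence, just no exploitable incoherence of a recognizable kind.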
We have a limited tiling result for logical-induction-based decision theory (due to Sam). I hope to significantly improve upon this by building on research into logically uncertain updateless decision theory which Martin Soto and I collaborated on last summer (none of which has been published yet), and also by building on the idea of open-minded updatelessness.
This area contains the clearest open problems and the lowest-hanging fruit for new tiling theorems.
However, I can imagine this area of research getting entirely solved without yet providing significant insight into the human-AI tiling problem (that is, the AI safety problem). My intuition is that it primarily addresses AI-AI tiling, and specifically in the case where the “values” of the AI are entirely pinned down in a strong sense. Therefore, to derive significant insights about AI risks, it seems important to generalise tiling further, including more of the messiness of the real-world problems we face.
Value Uncertainty
Open-minded updatelessness allows us to align with an unknown prior, in the same way that regular updatelessness allows us to align with a known prior.
Specifying agents who are aligned with unknown values is the subject of value learning & Stuart Russell’s alignment program, which focuses on assistance games (CIRL).
If we combine the two, we get a general notion of aligning with unknown preference structures. This gives us a highly general decision-theoretic concept which I hope to formally articulate and study over the next year. This is effectively a new type of uncertainty. You could call it meta-uncertainty, although that’s probably a poor name choice since it could point to so many other things. Perhaps “open-minded uncertainty”?
In particular, with traditional uncertainty, we can model an AI which is uncertain about human values and trying to learn them from humans; however, the humans themselves have to know their own values (as is assumed in CIRL). With open-minded uncertainty, I think there will be a much better picture of AIs aligning with humans who are themselves uncertain about their own values. My suspicion is that this will offer a clear solution to the corrigibility problem.
More generally, I think of this as a step on a trajectory toward stripping away traditional decision-theoretic assumptions and creating an “empty” tiling structure which can tile a broad variety of belief structures, value structures, decision theories, etc. If the AI just wants to do “what the humans would want in this situation” then it can conform to whatever policy humans “would want”, even if it is irrational by some definition.
Another important research thread here is how to integrate the insights of Quantilization; softly optimizing imperfect proxy values seems like a critical safety tool. Infrabayesianism appears to offer important theoretical insight into making quantilizer-like strategies tile.
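For readers unfamiliar with quantilization, the core move is small enough to show directly (a minimal sketch with an invented proxy utility): rather than taking the argmax of the proxy, sample from the top q-fraction of a base distribution over actions, which caps how hard the proxy can be Goodharted:

```python
import random

# Minimal quantilizer sketch: sample uniformly from the top q-fraction
# of actions as ranked by an (imperfect) proxy utility.

def quantilize(actions, proxy_utility, q=0.1, rng=random):
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return rng.choice(top)

actions = list(range(100))
proxy = lambda a: a  # illustrative proxy: bigger is "better"
choice = quantilize(actions, proxy, q=0.1)
assert 90 <= choice <= 99  # always lands in the top 10% by proxy value
```

The safety argument (informally) is that a quantilizer's expected harm is at most 1/q times that of the base distribution, so soft optimization bounds the damage an imperfect proxy can cause; the open question flagged above is making such strategies *tile*.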
Value Plurality
Here’s where we bring in the concerns of Critch’s negotiable reinforcement learning: an agent aligned to multiple stakeholders whose values may differ from each other. Several directions for moving beyond Critch’s result present themselves:
Making it into a proper ‘tiling’ result by using tools from previous subsections; Critch only shows Pareto-optimality, but we would like to show that we can trust such a system in a deeper sense.
Combining tools outlined in previous subsections to analyze realistic stakeholders who don’t fully know what they want or what they believe, and who are themselves boundedly rational.
Using concepts from bargaining theory and voting theory / social choice theory to go beyond Pareto-optimality and include notions of fairness. In particular, we would like to ensure that the outcome is not catastrophic with respect to the values of any of the stakeholders. We also need to care about how manipulable the reported values are under ‘strategic voting’, and how this impacts the outcome.
Is there an appealing formalism for multi-stakeholder Quantilization?
In terms of general approach: I want to take the formalism of logical induction and try to extend it to recognize the “traders” as potentially having their own values, rather than only beliefs. This resembles some of the ideas of shard theory.
Ontology Plurality
Critch’s negotiable RL formalism not only assumes that the beliefs and values of the stakeholders are known; it also assumes that the stakeholders and the AI agent all share a common ontology in which these beliefs and values can be described. To meaningfully address problems such as ontological crisis, we need to move beyond such assumptions, and model “where the ontology comes from” more deeply.
My ideas here are still rather murky and speculative. One thread involves extending the market metaphor behind logical induction, to model new stocks being introduced to the market. Another idea involves modeling the market structure using linear logic.
Cooperation & Coordination
“Value plurality” addresses the many-to-one alignment problem, IE, many stakeholders with one AI serving their interests. To mitigate risk scenarios associated with multipolar futures (futures containing multiple superhuman intelligences), we want to address the more general many-to-many case. This involves seeking some assurance that powerful AIs with differing values will cooperate with each other rather than engage in value-destroying conflict.
The basic picture of multi-agent rationality that we get from game theory appears very pessimistic in this regard. While cooperative equilibria are possible in some situations, those same situations usually support “folk theorems” which show that “almost any” equilibrium is possible, even really bad equilibria where no one gets what they want.
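The folk-theorem worry is visible already in the iterated Prisoner's Dilemma (a standard sketch with the usual payoffs; this only checks best responses against fixed opponents, not a full equilibrium analysis): mutual grim-trigger sustains cooperation, yet mutual always-defect is *also* stable, so the game theory alone does not select the good outcome:

```python
# Iterated Prisoner's Dilemma with standard payoffs. Both (grim, grim)
# and (defect, defect) are stable: neither player gains by deviating.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strat1, strat2, rounds=50):
    """Return player 1's total payoff over the iterated game."""
    h1, h2, total1 = [], [], 0
    for _ in range(rounds):
        a1, a2 = strat1(h2), strat2(h1)  # each sees the opponent's history
        total1 += PAYOFF[(a1, a2)][0]
        h1.append(a1); h2.append(a2)
    return total1

grim = lambda opp_history: "C" if "D" not in opp_history else "D"
defect = lambda opp_history: "D"

# Against grim, cooperating (playing grim yourself) beats defecting:
assert play(grim, grim) > play(defect, grim)
# Against always-defect, defecting is the best reply:
assert play(defect, defect) >= play(grim, defect)
```

Both stable outcomes satisfy the textbook rationality conditions; distinguishing the cooperative one from the catastrophic one is precisely what the research threads below aim at.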
Two research threads which interest me here are Nisan’s higher-order game theory and Payor’s Lemma. Nisan has some unpublished results about cooperation. My hope is that these results can also apply to more realistic formalisms by reinterpreting his hierarchy in terms of iterated policy selection. Meanwhile, Payor’s Lemma offers new hope for making phenomena similar to Löbian cooperation more robust.
[1] I’m still seeking funding, although I have some options I am considering.