Current thoughts on Paul Christiano’s research agenda

This post summarizes my thoughts on Paul Christiano’s agenda in general and ALBA in particular.


(note: at the time of writing, I am not employed at MIRI)

(in general, opinions expressed here are strong and weakly held)

AI alignment research as strategy

Roughly, AI alignment is the problem of using a system of humans and computers to do a good thing. AI alignment research will tend to look at the higher levels of what the system of humans and computers is doing. Thus, it is strategy for doing good things with humans and computers.

“Doing good things with humans and computers” is a broad class. Computing machines have been around a long time, and systems of humans involving rules computed by humans have been around much longer. Looking at AI alignment as strategy will bring in intuitions from domains like history, law, economics, and political philosophy. I think these intuitions are useful for bringing AI strategy into near mode.

Paul Christiano’s agenda as strategy

Paul Christiano’s agenda “goes for the throat” in ways that the other agendas, such as the agent foundations agenda, do not. Thus, it yields an actual strategy, rather than a list of research questions about strategy. I will now analyze Paul Christiano’s research agenda as strategy. Here I caricature Paul Christiano’s strategic assumptions:

  1. There are 2 phases to agentic activity: an influence-maximizing phase, and a value-fulfillment phase.

  2. Any strategy can be decomposed into (a) the expansion strategy (the strategy for the influence-maximizing phase) and (b) the payload (the strategy for the value-fulfillment phase), which can vary independently in principle.

  3. The power of an agent after the influence-maximizing phase is a function of its starting power and the effectiveness of its expansion strategy.

  4. An outcome is good if the vast majority of power after the influence-maximizing phase is owned by agents whose payload is good.

  5. Almost all powerful actors start with good values, i.e. they intend to select a good payload.

  6. Selecting a good payload is not hard (e.g. specifying a social arrangement that would reflect on and arrive at good moral conclusions).

  7. Powerful actors can’t coordinate with each other; they must expand independently.

  8. Powerful actors with bad values exist.

  9. Good powerful actors can anticipate and copy bad powerful actors’ expansion strategies while retaining their own (good) payloads.

These assumptions are a caricature of Paul’s assumptions in that, though they aren’t completely reflective of Paul’s actual views, they strongly state the background philosophy implied by Paul’s research agenda.

From these assumptions, it is possible to derive that a decent strategy for good actors is to anticipate bad actors’ expansion strategies and copy them while retaining their own good payloads. This corresponds to Paul’s general approach: look at a proposed system that would do a bad thing, then create an equally powerful system that would instead do a good thing.
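
To make the derivation concrete, here is a minimal toy model in Python. Everything in it (the Actor class, the particular functional forms, the 90% threshold) is my own illustrative assumption rather than anything from Paul’s writing; it just encodes assumptions 2–4 and 9 and shows why copying the expansion strategy while keeping a good payload preserves the balance of power.

```python
from dataclasses import dataclass

# Toy model of assumptions 2-4 and 9 (illustrative only; the functional
# forms and numbers below are arbitrary choices, not part of Paul's agenda).

@dataclass
class Actor:
    starting_power: float
    expansion_effectiveness: float  # strategy for the influence-maximizing phase
    payload_is_good: bool           # strategy for the value-fulfillment phase

def final_power(actor: Actor) -> float:
    # Assumption 3: power after the influence-maximizing phase is a function
    # of starting power and expansion effectiveness (here, a simple product).
    return actor.starting_power * actor.expansion_effectiveness

def outcome_is_good(actors: list, threshold: float = 0.9) -> bool:
    # Assumption 4: the outcome is good if the vast majority of final power
    # is held by actors with good payloads.
    total = sum(final_power(a) for a in actors)
    good_total = sum(final_power(a) for a in actors if a.payload_is_good)
    return good_total / total >= threshold

# Assumption 9 in action: good actors copy the best bad expansion strategy
# while keeping their own (good) payloads.
bad = Actor(starting_power=1.0, expansion_effectiveness=5.0, payload_is_good=False)
good = Actor(starting_power=9.0, expansion_effectiveness=bad.expansion_effectiveness,
             payload_is_good=True)

print(outcome_is_good([good, bad]))  # True: good actors end with 90% of the power
```

If the good actor instead kept a weaker expansion strategy (say effectiveness 1.0), its starting-power advantage alone would not carry the day (9/14 ≈ 64% of final power), which is the sense in which “anticipate and copy the expansion strategy, keep the good payload” falls out of the assumptions.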

Here are some thoughts on these assumptions:

  • I think the “expansion/value-fulfillment” split is broadly correct for highly-programmable, fully-automated systems and broadly incorrect for systems involving humans. For example, humans can be corrupted by power.

  • I think powerful actors can coordinate with each other. In some sense, altruism derives from coordination strategies, so altruists ought to be more able to cooperate than non-altruists.

  • I think that thinking of power as unidimensional is highly misleading. For example, it is possible to distinguish “amount of resources” from “knowledge about the world”. Those with more knowledge can be influential even without resources, and even without gaining many resources along the way. (Gaining resources can worsen epistemics by increasing others’ incentives to deceive the person holding the resources.)

  • I think most powerful interests today are not “good” in the sense of “would probably select a good payload”, mostly because most powerful interests do not have good epistemics about such abstract ideas. Thus, good actors will need to use asymmetric strategies, not just symmetric ones. (Due to our current state of philosophical confusion, it seems somewhat likely that no one is currently a good actor in this sense)

ALBA competes with adversaries who use only learning

Paul Christiano has acknowledged that ALBA only works for aligning systems that work through learning, and does not work for aligning systems that use hard-to-learn forms of cognition (e.g. search, logic). Roughly, ALBA will copy the part of a system’s strategy that is explainable by the system’s learning, i.e. that is explained by the system modeling the world as a set of nested feedback loops. Object recognition is fine, because there’s feedback for that; philosophy is not fine, because there isn’t feedback for that. (Humans can provide feedback about philosophy, but human philosophy is not fully explained by this feedback; e.g. the human giving the feedback needs their own strategy.)

The world isn’t a set of nested feedback loops. Agents don’t just learn, they reason. The philosophy of empiricism is more appropriate for an investor, who tries to pick up on signs of a successful project based on limited information and reasoning, than for an entrepreneur, who engages in object-level reasoning in order to actually run a successful project.
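
As a rough illustration of this distinction (my framing, not Paul’s), compare a process whose competence is entirely explained by an external feedback signal with one that reaches conclusions with no feedback at all, just by chaining what it already knows:

```python
# Two toy cognitive processes (purely illustrative; the tasks are arbitrary
# stand-ins for "object recognition" and for feedback-free reasoning).

# A learner: its competence is entirely explained by external feedback.
def learn_threshold(examples):
    """Fit a 1-D threshold classifier from (value, label) feedback pairs."""
    positives = [x for x, label in examples if label]
    negatives = [x for x, label in examples if not label]
    return (min(positives) + max(negatives)) / 2  # crude decision boundary

# A reasoner: it reaches new conclusions without any labeled feedback, by
# chaining rules it already has (a minimal forward-chaining deduction).
def forward_chain(facts, rules):
    """Apply 'if premise then conclusion' rules until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

print(learn_threshold([(0.2, False), (0.4, False), (0.7, True), (0.9, True)]))  # 0.55
print(forward_chain({"socrates_is_human"},
                    [("socrates_is_human", "socrates_is_mortal")]))
```

As I understand the argument above, ALBA can copy the first kind of competence, because the feedback signal is the explanation for the behavior; the second kind has no feedback signal to copy.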

To be more concrete, there are multiple ways humans could believe that some AI system would be quite powerful without this belief being justified solely on the basis that the AI system does learning:

  • Perhaps the AI system is organized very similarly to a human brain, and humans have reason to believe that it will behave similarly, even though they haven’t fully explained the brain’s intelligence as a form of learning.

  • Perhaps the AI system is produced through natural selection in a single large world that is difficult to split into smaller worlds without losing efficiency. (If the world can be split, then the different worlds could possibly be treated as “different hypotheses” productively)

  • Perhaps humans understand reasoning better and can create a system that reasons according to humans’ understanding of reasoning.

None of these are the kind of “ML technology” that ALBA uses, so they could not be analyzed well enough for ALBA to produce an aligned equivalent.

The last time I talked with Paul, he was aware of these problems. My recollection of his thoughts on these problems is:

  • He thinks that solving alignment for learning systems is more urgent and the correct first step for solving alignment in general.

  • He thinks that it might be possible to see processes such as evolution as a combination of selection and deliberation, and that this decomposition might yield enough understanding to produce aligned versions.

ALBA is not amenable to formal analysis

Some parts of ALBA can’t be formally analyzed very well with our current level of philosophical sophistication. For example:

  • Analyzing capability amplification requires modeling humans as computationally-bounded agents. There is not currently a model of agents under which a scheme like HCH would work, and even if there were, showing that humans are such agents would be quite a challenge. (A minimal sketch of the HCH recursion appears after this list.)

  • Analyzing informed oversight similarly requires modeling humans as computationally-bounded agents.

  • Analyzing red teams likewise requires modeling the capabilities of the red team versus the learning system.

  • In general, formally analyzing “power” would require formally analyzing decision theory and logical uncertainty.
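
To make the first bullet concrete, here is a minimal sketch of the HCH recursion with the human modeled as an opaque function. The toy_human and the depth cutoff are stand-ins I am assuming purely for illustration, not Paul’s formalism; the point is that any formal claim about what this scheme can do is a claim about the human function, i.e. it requires exactly the model of a computationally-bounded human that is currently missing.

```python
# Minimal sketch of the HCH recursion ("Humans Consulting HCH").
# `toy_human` and the depth cutoff are hypothetical stand-ins, not Paul's formalism.

from typing import Callable

def hch(question: str, human: Callable, depth: int) -> str:
    """The human answers `question`, optionally consulting copies of HCH
    (with one less unit of depth) on sub-questions."""
    if depth == 0:
        # A bounded human with no assistants answers alone.
        return human(question, lambda sub: "(no assistant available)")
    return human(question, lambda sub: hch(sub, human, depth - 1))

def toy_human(question: str, ask: Callable[[str], str]) -> str:
    # A caricature of a bounded human: handles short questions directly,
    # delegates longer ones by splitting them in half.
    if len(question) <= 8:
        return f"answer({question})"
    mid = len(question) // 2
    return ask(question[:mid]) + " + " + ask(question[mid:])

print(hch("what is a good payload?", toy_human, depth=3))
```

In this sketch, depth plays the role of the amplification budget; whether increasing it actually increases capability depends entirely on properties of the unmodeled human function, which is the difficulty noted in the first bullet.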

Notably, the kind of agent for which analyzing capability amplification makes sense is not the kind of agent that ALBA aligns. ALBA aligns learning systems, and humans’ reasoning over time is deliberation rather than learning. I think this is the root of a lot of the difficulty of formally analyzing ALBA.

Plausibility arguments are hard to act on

ALBA is based on plausibility arguments: it’s plausible that capability amplification works, it’s plausible that informed oversight works, and it’s plausible that red teams work. These questions are actually pretty hard to research.

Recently I have updated towards a generalization of this: plausibility arguments are very often hard to act on, since the belief that “X is plausible” does not usually come attached to a strong model implying X is probably true, and these strong models are necessary for generating proofs. (Sometimes, plausibility arguments are sufficient to act on, especially in “easy” domains such as mathematics and puzzle games where there exist search procedures for proving or disproving X. But AI alignment doesn’t seem like one of these domains to me.)

Agreement with many of Paul’s intuitions

This post so far gives the impression that I strongly disagree with Paul. I do disagree with Paul, but I also strongly agree with many of his intuitions:

  • I think that “going for the throat” by trying to solve the whole alignment problem is quite useful, though other approaches should also be used (since “going for the throat” will tend to produce intractable research problems).

  • I think that corrigibility as Paul understands it will turn out to be an important tool for analyzing human-computer systems.

  • I think that splitting an AI system into an expansion strategy and a payload is sometimes useful, and in particular is usually more illuminating than splitting it into beliefs and values.

  • I think that thinking about the big strategic picture and the limiting behavior of a system, as Paul does, is useful.

  • I think that Paul Christiano’s research is quite helpful for reasoning about learning and control systems; for example, counterfactual oversight seems very useful for handling epistemic problems discussed in “He Who Pays The Piper Must Know The Tune”.

Conclusion

I have presented my thoughts on ALBA. ALBA has significantly improved my conceptual understanding of AI alignment, but is seriously incomplete and difficult to make further theoretical progress on. I don’t know how to make much additional theoretical progress on the alignment problem at this point, but perhaps taking some steps back from specific approaches and doing original seeing on alignment would yield new directions.