How To Go From Interpretability To Alignment: Just Retarget The Search
[EDIT: Many people who read this post were very confused about some things, which I later explained in What’s General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? You might want to read that post first.]
When people talk about prosaic alignment proposals, there’s a common pattern: they’ll be outlining some overcomplicated scheme, and then they’ll say “oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are”, and then they’ll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there’s a whole discussion about the specific problems with the overcomplicated scheme.
In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I’ll call the strategy “Just Retarget the Search”.
We’ll need to make two assumptions:
Some version of the natural abstraction hypothesis holds, and the AI ends up with an internal concept for human values, or corrigibility, or what the user intends, or human mimicry, or some other outer alignment target.
The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process.
Given these two assumptions, here’s how to use interpretability tools to align the AI:
Identify the AI’s internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc).
Identify the retargetable internal search process.
Retarget the internal search process (i.e. directly rewire it, or set its input state) so that it points at the internal representation of our alignment target.
Just retarget the search. Bada-bing, bada-boom.
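To make the shape of the intervention concrete, here’s a deliberately toy sketch in Python. The “model” is a hand-written maze-solver whose internal search and internal concepts are legible by construction; in a real trained network, finding those pieces is exactly the hard interpretability work we’re assuming we can do. Everything in the sketch (the ToyMazeSolver class, the concepts dictionary, the search_target slot) is made up purely for illustration.

```python
from collections import deque

class ToyMazeSolver:
    """Stand-in for a trained system that contains a general-purpose internal search."""

    def __init__(self, maze, concepts):
        self.maze = maze                          # dict: cell -> list of neighboring cells
        self.concepts = concepts                  # "internal concepts": name -> goal cell
        self.search_target = concepts["cheese"]   # whatever mesa-objective training happened to instill

    def plan(self, start):
        # A retargetable general-purpose search: BFS toward whatever self.search_target currently is.
        frontier, seen = deque([[start]]), {start}
        while frontier:
            path = frontier.popleft()
            if path[-1] == self.search_target:
                return path
            for nxt in self.maze[path[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(path + [nxt])
        return None

maze = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
model = ToyMazeSolver(maze, concepts={"cheese": "D", "user_goal": "C"})

# Step 1: identify the internal concept corresponding to our alignment target.
target_representation = model.concepts["user_goal"]

# Step 2: identify the retargetable internal search process (here, the planner itself).
# Step 3: retarget it by overwriting the target it searches toward.
model.search_target = target_representation

print(model.plan("A"))   # now plans toward the user's goal: ['A', 'C']
```

Note that the search algorithm itself never changes; the only intervention is swapping out the thing it’s pointed at. That’s the whole trick.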
Problems
Of course as written, “Just Retarget the Search” has some issues; we haven’t added any of the bells and whistles to it yet. Probably the “identify the internal representation of the alignment target” step is less like searching through a bunch of internal concepts, and more like writing our intended target in the AI’s internal concept-language. Probably we’ll need to do the retargeting regularly on-the-fly as the system is training, even when the search is only partly-formed, so we don’t end up with a misaligned AI before we get around to retargeting. Probably we’ll need a bunch of empirical work to figure out which possible alignment targets are and are not easily expressible in the AI’s internal language (e.g. I’d guess “user intention” or “human mimicry” are more likely than “human values”). But those details seem relatively straightforward.
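As a gesture at what the on-the-fly retargeting might look like, here’s a minimal sketch of a training loop, under the same assumptions as the toy above. The helpers find_search, find_concept, and train_step are hypothetical stand-ins passed in as arguments; none of them exist today, and the sketch only shows the shape of the loop, not how any of the hard parts get done.

```python
def retargeting_training_loop(model, data_stream, train_step,
                              find_search, find_concept,
                              alignment_target="user_intention",
                              retarget_every=1000):
    """Train as usual, but periodically re-point any internal search at our alignment target."""
    for step, batch in enumerate(data_stream):
        train_step(model, batch)                    # ordinary gradient update (assumed given)

        if step % retarget_every == 0:
            search = find_search(model)             # hypothetical interpretability tool
            if search is None:
                continue                            # no internal search has formed yet
            target = find_concept(model, alignment_target)   # hypothetical; may fail if the target
            if target is not None:                  # isn't expressible in the AI's concept-language
                search.target = target              # the actual retargeting step
```

The check for whether find_concept returned anything is where the empirical question above lives: whether something like “user intention” or “human mimicry” is actually expressible in the AI’s internal concept-language.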
A bigger issue is that “Just Retarget the Search” just… doesn’t seem robust enough that we’d want to try it on a superintelligence. We still need to somehow pick the right target (i.e. handle outer alignment), and ideally it’s a target which fails gracefully (i.e. some amount of basin-of-corrigibility). If we fuck up and aim a superintelligence at not-quite-the-right-target, game over. Insofar as “Just Retarget the Search” is a substitute for overcomplicated prosaic alignment schemes, that’s probably fine; most of those schemes are targeting only-moderately-intelligent systems anyway IIUC. On the other hand, we probably want our AI competent enough to handle ontology shifts well; otherwise the concept we’ve pointed the search at may fall apart when the AI’s world-model shifts.
Then, of course, there are the assumptions (natural abstractions and retargetable search), either of which could fail. That said, if one or both of the assumptions fail, then (a) that probably messes up a bunch of the overcomplicated prosaic alignment schemes too (e.g. failure of the natural abstraction hypothesis can easily sink interpretability altogether), and (b) that might mean that the system just isn’t that dangerous in the first place (e.g. if it turns out that retargetable internal search is indeed necessary for dangerous intelligence).
Upsides
First big upside of Just Retargeting the Search: it completely and totally eliminates the inner alignment problem. We just directly set the internal optimization target.
Second big upside of Just Retargeting the Search: it’s conceptually simple. The problems and failure modes are mostly pretty obvious. There is no recursion, no complicated diagram of boxes and arrows. We’re not playing two Mysterious Black Boxes against each other.
But the main reason to think about this approach, IMO, is that it’s a true reduction of the problem. Prosaic alignment proposals have a tendency to play a shell game with the Hard Part of the problem, moving it around and hiding it in different black boxes but never actually eliminating it. “Just Retarget the Search” directly eliminates the inner alignment problem. No shell game, no moving the Hard Part around. It still leaves the outer alignment problem unsolved, and it still needs assumptions about natural abstractions and retargetable search, but it completely removes one Hard Part and reduces the problem to something simpler.
As such, I think “Just Retarget the Search” is a good baseline. It’s a starting point for thinking about the parts of the problem it doesn’t solve (e.g. outer alignment), or the ways it might fail (retargetable search, natural abstractions), without having to worry about inner alignment.