Testing The Natural Abstraction Hypothesis: Project Intro
The natural abstraction hypothesis says that:
- Our physical world abstracts well: for most systems, the information relevant “far away” from the system (in various senses) is much lower-dimensional than the system itself. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans.
- These abstractions are “natural”: a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world.
If true, the natural abstraction hypothesis would dramatically simplify AI and AI alignment in particular. It would mean that a wide variety of cognitive architectures will reliably learn approximately-the-same concepts as humans use, and that these concepts can be precisely and unambiguously specified.
Ultimately, the natural abstraction hypothesis is an empirical claim, and will need to be tested empirically. At this point, however, we lack even the tools required to test it. This post is an intro to a project to build those tools and, ultimately, test the natural abstraction hypothesis in the real world.
Background & Motivation
One of the major conceptual challenges of designing human-aligned AI is the fact that human values are a function of humans’ latent variables: humans care about abstract objects/concepts like trees, cars, or other humans, not about low-level quantum world-states directly. This leads to conceptual problems of defining “what we want” in physical, reductive terms. More generally, it leads to conceptual problems in translating between human concepts and concepts learned by other systems—e.g. ML systems or biological systems.
If true, the natural abstraction hypothesis provides a framework for translating between high-level human concepts, low-level physical systems, and high-level concepts used by non-human systems.
The foundations of the framework have been sketched out in previous posts.
What is Abstraction? introduces the mathematical formulation of the framework and provides several examples. Briefly: the high-dimensional internal details of far-apart subsystems are independent given their low-dimensional “abstract” summaries. For instance, the Lumped Circuit Abstraction abstracts away all the details of molecule positions or wire shapes in an electronic circuit, and represents the circuit as components each summarized by some low-dimensional behavior—like V = IR for a resistor. This works because the low-level molecular motions in a resistor are independent of the low-level molecular motions in some far-off part of the circuit, given the high-level summary. All the rest of the low-level information is “wiped out” by noise in low-level variables “in between” the far-apart components.
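To make the conditional-independence picture concrete, here is a minimal numerical sketch (a toy example of mine, not taken from that post, with all numbers chosen purely for illustration): a high-dimensional component X influences a far-away component Y only through a one-dimensional summary, and once we condition on that summary, the remaining correlation between their internal details vanishes.

```python
# Toy sketch (illustrative assumptions throughout): component X's only influence on the
# far-away component Y passes through a 1-D summary s, plus independent noise.
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 100_000, 30

X = rng.normal(size=(n_samples, dim))            # low-level internals of component 1
s = X.sum(axis=1, keepdims=True)                 # 1-D "abstract summary" of X
Y = 0.3 * s + rng.normal(size=(n_samples, dim))  # far-away component: sees X only via s

def corr(a, b):
    a, b = a - a.mean(0), b - b.mean(0)
    return float((a * b).mean() / (a.std() * b.std()))

def residual(v):
    # Regress the summary s out of v (ordinary least squares).
    sc = s - s.mean()
    beta = np.linalg.lstsq(sc, v - v.mean(0), rcond=None)[0]
    return v - v.mean(0) - sc @ beta

x0, y0 = X[:, [0]], Y[:, [0]]
print("raw correlation of internals:     ", corr(x0, y0))                        # clearly nonzero (~0.16)
print("correlation given the 1-D summary:", corr(residual(x0), residual(y0)))    # ~0
```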
Chaos Induces Abstractions explains one major reason why we expect low-level details to be independent (given high-level summaries) for typical physical systems. If I have a bunch of balls bouncing around perfectly elastically in a box, then the total energy, number of balls, and volume of the box are all conserved, but chaos wipes out all other information about the exact positions and velocities of the balls. My “high-level summary” is then the energy, number of balls, and volume of the box; all other low-level information is wiped out by chaos. This is exactly the abstraction behind the ideal gas law. More generally, given any uncertainty in initial conditions—even very small uncertainty—mathematical chaos “amplifies” that uncertainty until we are maximally uncertain about the system state… except for information which is perfectly conserved. In most dynamical systems, some information is conserved, and the rest is wiped out by chaos.
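A minimal simulation sketch of the same point, using the Hénon-Heiles system as a stand-in for the balls-in-a-box example (the choice of system and initial conditions are assumptions of this sketch): two nearly identical initial conditions diverge until the original difference is unrecoverable, while each trajectory's total energy, the conserved high-level summary, stays essentially fixed.

```python
# Sketch: chaos wipes out initial-condition information except for conserved quantities.
# Hénon-Heiles is a standard chaotic Hamiltonian toy model; the initial condition below
# is assumed to lie in its (largely chaotic) bounded regime just below the escape energy.
import numpy as np
from scipy.integrate import solve_ivp

def henon_heiles(t, state):
    x, y, px, py = state
    return [px, py, -x - 2 * x * y, -y - x**2 + y**2]

def energy(state):
    x, y, px, py = state
    return 0.5 * (px**2 + py**2) + 0.5 * (x**2 + y**2) + x**2 * y - y**3 / 3

s0 = np.array([0.0, 0.1, 0.55, 0.1])         # E ~ 0.16 < 1/6, so the orbit stays bounded
s1 = s0 + np.array([1e-8, 0.0, 0.0, 0.0])    # tiny uncertainty in the initial condition

t_eval = np.linspace(0.0, 200.0, 2001)
sol0 = solve_ivp(henon_heiles, (0.0, 200.0), s0, t_eval=t_eval, rtol=1e-10, atol=1e-12)
sol1 = solve_ivp(henon_heiles, (0.0, 200.0), s1, t_eval=t_eval, rtol=1e-10, atol=1e-12)

separation = np.linalg.norm(sol0.y - sol1.y, axis=0)    # grows by orders of magnitude on a chaotic orbit
energy_drift = abs(energy(sol0.y[:, -1]) - energy(s0))  # stays tiny: energy is conserved

print(f"final separation of trajectories: {separation[-1]:.2e}  (started at 1e-8)")
print(f"energy drift along trajectory 0:  {energy_drift:.2e}")
```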
Anatomy of a Gear: What makes a good “gear” in a gears-level model? A physical gear is a very high-dimensional object, consisting of huge numbers of atoms rattling around. But for purposes of predicting the behavior of the gearbox, we need only a one-dimensional summary of all that motion: the rotation angle of the gear. More generally, a good “gear” is a subsystem which abstracts well—i.e. a subsystem for which a low-dimensional summary can contain all the information relevant to predicting far-away parts of the system.
Science in a High Dimensional World: Imagine that we are early scientists, investigating the mechanics of a sled sliding down a slope. The number of variables which could conceivably influence the sled’s speed is vast: angle of the hill, weight and shape and material of the sled, blessings or curses laid upon the sled or the hill, the weather, wetness, phase of the moon, latitude and/or longitude and/or altitude, astrological motions of stars and planets, etc. Yet in practice, just a relatively-low-dimensional handful of variables suffices—maybe a dozen. A consistent sled-speed can be achieved while controlling only a dozen variables, out of literally billions. And this generalizes: across every domain of science, we find that controlling just a relatively-small handful of variables is sufficient to reliably predict the system’s behavior. Figuring out which variables is, in some sense, the central project of science. This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.
The Problem and The Plan
The natural abstraction hypothesis can be split into three sub-claims, two empirical, one mathematical:
- Abstractability: for most physical systems, the information relevant “far away” can be represented by a summary much lower-dimensional than the system itself.
- Human-Compatibility: these summaries are the abstractions used by humans in day-to-day thought/language.
- Convergence: a wide variety of cognitive architectures learn and use approximately-the-same summaries.
Abstractability and human-compatibility are empirical claims, which ultimately need to be tested in the real world. Convergence is a more mathematical claim: ideally, it will involve proving theorems, though empirical investigation will likely still be needed to figure out exactly which theorems to prove.
These three claims suggest three different kinds of experiment to start off:
- Abstractability: does reality abstract well? Corresponding experiment type: run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional.
- Human-Compatibility: do these match human abstractions? Corresponding experiment type: run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance recovers human-recognizable abstractions.
- Convergence: are these abstractions learned/used by a wide variety of cognitive architectures? Corresponding experiment type: train a predictor/agent against a simulated environment with known abstractions; look for a learned abstract model.
The first two experiments both require computing information-relevant-at-a-distance in a reasonably-complex simulated environment. The “naive”, brute-force method for this would not be tractable; it would require evaluating high-dimensional integrals over “noise” variables. So the first step will be to find practical algorithms for directly computing abstractions from low-level simulated environments. These don’t need to be fully-general or arbitrarily-precise (at least not initially), but they need to be general enough to apply to a reasonable variety of realistic systems.
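As one illustration of what such an algorithm might look like in an especially tractable special case (the linear-Gaussian approximation here is an assumption of this sketch, not a commitment of the project), the information a local chunk of low-level variables carries about far-away variables is captured by their cross-covariance, and counting its non-negligible singular values estimates how low-dimensional the relevant summary is:

```python
# Hedged sketch under a linear-Gaussian assumption: estimate the dimension of the
# information-at-a-distance between a "near" chunk of low-level variables and a
# "far-away" chunk via the singular values of their sample cross-covariance.
import numpy as np

def summary_dimension(samples_near, samples_far, tol=0.05):
    """samples_near: (n, d_near) local low-level state; samples_far: (n, d_far) far-away state.
    tol is a crude cutoff, expressed as a fraction of the largest singular value."""
    near = samples_near - samples_near.mean(axis=0)
    far = samples_far - samples_far.mean(axis=0)
    cross_cov = near.T @ far / len(near)                        # (d_near, d_far)
    sing_vals = np.linalg.svd(cross_cov, compute_uv=False)      # sorted, descending
    return sing_vals, int((sing_vals > tol * sing_vals[0]).sum())

# Toy usage: a 100-dimensional chunk whose effect far away is mediated by just 2 numbers.
rng = np.random.default_rng(1)
near = rng.normal(size=(50_000, 100))
true_summary = near[:, :2]                                      # 2-D summary, by construction
far = true_summary @ rng.normal(size=(2, 80)) + rng.normal(size=(50_000, 80))
sing_vals, dim = summary_dimension(near, far)
print("estimated summary dimension:", dim)                      # expected: 2
```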
Once we have algorithms capable of directly computing the abstractions in a system, training a few cognitive models against that system is an obvious next step. This raises another algorithmic problem: how do we efficiently check whether a cognitive system has learned particular abstractions? Again, this doesn’t need to be fully general or arbitrarily precise. It just needs to be general enough to use as a tool for the next step.
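One simple check, offered as a hedged sketch rather than as the project's settled method: fit a linear probe from the trained model's internal activations to the environment's known abstract summary, and treat held-out predictive accuracy as (weak, one-directional) evidence about whether the model has learned that abstraction in a linearly-decodable form.

```python
# Hedged sketch: a ridge-regularized linear probe from internal activations to a known
# abstract summary. High held-out R^2 suggests the abstraction is (linearly) encoded;
# a low score only rules out linear encoding, not the abstraction itself.
import numpy as np

def probe_r2(activations, summary, train_frac=0.8, ridge=1e-3):
    """activations: (n, d_act) internal states; summary: (n, d_abs) known abstraction."""
    n_train = int(train_frac * len(activations))
    A = np.hstack([activations[:n_train], np.ones((n_train, 1))])          # bias column
    W = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ summary[:n_train])
    A_test = np.hstack([activations[n_train:], np.ones((len(activations) - n_train, 1))])
    pred, target = A_test @ W, summary[n_train:]
    ss_res = ((target - pred) ** 2).sum()
    ss_tot = ((target - target.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot                                           # held-out R^2

# Tiny synthetic usage: activations that linearly encode a 1-D summary -> R^2 near 1.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5_000, 64))
known_summary = acts @ rng.normal(size=(64, 1))      # hypothetical "known abstraction"
print(round(probe_r2(acts, known_summary), 3))
```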
The next step is where things get interesting. Ideally, we want general theorems telling us which cognitive systems will learn which abstractions in which environments. As of right now, I’m not even sure exactly what those theorems should say. (There are some promising directions, like modular variation of goals, but the details are still pretty sparse and it’s not obvious whether these are the right directions.) This is the perfect use-case for a feedback loop between empirical and theoretical work:
1. Try training various cognitive systems in various environments, see what abstractions they learn.
2. Build a model which matches the empirical results, then come up with new tests for that model.
3. Iterate.
Along the way, it should be possible to prove theorems on what abstractions will be learned in at least some cases. Experiments should then probe cases not handled by those theorems, enabling more general models and theorems, eventually leading to a unified theory.
(Of course, in practice this will probably also involve a larger feedback loop, in which lessons learned training models also inform new algorithms for computing abstractions in more-general environments, and for identifying abstractions learned by the models.)
The end result of this process, the holy grail of the project, would be a system which provably learns all learnable abstractions in a fairly general class of environments, and represents those abstractions in a legible way. In other words: it would be a standardized tool for measuring abstractions. Stick it in some environment, and it finds the abstractions in that environment and presents a standard representation of them. Like a thermometer for abstractions.
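Purely for illustration, a hypothetical interface for such a tool might look like the sketch below; every name and field is an assumption made for this sketch, not an actual design.

```python
# Hypothetical interface sketch for an "abstraction thermometer" (all names assumed).
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class Abstraction:
    variables: List[int]                              # which low-level variables it summarizes
    summary_fn: Callable[[np.ndarray], np.ndarray]    # low-level state -> low-dimensional summary
    dimension: int                                    # dimensionality of the summary

def find_abstractions(sample_state: Callable[[], np.ndarray]) -> List[Abstraction]:
    """Given a way to sample the environment's low-level state, return the natural
    abstractions present in that environment, in a standard representation."""
    raise NotImplementedError  # building this is the point of the project
```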
Then, the ultimate test of the natural abstraction hypothesis would just be a matter of pointing the abstraction-thermometer at the real world, and seeing if it spits out human-recognizable abstract objects/concepts.
Summary
The natural abstraction hypothesis suggests that most high-level abstract concepts used by humans are “natural”: the physical world contains subsystems for which all the information relevant “far away” can be contained in a (relatively) low-dimensional summary. These subsystems are exactly the high-level “objects” or “categories” or “concepts” we recognize in the world. If true, this hypothesis would dramatically simplify the problem of human-aligned AI. It would imply that a wide range of architectures will reliably learn similar high-level concepts from the physical world, that those high-level concepts are exactly the objects/categories/concepts which humans care about (i.e. inputs to human values), and that we can precisely specify those concepts.
The natural abstraction hypothesis is mainly an empirical claim, which needs to be tested in the real world.
My main plan for testing this involves a feedback loop between:
- Calculating abstractions in (reasonably-realistic) simulated systems
- Training cognitive models on those systems
- Empirically identifying patterns in which abstractions are learned by which cognitive models in which environments
- Proving theorems about which abstractions are learned by which cognitive models in which environments
The holy grail of the project would be an “abstraction thermometer”: an algorithm capable of reliably identifying the abstractions in an environment and representing them in a standard format. In other words, a tool for measuring abstractions. This tool could then be used to measure abstractions in the real world, in order to test the natural abstraction hypothesis.
I plan to spend at least the next six months working on this project. Funding for the project has been supplied by the Long-Term Future Fund.
Comments

A decent introduction to the natural abstraction hypothesis and to how testing it might be attempted. A very worthy project, but the post isn't that easy for beginners to follow, nor does it give a good sense of how the testing might work in detail. What would constitute a success, and what a failure, of this testing? A decent introduction, but only an introduction, and it should have been part of a sequence or a longer post.
I’m curious if you’d looked at this followup (also nominated for review this year) http://lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update
I have looked at it, but ignored it when commenting on this post, which should stand on its own (or as part of a sequence).
Fair. Fwiw I’d be interested in your review of the followup as a standalone.
Here’s the review, though it’s not very detailed (the post explains why):
https://www.lesswrong.com/posts/dNzhdiFE398KcGDc9/testing-the-natural-abstraction-hypothesis-project-update?commentId=spMRg2NhPogHLgPa8
Yup, makes sense. Thank you!