Summary
The natural abstractions hypothesis makes three claims:
Abstractability: To make predictions in our world, it’s enough to know very low-dimensional summaries of systems, i.e., their abstractions (empirical claim)
Human-compatibility: Humans themselves use these abstractions in their thinking (empirical claim)
Convergence/naturality: Most cognitive systems use these abstractions to make predictions (mathematical + empirical claim)
John wants to test this hypothesis by:
Running simulations of systems and showing that low-information summaries predict how they evolve (a toy sketch of this test appears right after this summary)
Checking whether these low-information summaries agree with how humans reason about the system
Training predictors/agents on the system and observing whether they use these low-dimensional summaries. Also, trying to prove theorems about which systems will use which abstractions in which environments.
The Holy Grail: a machine that provably detects the low-dimensional abstractions useful for making predictions in almost any system. One would then run it on the real world and check whether the detected concepts agree with human real-world concepts. John says this would prove the NAH.
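To make the simulation test concrete, here is a minimal toy sketch (my own illustration, not code from John’s post): many independent random-walking particles, a one-dimensional summary (the mean position), and a check that two different microstates sharing the same summary lead to statistically indistinguishable future summaries. The dynamics and all helper names (`simulate`, `summary`) are assumptions made up for this example.

```python
# Toy test of abstractability: does a 1-dimensional summary of a 2000-dimensional
# system predict the system's future summary, with the microscopic details
# adding essentially nothing on top?
# Assumed dynamics: independent Gaussian random walks (my choice, not the post's).
import numpy as np

rng = np.random.default_rng(0)
N_PARTICLES = 2_000
N_STEPS = 50

def summary(state: np.ndarray) -> float:
    """Low-dimensional summary: the mean particle position."""
    return float(state.mean())

def simulate(state: np.ndarray, n_steps: int) -> np.ndarray:
    """Evolve every particle by independent Gaussian random-walk steps."""
    steps = rng.normal(0.0, 1.0, size=(n_steps, state.size))
    return state + steps.sum(axis=0)

# Two very different microstates that share the SAME summary (mean = 0).
state_a = rng.normal(0.0, 5.0, N_PARTICLES)
state_a -= state_a.mean()
state_b = rng.uniform(-3.0, 3.0, N_PARTICLES)
state_b -= state_b.mean()

# Evolve both many times and compare the distributions of the future summary.
future_a = [summary(simulate(state_a, N_STEPS)) for _ in range(200)]
future_b = [summary(simulate(state_b, N_STEPS)) for _ in range(200)]

print(f"future summary from microstate A: mean {np.mean(future_a):+.3f}, std {np.std(future_a):.3f}")
print(f"future summary from microstate B: mean {np.mean(future_b):+.3f}, std {np.std(future_b):.3f}")
# If abstractability holds for this system, the two distributions are nearly
# identical: given the current summary, the discarded microscopic detail carries
# (almost) no extra information about the future summary.
```

The same toy setup could, in principle, be extended toward the third test by training a small predictor on raw states and probing whether its internal representation encodes the mean, though that is beyond this sketch.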
Further Thoughts
To me, it seems like conceptual/theoretical progress is needed at least as much as empirical progress, since we still don’t quite understand, conceptually, what it means to “make predictions about a system”:
Clearly, predicting all the low-level details is not possible with abstract summaries alone.
Thus, the only things one can ever hope to predict with abstract summaries are… other abstract summaries.
However, this seems to create a chicken-and-egg problem: we already need to know the relevant abstractions in order to assess whether a candidate abstraction is useful for predicting their values. It’s not enough to find “any low-dimensional piece of information” that is good for predicting… predicting what, exactly? (A small illustration of this circularity follows below.)
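To see why this circularity bites, here is a tiny sketch (again my own illustration, with made-up names): if we score a candidate summary only by how well its current value predicts its own future value, then a useless constant summary scores perfectly.

```python
# Scoring a summary by "how well does it predict itself later?" is gameable:
# a constant summary achieves error 0 while telling us nothing about the system.
import numpy as np

rng = np.random.default_rng(1)

def evolve(state: np.ndarray) -> np.ndarray:
    """One step of some microscopic dynamics (here: a Gaussian random walk)."""
    return state + rng.normal(0.0, 1.0, size=state.shape)

def self_prediction_error(summary_fn, n_trials: int = 500) -> float:
    """Mean squared error of using the current summary to predict the next-step summary."""
    errors = []
    for _ in range(n_trials):
        state = rng.normal(0.0, 1.0, size=1_000)
        now = summary_fn(state)
        later = summary_fn(evolve(state))
        errors.append((later - now) ** 2)
    return float(np.mean(errors))

def mean_summary(state: np.ndarray) -> float:
    return float(state.mean())   # a plausibly useful abstraction

def constant_summary(state: np.ndarray) -> float:
    return 0.0                   # an obviously useless abstraction

print("mean summary error:    ", self_prediction_error(mean_summary))
print("constant summary error:", self_prediction_error(constant_summary))
# The constant summary "wins" with error exactly 0, so "predicts other abstract
# summaries" cannot be the whole criterion; we first need some handle on which
# quantities are worth predicting in the first place.
```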
The problem of science that John discusses has a nice interpretation for alignment research:
Probably there is only a small number of variables to tweak in exactly the right way when building advanced AI, and this will be enough — for a superintelligent being, at least — to correctly predict that everything will remain aligned. Let’s find those variables.
This reminds me of Eliezer’s claim that the alignment solution probably fits into one very simple “book from the future” that contains all the right ideas, similar to how our world now contains the simple idea of the ReLU, which wasn’t accessible 20 years ago.
I think if we had this abstraction-thermometer, we wouldn’t even need “convergence” anymore: simply use the thermometer itself as part of the AGI, by pointing the AGI at the revealed human-values concept (a purely hypothetical sketch of this pipeline appears at the end of these notes). So I think I’m fine with reducing the NAH to just the AH, consisting of only two claims: that low-dimensional information is enough for making predictions, and that human concepts (in particular, human values) are such low-dimensional information. Then we “just” need to build an AGI that points to human values, and make sure that no other AGI gets built that doesn’t point there (or even couldn’t point there because it uses other abstractions).
If we don’t have the “N” part of the NAH, “alignment by default” becomes less likely. But that isn’t so bad from the perspective of trying to get “alignment by design”.
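Purely to make the “point the AGI at the revealed human-values concept” idea explicit, here is a hypothetical structural sketch. None of these names or interfaces exist anywhere; the central function is exactly the open research problem, and the sketch only shows where the thermometer would sit in the pipeline.

```python
# Hypothetical "thermometer as a pointer" pipeline (structure only, not a real API).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Concept:
    name: str
    evaluate: Callable[[object], float]  # stand-in for whatever the thermometer outputs

def abstraction_thermometer(world_model: object) -> List[Concept]:
    """Hypothetical: return the natural abstractions found in a learned world model."""
    raise NotImplementedError("this is the unsolved research problem")

def build_aligned_objective(world_model: object) -> Callable[[object], float]:
    """Point the objective at the revealed human-values concept, if the thermometer finds one."""
    concepts = abstraction_thermometer(world_model)
    matches = [c for c in concepts if c.name == "human values"]
    if not matches:
        raise RuntimeError("thermometer did not surface a human-values concept")
    return matches[0].evaluate  # the AGI optimizes this directly, not a proxy
```

If such a thermometer worked at all, the convergence (“N”) claim would matter less for this route, because we would no longer rely on the AGI rediscovering the same concepts on its own.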