Best arguments against the Natural Abstractions Hypothesis applying to human values?
The Natural Abstractions Hypothesis (NAH) states that a wide class of models will, simply by observing the world, in the limit build abstractions similar to those built by humans. For example, the concept “tree” is useful to a superintelligence because it compactly summarizes a frequently occurring phenomenon (there are lots of trees) in a way that supports prediction across many different situations.
Assume we’re talking about an AGI that was at some point pretrained on accurately predicting the whole internet. (Or, as a testable precursor, consider an LLM doing the same.)
It seems almost certain to me that, when it comes to human values (anything where you could say that another human was exhibiting that value: curiosity, honor, respect, awe, kindness), the NAH will basically hold. This is because human interactions are, at their most basic, a function of these values, and these values are frequently both the subtext and the object of situations and conversations. For a model to be highly accurate on the distribution of all internet text, I struggle to imagine what “simpler alternative ontology” it could possibly learn that would predict so much of human behavior. “Betrayal”/“autonomy”/“fairness” seem like “trees” when it comes to models that need to predict humans.
Note that this doesn’t mean these concepts are well-defined! (Fairness certainly isn’t.) But it does seem obvious that agents pretrained on the internet will have learned them to roughly the same intuitive/fuzzy extent that humans have.
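To make the “testable precursor” framing concrete, here is a minimal sketch of the kind of experiment I have in mind: train a linear probe on a small pretrained LM’s hidden states and check whether descriptions of fair vs. unfair situations are separable. The model choice (gpt2), the handful of toy sentences, and the mean-pooling are all my own assumptions for illustration, not a serious evaluation.

```python
# Illustrative sketch only: does a small pretrained LM linearly separate
# "fair" vs "unfair" scenario descriptions in its hidden states?
# gpt2 and the toy examples below are placeholder assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

fair = [
    "Everyone who did the same work received the same pay.",
    "The judge heard both sides before deciding.",
]
unfair = [
    "She was fired for a mistake her manager made.",
    "He took credit for his colleague's idea.",
]

def embed(text):
    # Mean-pool the final hidden layer as a crude sentence representation.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0).numpy()

X = [embed(t) for t in fair + unfair]
y = [1] * len(fair) + [0] * len(unfair)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))  # with real data you'd evaluate on held-out examples
```

With only four sentences this proves nothing by itself; the point is that “has the model learned this abstraction?” is the kind of question one can start probing empirically rather than only arguing about.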
Certainly it may also be true that LLMs will eventually learn abstractions like “pretend to be honorable while amassing power”. But these are much less frequent in the data, so it’s not clear to me why we’d expect them to be selected over more frequently referenced abstractions (like regular honor).
Can someone pessimistic about the NAH as it applies to human values help me understand which arguments motivate their view? It’d be really useful to have a thought-experiment to chew on.