Best arguments against the Natural Abstractions Hypothesis applying to human values?
The Natural Abstractions Hypothesis (NAH) states that a wide class of models will, simply by observing the world, in the limit build abstractions that are similar to those built by humans. For example, the concept “tree” is useful to a superintelligence because it captures a frequently occurring phenomenon (there are lots of trees) and is useful for prediction in many different situations.
Assume we’re talking about an AGI that was at some point pretrained on accurately predicting the whole internet. (Or, as a testable precursor, consider an LLM doing the same.)
It seems almost certain to me that, when it comes to human values (anything where you could say that another human was exhibiting that value: curiosity, honor, respect, awe, kindness), the NAH will basically hold. This is because human interactions are, at their most basic, a function of these values, and these values are frequently both the subtext and the object of situations and conversations. For a model to be highly accurate on the distribution of all internet text, I struggle to imagine what “simpler alternative ontology” it could possibly learn that would predict so much of humanity. “Betrayal”/“autonomy”/“fairness” seem like “trees” when it comes to models that need to predict humans.
Note that this doesn’t mean these concepts are well-defined! (Fairness certainly isn’t.) But it does seem obvious that agents pretrained on the internet will have learned them to a similar intuitive/fuzzy extent as humans have.
Certainly it may also be true that LLMs will eventually learn abstractions like “pretend to be honorable while amassing power”. But these are much less frequent in the data, so it’s not clear to me why we’d expect them to be selected over more frequently-referred-to abstractions (like regular honor).
Can someone pessimistic about the NAH as it applies to human values help me understand what arguments they’re motivated by? It’d be really useful to have a thought-experiment to chew on.
I have a heuristic mental model of a prosaic alignment solution for near-human-level AGIs. I know it must be broken for various reasons, but I’m not sure what the slam dunk, “no way we get that lucky” reason is. Would love others’ help.
First, as a prior, let’s assume deceptive/misaligned mesaoptimization gets solved. I’m also going to assume corrigibility doesn’t get solved, so we’re forced to play that most dangerous game: “are our heuristic solutions smarter than the optimization pressure that will be trying to break them?”
Now, the first step is defining our reward function. We do this by unsupervised training of a giant LLM, followed by RL from human feedback, to get a decent-ish proxy for “textual descriptions of scenarios humans do and don’t like”. (Obviously “kill humans” is easily captured here, and hopefully “please the humans giving you instructions” is too.) Moreover, we do a lot of off-distribution checking, and make sure that the reward function penalizes anything that even vaguely seems like an adversarial example or an off-distribution input.

Next, we “ground” the model by mapping the whole archive of YouTube videos to distributions over textual descriptions of them. We define the “reward” of a given video as the reward of the adversarially-generated worst-sounding description of the video (according to the text-based reward function) that still has a decently large probability of being true. Again, the video inputs are very heavily checked for OOD robustness. We repeat this process for our various reward datastreams.

This is our “proxy reward”, which we require to remain above a certain threshold, relative to where it is today absent any further actions, for 100 years (valuing the intermediate states just as much as the end state); after that horizon the AI shouldn’t care.
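To make the “worst plausible description” grounding step concrete, here is a minimal Python sketch of how the pieces might fit together. Everything in it is a hypothetical stand-in I made up for illustration (text_reward, describe_video, the probability threshold); the only real content is the min-over-plausible-descriptions structure.

```python
# Hypothetical sketch of the "proxy reward" construction described above.
# Every function here is a stand-in; in a real system each would be a large trained model.

from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    description: str    # textual description of what happens in the video
    probability: float  # model's estimate that the description is true


def text_reward(description: str) -> float:
    """RLHF-trained proxy for 'do humans like this scenario?' (stand-in)."""
    return 0.0  # placeholder


def describe_video(video: bytes, n_candidates: int = 64) -> List[Candidate]:
    """Adversarially sample candidate textual descriptions of a video (stand-in)."""
    return []  # placeholder


def video_reward(video: bytes, min_probability: float = 0.2) -> float:
    """Reward of a video = reward of the worst-sounding description that is still plausibly true."""
    plausible = [c for c in describe_video(video) if c.probability >= min_probability]
    if not plausible:
        return 0.0  # a real system needs a policy for "no plausible description"
    return min(text_reward(c.description) for c in plausible)


def proxy_reward(datastreams: List[List[bytes]]) -> float:
    """Average the grounded reward over every video in every reward datastream."""
    rewards = [video_reward(v) for stream in datastreams for v in stream]
    return sum(rewards) / max(len(rewards), 1)
```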
We also build a series of really intense proxies against self-duplication and self-modification: e.g., a very negative reward if there ever exists a copy of weights that downsample into this AI’s lottery ticket outside of the airgapped data center it was built in, and very negative rewards for all the versions of “modifying your own reward function” that we can think of.
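A similarly tiny sketch of how those penalty terms might be folded into the overall objective; the detector functions are purely hypothetical stand-ins, and the only real point is that the penalties are sized to dominate the proxy reward.

```python
# Hypothetical sketch: anti-self-duplication / anti-self-modification penalties
# bolted onto the proxy reward. The detectors are stand-ins for whatever
# (very hard) monitoring machinery would actually implement them.

SELF_COPY_PENALTY = -1e6          # weights found outside the airgapped data center
SELF_MODIFICATION_PENALTY = -1e6  # any recognized form of reward-function tampering


def weights_escaped_airgap(world_state: dict) -> bool:
    """Stand-in detector: does any copy of weights that downsample into this
    AI's lottery ticket exist outside the airgapped data center?"""
    return False  # placeholder


def reward_function_modified(world_state: dict) -> bool:
    """Stand-in detector for every version of 'modifying your own reward
    function' we could think of."""
    return False  # placeholder


def total_reward(world_state: dict, proxy: float) -> float:
    """Proxy reward plus hard penalties; the penalties dominate by construction."""
    total = proxy
    if weights_escaped_airgap(world_state):
        total += SELF_COPY_PENALTY
    if reward_function_modified(world_state):
        total += SELF_MODIFICATION_PENALTY
    return total
```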
Now, we give the AI the best world-model we can construct via supervised learning on all the data ever. We especially focus on re-generating math/TCS papers, including alignment papers.
We tell the AI, in addition to its default goals, “write a groundbreaking paper on [infra-Bayesian physicalism, or some other alignment approach]”. We let the AI think with a scratchpad for a while, until it outputs an arxiv paper. One or two randomly-chosen alignment experts read the paper, judge whether it has at least some merit, rederive its main arguments and port them into a fresh manuscript (destroying the original), and then get sent to an island somewhere without talking to anyone. (If we want to save on alignment experts, they can do this for several papers at a time.)
We repeat this process about 1000 times, each run a fresh restart of the same AGI checkpoint provided with the (transcribed) new papers; sequential in principle, though ideally much of it runs in parallel. By the end, we maybe have an alignment solution that human researchers believe might work? Which is better than where we’re at now.
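For concreteness, the loop I have in mind looks roughly like the sketch below. Every name is a hypothetical stand-in (run_agi_from_checkpoint, expert_transcribe), and a real setup would batch many restarts in parallel rather than literally iterate.

```python
# Hypothetical sketch of the repeated paper-generation protocol: ~1000 restarts
# of the same frozen AGI checkpoint, each seeded only with the experts'
# transcriptions of earlier papers. All functions are stand-ins.

from typing import List


def run_agi_from_checkpoint(checkpoint: str, prior_papers: List[str],
                            thinking_budget_days: int = 7) -> str:
    """Stand-in: one airgapped run of the frozen checkpoint, ending when it
    outputs an arxiv-style alignment paper."""
    return "paper text"  # placeholder


def expert_transcribe(paper: str) -> str:
    """Stand-in: one or two randomly chosen alignment experts judge the paper,
    rederive its main arguments in a fresh manuscript, and destroy the original."""
    return paper  # placeholder


def run_protocol(checkpoint: str, n_rounds: int = 1000) -> List[str]:
    corpus: List[str] = []
    for _ in range(n_rounds):
        raw_paper = run_agi_from_checkpoint(checkpoint, prior_papers=corpus)
        corpus.append(expert_transcribe(raw_paper))  # only transcriptions re-enter the loop
    return corpus
```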
Obviously, if the first AI capable of writing a meaningful alignment paper is already a superintelligence, none of these protection measures do much good and this kills us. But I think that’s only a hypothesis; maybe it’s >50% likely? I don’t really know. My question is, which of these countermeasures seem most likely to break first, given the fairly modest goal the AGI is being asked to fulfill? (I’m trying to improve my intuition here.)
To spell out the intuition for why the AGI doesn’t immediately decide that its effect on the long-term future is way more important, communicate with successors by choosing which alignment theorems to prove, coordinate on a solution that lets it or its successors escape, and then kill all humans and reshape the world according to its (inevitably flawed somewhere) reward function: that’s, like, pretty hard. The AGI is searching for the simplest solutions that will yield a good reward within a week of thinking. Maybe writing a good alignment paper is just the easiest way to satisfy these constraints, even if a superintelligent version of itself would’ve discovered a better (worse for us) solution.
The most obvious-to-me flaw with the plan of “hang out in the slightly superhuman range for a few decades and try to slowly get better alignment work done, possibly with the help of the AI” is that it requires no one ever turning an AI’s capabilities up even a little bit.
That level of coordination isn’t completely infeasible but it doesn’t seem remotely reliable.
100%; if I thought we had other options, I’d obviously choose them.
The only reason this might be even hypothetically possible is self-interest, if we can create really broad social consensus about the difficulty of alignment. No one is trying to kill themselves.
By “pessimistic about the NAH”, do you mean, “does not believe the NAH”, or, “pessimistic that the fact that the AGI will have the same abstractions we have is a valuable clue for how to align the AGI”?
I mean “does not believe the NAH”, i.e., does not think that if you fine-tune GPT-6 to predict “in this scenario, would this action be perceived as a betrayal by a human?”, the LM would get it right essentially every time 3 random humans would agree on it.
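(For concreteness, the test I have in mind is roughly the sketch below; the model call and the labeled dataset are hypothetical stand-ins.)

```python
# Hypothetical sketch of the test: does the fine-tuned LM match human judgment on
# "would this action be perceived as a betrayal?" whenever 3 random human raters
# agree among themselves? Both the model call and the dataset are stand-ins.

from typing import List, Tuple

# Each item: (scenario text, [3 independent human yes/no judgments])
Dataset = List[Tuple[str, List[bool]]]


def lm_judges_betrayal(scenario: str) -> bool:
    """Stand-in for the fine-tuned LM's yes/no answer."""
    return False  # placeholder


def accuracy_on_unanimous_items(data: Dataset) -> float:
    """Accuracy restricted to items where all 3 human raters agree; the claim
    under test is that this number should be essentially 1.0."""
    hits, total = 0, 0
    for scenario, human_votes in data:
        if len(set(human_votes)) != 1:
            continue  # humans disagree; the claim makes no prediction here
        total += 1
        if lm_judges_betrayal(scenario) == human_votes[0]:
            hits += 1
    return hits / total if total else float("nan")
```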
Then I cannot answer your question because I’m not pessimistic about the NAH.