I don’t think it’s “wider variety of environment” per se. An amoeba and a human operate in the same environments: everywhere there are humans, there are amoebas. They are just a very common sort of microscopic life; you probably have some on or even in you as a commensal right now. And amoebas are in many environments that few or no humans are, like the bottoms of ponds and the like. Similarly, leaving aside the absurdity of asking people to compare how consistent ‘ants’ vs ‘sloths’ are as agents, I’d say ants inhabit much more diverse environments than sloths, whether you consider ants collectively as they colonize the world from pole to pole, or any individual ant going from the ant hill’s nursery to the outside to forage and battle; sloths, meanwhile, exist in a handful of biomes, and an individual sloth moves from branch to branch, and that’s about it. (And then there are the social complications: you know about ant warfare, of course, and the raids and division of territory and enslavement etc., but did you know some ants can hold “tournaments” (excerpts), where two hills bloodlessly compete, and the loser allows itself to be killed and its survivors enslaved by the winner? I am not aware of anything sloths do which is nearly as complicated.) Or take thermostats: thermostats are everywhere from deep space probes past the orbit of Pluto to deep sea vents; certainly a wider range of ‘environments’ than humans occupy. And while it is even more absurd to ask such a question about “ResNet-18” vs “CLIP”—surely the question of what a ResNet-18 wants and how coherent it is as an agent is well above the pay-grade of everyone surveyed—these are the sorts of image models which are deployed everywhere by everyone to classify/embed everything, and the small fast old ResNet-18 is probably deployed into many more environments than CLIP models are, particularly on embedded computers. So by any reasonable definition of ‘environment’, the ‘small’ things here are in as many environments as the ‘large’ ones, or more.
So much for ‘it’s the environment’. What differs is behavioral diversity. And of course a human can exhibit much more behavioral complexity than an amoeba, but the justification for our behavioral complexity is not complexity for the sake of complexity. Complexity is a cost. An amoeba does not find complexity cost-effective; we do, because it buys us power. A human has far greater power over its environment than an amoeba does over that exact same environment.
So let’s go back to OP with that in mind. Why should we care about these absurd comparisons, according to Sohl-Dickstein?
If the AI is powerful enough, and pursues its objectives inflexibly enough, then even a subtle misalignment might pose an existential risk to humanity. For instance, if an AI is tasked by the owner of a paperclip company to maximize paperclip production, and it is powerful enough, it will decide that the path to maximum paperclips involves overthrowing human governments, and paving the Earth in robotic paperclip factories.
There is an assumption behind this misalignment fear, which is that a superintelligent AI will also be supercoherent in its behavior^1. An AI could be misaligned because it narrowly pursues the wrong goal (supercoherence). An AI could also be misaligned because it acts in ways that don’t pursue any consistent goal (incoherence). Humans — apparently the smartest creatures on the planet — are often incoherent. We are a hot mess of inconsistent, self-undermining, irrational behavior, with objectives that change over time. Most work on AGI misalignment risk assumes that, unlike us, smart AI will not be a hot mess.
In this post, I experimentally probe the relationship between intelligence and coherence in animals, people, human organizations, and machine learning models. The results suggest that as entities become smarter, they tend to become less, rather than more, coherent. This suggests that superhuman pursuit of a misaligned goal is not a likely outcome of creating AGI.
So the argument here is something like:
1. an agent can be harmed by another, more powerful agent only if that more powerful agent is also more ‘coherent in planning’ than the first agent;
2. agents which are more powerful are less coherent (as purportedly proven by the empirical survey data etc.);
3. therefore, weaker agents can’t be harmed by more powerful agents;
4. humans are weaker agents than superintelligent AI, etc. etc., so humans can’t be harmed.
Put like this, Sohl-Dickstein’s argument is obviously wrong.
It is entirely unnecessary to get distracted arguing about #2 or asking whether there is too much rater error to bother with or calculating confidence intervals when his argument is so basically flawed to begin with. (You might say it is like trying to disprove AI risk by calculating exactly how hard quantum uncertainty makes predicting pinball. It is a precise answer to the wrong question.)
Leaving aside whether #2 is true, #1 is obviously false and proves too much: it is neither necessary nor sufficient to be more coherent in order to cause bad things to happen.
Consider the following parodic version of OP:
If the “human being” is powerful enough, and pursues its objectives inflexibly enough, then even a subtle misalignment might pose an existential risk to amoeba-kind. For instance, if a human is tasked by the owner of a food factory to maximize food production, and it is powerful enough, it will decide that the path to maximum hygienic food output involves mass genocide of amoeba-kind using bleach and soap, and paving the Earth in fields of crops.
There is an assumption behind this misalignment fear, which is that a superintelligent non-amoeba will also be supercoherent in its behavior^1. A human could be misaligned because it narrowly pursues the wrong goal (supercoherence). A human could also be misaligned because it acts in ways that don’t pursue any consistent goal (incoherence). Amoeba—apparently the smartest creatures on the planet that we amoeba know of—are often incoherent. We are a hot mess of inconsistent, self-undermining, irrational behavior, with objectives that change over time. Most work on human misalignment risk assumes that, unlike us, smart humans will not be a hot mess.
In this post, I experimentally probe the relationship between intelligence and coherence in microbial life, humans, human organizations, and machine learning models. The results suggest that as entities become smarter, they tend to become less, rather than more, coherent. This suggests that superamoeba pursuit of a misaligned goal is not a likely outcome of creating humanity.
Where, exactly, does this version go wrong? If ‘hot mess’ does not disprove this amoeba:human version, why does it disprove human:AGI in the original version?
Well, obviously, more powerful agents do harm or destroy less powerful agents all the time, even if those less powerful agents are more coherent or consistent. They do it by intent, they do it by accident, they do it in a myriad of ways, often completely unwittingly. You could be millions of times less coherent than an amoeba, in some sort of objective sense like ‘regret’ against your respective reward functions*, and yet, you are far more than millions of times more powerful than an amoeba is: you destroy more-coherent amoebas all the time by basic hygiene or by dumping some bleach down the drain or spilling some cleaning chemicals on the floor; and their greater coherence boots them nothing. I’ve accidentally damaged and destroyed thermostats that were perfectly coherent in their temperature behavior; it did them no good. ‘AlphaGo’ may be less coherent than a ‘linear CIFAR-10 classifier’, but nevertheless it still destroys weaker agents in a zero-sum war (Go). The US Congress may be ranked near the bottom for coherence, and yet, it still passes the laws that bind (and often harm) Sohl-Dickstein and myself and presumably most of the readers of this argument. And so on.
The power imbalances are simply so extreme that even some degradation in ‘coherence’ is still consistent with enormous negative consequences, both as deliberate plans, and also as sheer accidents or side-effects.
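To put a toy number on that (my own illustrative model, not anything from the OP): suppose expected harm scales roughly like the product of an agent’s power and how coherently that power is applied. Then incoherence only neutralizes a power advantage if coherence falls at least as fast as power rises:

```latex
% Toy model, purely illustrative: harm as a product of power and coherence.
\[
  \text{harm} \;\propto\; \text{power} \times \text{coherence}
\]
% For the more powerful but less coherent agent to pose no greater threat, we would need
\[
  \frac{\text{coherence}_{\text{strong}}}{\text{coherence}_{\text{weak}}}
  \;\le\;
  \frac{\text{power}_{\text{weak}}}{\text{power}_{\text{strong}}},
\]
% i.e. a million-fold power gap has to be matched by at least a million-fold drop in
% coherence just to break even.
```

Nothing about ranking humans as more of a ‘hot mess’ than amoebas tells you the trade-off is anywhere near that steep.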
(This reminds me a little of some mistaken arguments about central limit theorems. It is true that the more independent variables of a given size you add up, the smaller the randomness around the mean becomes as a percentage of the total; but what people tend to forget is that it keeps growing in absolute terms. So if you are trying to avoid an extreme outcome, like an insurance company trying to avoid ever taking a loss of more than $X, you cannot do so simply by insuring more contracts, because your worst cases keep getting bigger in dollar terms even as they become relatively less likely. The analogy here is that the more powerful the agents around you are, the more destructive their outliers become, even if there is some offsetting trend toward incoherence as well.)
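To make the relative-vs-absolute point concrete, here is a minimal simulation sketch (my own illustration, not from the OP; the exponential loss distribution and the sample counts are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "contract" incurs an i.i.d. loss ~ Exponential(mean = 1), in arbitrary units.
# As the number of contracts n grows, the *relative* spread of the total loss shrinks
# like 1/sqrt(n), but the *absolute* size of a bad year keeps growing like sqrt(n).
for n in (10, 100, 1_000):
    totals = rng.exponential(scale=1.0, size=(10_000, n)).sum(axis=1)  # 10,000 simulated years
    rel_sd = totals.std() / totals.mean()
    excess = np.quantile(totals, 0.999) - totals.mean()  # overshoot of a 1-in-1,000 year
    print(f"n={n:>5}  relative sd = {rel_sd:.3f}   99.9th-percentile excess = {excess:.1f}")
```

Under these assumptions the relative spread falls by about a factor of 10 going from 10 contracts to 1,000, while the absolute overshoot of the rare bad year grows by about a factor of 10.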
* Assuming you are willing to naively equate utilities of, e.g., amoebas and humans, it would then not be surprising if humans incurred vastly more total regret over a lifetime, because they have much longer lifetimes with much greater action-spaces, and most RL algorithms have regret bounds that scale with the number of actions & the time horizon. Which seems intuitive. If you have very few choices and your choices make little difference long-term, in part because there isn’t a long-term for you (amoebas live days to years, not decades), the optimal policy can’t improve much on whatever you do.
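For a concrete sense of that scaling, here is a standard textbook bandit bound (included purely as an illustration; nothing here is specific to the OP): the UCB algorithm on a K-armed bandit run for T steps has worst-case regret that grows with both the number of arms and the horizon.

```latex
% Distribution-independent regret bound for UCB on a K-armed bandit over horizon T
% (Auer et al. 2002): regret grows with both the number of actions K and the horizon T.
R_T \;=\; \mathcal{O}\!\left(\sqrt{K \, T \log T}\right)
```

An agent with a vastly larger action-space and a vastly longer horizon is therefore permitted far more total regret, even while running a near-optimal policy.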
To make any argument from lower coherence, you need to make it work with a much weaker version of #1, like ‘an agent is less likely to be harmed by a more powerful agent than that power differential implies if the more powerful agent is also less coherent’.
But this is a much harder argument: how much less? How fast does incoherence need to increase in order to more than offset the corresponding increase in power? How do you quantify that at all?
How do you know that this incoherence effect is not already incorporated into any existing estimate of ‘power’? (In fact, how would you estimate ‘power’ in any meaningful sense which doesn’t already incorporate an agent’s difficulty in being coherent? Do you have to invent some computable ‘supercoherent version’ of an agent to run? If you can do that, wouldn’t agents be highly incentivized to invent that supercoherent version of themselves to replace themselves with?)
You can retreat to, ‘well, the lower coherence means that any increased power will be applied less effectively and so the risk is at least epsilon lower than claimed’, but you can’t even show that this isn’t already absorbed into existing fudge factors and measurements and extrapolations and arguments etc. So you wind up claiming basically nothing at all.