I’m reasonably compelled by Sperber and Mercier’s claim that explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments.
Seems obviously false. If we simplistically imagine humans as being swayed by, and separately arguing, an increasingly sophisticated series of argument types that we could label 0, 1, 2, …N, N+1, each encoded in a single allele that somehow arose to fixation, then the capacity to initially recognize and be swayed by a type N+1 argument is a disadvantage when it comes to winning a type N argument using internal sympathy with the audience’s viewpoint, because when that mutation happens for the first time, the other people in the tribe will not find N+1-type arguments compelling, and you do, which leads you to make intuitive mistakes about what they will find compelling. Only after the capacity to recognize type N+1 arguments as good arguments becomes pervasive in other listeners does the ability to search for type-N+1 arguments congruent to some particular political or selfish purpose become a fitness advantage. Even if we have underlying capabilities to automatically search for political/selfish arguments of all types we currently recognize as good, this just makes the step from N+1 recognition to N+1 search simultaneous within an individual organism. It doesn’t change the logic whereby going from N to N+1 in the sequence of recognizably good arguments must have some fitness advantage that is not “in order to win arguments” in order for individuals bearing the N+1 allele to have a fitness advantage over individuals who only have the alleles up to N, because being swayed by N+1 is not an advantage in argument until other individuals have that allele too.
In real life we have a deep pool of fixed genes with a bubbling surface of multiple genes under selection, along with complicated phenotypical interactions etcetera, but none of this changes the larger point so far as I can tell: a bacterium or a mouse has little ability to be swayed by arguments of the sort humans exchange with each other, which defines their lack of reasoning ability more than any difficulty in coming up with good arguments does; and an ability to be swayed by an argument of whatever type must be present before there’s any use in improving a search for arguments that meet that criterion. In other words, the journey from the kind of arguments that bacteria recognize, to the kind of arguments that humans recognize, cannot have been driven by an increasingly powerful search for political arguments that appeal to bacteria.
Even if the key word is supposed to be ‘explicit’, we can apply a similar logic to the ability to be swayed by an ‘explicit’ thought and the ability to search for explicit thoughts that sway people.
If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn’t be any reason for “the kind of arguments humans can be swayed by” to work to build a spaceship. We’d just end up with some arbitrary set of rules fixed in place. False cynicism.
Now, consider the following simplistic model for naive (un)aligned AGI:
The AGI outputs English sentences. Each time the AGI does, the human operator replies on a scale of 1 to 100 with how good and valuable and useful that sentence seemed to the human. The human may also input other sentences to the AGI as a hint about what kind of output the human is currently looking for; and the AGI also has purely passive sensory inputs like a fixed webcam stream or a pregathered internet archive.
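To make the setup concrete, here is a minimal sketch of that interaction loop; the `agi` and `human` objects and their methods (`generate`, `rate`, `update`) are hypothetical stand-ins rather than any existing interface.

```python
# A minimal sketch of the rating loop described above. The `agi`, `human`,
# and `passive_inputs` objects and their methods are hypothetical stand-ins,
# not an existing API.

class RatingLoop:
    def __init__(self, agi, human, passive_inputs):
        self.agi = agi                        # produces English sentences
        self.human = human                    # rates each sentence from 1 to 100
        self.passive_inputs = passive_inputs  # e.g. webcam stream or web archive

    def run_round(self, hint=None):
        # The operator may optionally supply a hint about what they want.
        sentence = self.agi.generate(hint=hint, context=self.passive_inputs)
        reward = self.human.rate(sentence)    # integer in [1, 100]
        self.agi.update(sentence, reward)     # e.g. a short-sighted RL update
        return sentence, reward
```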
How does this fail as an alignment methodology? Doesn’t this fit very neatly into the existing prosaic methodology of reinforcement learning? Wouldn’t it be very useful to have on hand an intelligence which gives us much more valuable sentences, in response to input, than the sentences that would be generated by a human?
There are a number of general or generic ways to fail here that aren’t specific to the supposed criterion of the reinforcement learning system, like the AGI ending up with other internal goals and successfully forging a Wifi signal via internal RAM modulation etcetera, but let’s consider failures that are in some sense intrinsic to this paradigm even if the internal AI ends up perfectly aligned on that goal. Let’s even skip over the sense in which we’ve given the AI a long-term incentive to accept some lower rewards in the short term, in order to grab control of the rating button, if the AGI ends up with long-term consequentialist preferences and long-term planning abilities that exactly reflect the outer fitness function. Let’s say that the AGI is only shortsightedly trying to maximize sentence reward on each round and that it is not superintelligent enough to disintermediate the human operators and grab control of the button in one round without multi-round planning. Some of us might be a bit skeptical about whether you can be, effectively, very much smarter than a human, in the first place, without doing some kind of multi-round or long-term internal planning about how to think about things and allocate internal resources; but fine, maybe there’s just so many GPUs running the system that it can do all of its thinking, for each round, on that round. What intrinsically goes wrong?
What intrinsically goes wrong, I’d say, is that the human operators have an ability to recognize good arguments that’s only rated to withstand up to a certain intensity of search, which will break down beyond that point. Our brains’ ability to distinguish good arguments from bad arguments is something we’d expect to be balanced to the kind of argumentative pressure a human brain was presented with in the ancestral environment / environment of evolutionary adaptedness, and if you optimize against a brain much harder than this, you’d expect it to break. There’d be an arms race between politicians exploiting brain features to persuade people of things that were useful to the politician, and brains that were, among other things, trying to pursue the original ‘reasons’ for reasoning, the ones that initially made it useful to recognize certain arguments as good arguments before any politicians were trying to exploit them. Again, oversimplified, and there are cases where it’s not tribally good for you to be the only person who sees a politician’s lie as a lie; but the broader point is that there’s going to exist an ecological balance in the ancestral environment between brains trying to persuade other brains, and brains trying to do locally fitness-enhancing cognition while listening to persuaders; and this balance is going to be tuned to the level of search power that politicians had in the environment of evolutionary adaptedness.
Arguably, this is one way of viewing the flood of modern insanity of which social media seems to be the center. For the same reason pandemics get more virulent with larger and denser population centers, Twitter may be selecting for memes that break humans at a much higher level of optimization pressure than held in the ancestral environment, or even just 1000 years ago.
Viewed through the lens of Goodhart’s Curse: When you have an imperfect proxy U for an underlying value V, the highest values of U will represent the places where U diverges upward the most from V and not just the highest underlying values of V. The harder you search for high values of U, the wider the space of possibilities you search, the more that the highest values of U will diverge upwards from V.
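As a toy illustration of that claim (an assumption-laden sketch, not from the original text): suppose U = V + D, where V is the true value and D the proxy error, both independent unit Gaussians. Then the more candidates we search over before taking the best-looking one, the larger the winner’s divergence d tends to be:

```python
# Toy Goodhart's Curse simulation, assuming U = V + D with V, D ~ N(0, 1)
# independent. Widening the search (more candidates) inflates the winner's
# divergence d, i.e. how far U overstates V at the selected point.
import numpy as np

rng = np.random.default_rng(0)
trials = 200

for n in [10, 1_000, 100_000]:
    v_best, d_best = [], []
    for _ in range(trials):
        v = rng.normal(size=n)          # true value V
        d = rng.normal(size=n)          # proxy error D
        i = int(np.argmax(v + d))       # pick the highest-U candidate
        v_best.append(v[i])
        d_best.append(d[i])
    print(f"n={n:>7}  mean v* = {np.mean(v_best):.2f}  mean d* = {np.mean(d_best):.2f}")
```

Here the winner’s v and d grow together because their variances are equal; the relevant point is that d, the amount by which the proxy overstates the true value at the selected point, keeps growing as the search widens rather than averaging out.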
### incorporated from a work in progress
Suppose that, in the earlier days of the Web, you’re trying to find webpages with omelet recipes. You have the stunning insight that webpages with omelet recipes often contain the word “omelet” somewhere in them. So you build a search engine that follows URLs to crawl as much of the Web as you can find, indexing all the pages by the words they contain; and then you search for the “omelet” keyword. Works great the first time you try it! Maybe some of the pages are quoting “You can’t make an omelet without breaking eggs” (said by Robespierre, allegedly), but enough pages have actual omelet recipes that you can find them by scrolling down. Better yet, assume that pages that contain the “omelet” keyword more often are more likely to be about omelets. Then you’re fine… in the first iteration of the game.
But the thing is: the easily computer-measurable fact of whether a page contains the “omelet” keyword is not identical to the fact of whether it has the tasty omelet recipe you seek. V, the true value, is whether a page has a tasty recipe for omelets; U, the proxy measure, is how often the page mentions the “omelet” keyword. That some pages are compendiums of quotes from French revolutionaries, instead of omelet recipes, illustrates that U and V can diverge even in the natural ecology.
But once the search engine is built, we are not just listing possible pages and their U-measure at random; we are picking pages with the highest U-measure we can see. If we name the divergence D = U-V then we can say u_i = v_i + d_i. This helps illustrate that by selecting for the highest u_i we can find, we are putting upward selection pressure on both v_i and d_i. We are implicitly searching out, first, the underlying quality V that U is a proxy for, and second, places where U diverges far upward from V, that is, places where the proxy measure breaks down.
If we are living in an unrealistically smooth world where V and D are independent Gaussian distributions with mean 0 and variance 1, then the mean and variance of U is just 0 and 2 (the sum of the means and variances of V and D). If we randomly select an element with u_i=3, then on average it has v_i of 1.5 and d_i of 1.5. If the variance of V is 1 and the variance of D is 10 - if the “noise” from V to U varies much more widely on average than V itself—then most of the height of a high-U item probably comes from a lot of upward noise. But not all of it. On average, if you pick out an element with u_i = 11, it has expected d_i of 10 and v_i of 1; its apparent greatness is mostly noise. But still, the expected v_i is 1, not the average V of 0. The best-looking things are still, in expectation, better than average. They are just not as good as they look.
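For reference, these numbers follow from the standard formula for conditioning one jointly Gaussian variable on another; a sketch of the math, assuming V and D are independent and zero-mean as above:

```latex
% With V ~ N(0, \sigma_V^2), D ~ N(0, \sigma_D^2) independent and U = V + D:
\mathbb{E}[V \mid U = u] = \frac{\sigma_V^2}{\sigma_V^2 + \sigma_D^2}\, u,
\qquad
\mathbb{E}[D \mid U = u] = \frac{\sigma_D^2}{\sigma_V^2 + \sigma_D^2}\, u.
```

Plugging in variances of 1 and 1 with u = 3 gives 1.5 and 1.5; plugging in variances of 1 and 10 with u = 11 gives 1 and 10, matching the figures above.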
Ah, but what if everything isn’t all Gaussian distributions? What if there are some regions of the space where D has much higher variance—places where U is much more prone to error as a proxy measure of V? Then selecting for high U tends to steer us to regions of possibility space where U is most mistaken as a measure of V.
And in nonsimple domains, the wider the region of possibility we search, the more likely this is to happen; the more likely it is that some part of the possibility space contains a place where U is a bad proxy for V.
This is an abstract (and widely generalizable) way of seeing the Fall of Altavista. In the beginning, the programmers noticed that naturally occurring webpages containing the word “omelet” were more likely to be about omelets. It is very hard to measure whether a webpage contains a good, coherent, well-written, tasty omelet recipe (what users actually want), but very easy to measure how often a webpage mentions the word “omelet”. And the two facts did seem to correlate (webpages about dogs usually didn’t mention omelets at all). So Altavista built a search engine accordingly.
But if you imagine the full space of all possible text pages, the ones that mention “omelet” most often are not pages with omelet recipes. They are pages with lots of sections that just say “omelet omelet omelet” over and over. In the natural ecology these webpages did not, at first, exist to be indexed! It doesn’t matter that possibility-space is uncorrelated in principle, if we’re only searching an actuality-space where things are in fact correlated.
But once lots of people started using (purely) keyword-based searches for webpages, and frequently searching for “omelet”, spammers had an incentive to reshape their Viagra sales pages to contain “omelet omelet omelet” paragraphs.
That is: Once there was an economic incentive for somebody to make the search engine return a different result, the spammers began to intelligently search for ways to make U return a high result, and this implicitly meant putting the U-V correlation to a vastly stronger test. People naturally making webpages had not previously generated lots of webpages that said “omelet omelet omelet Viagra”. U looked well-correlated with V in the region of textual possibility space that corresponded to the Web’s status quo ante. But when an intelligent spammer imagines a way to try to steer users to their webpage, their imagination is searching through all the kinds of possible webpages they can imagine constructing; they are searching for imaginable places where U-V is very high and not just previously existing places where U-V is high. This means searching a much wider region of possibility for any place where U-V breaks down (or rather, breaks upward) which is why U is being put to a much sterner test.
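To make the U-versus-V gap concrete, here is a toy version of the keyword-count proxy with made-up page contents (nothing here is real web data):

```python
# A toy version of the keyword-count proxy discussed above. The scoring
# function U is just "how many times does the page say 'omelet'"; the page
# contents are invented examples.

def u_score(page: str) -> int:
    return page.lower().split().count("omelet")

recipe_page = (
    "Whisk three eggs, add butter to a hot pan, and fold the omelet "
    "gently once the eggs begin to set."
)
spam_page = "omelet " * 50 + "buy cheap Viagra here"

pages = {"recipe": recipe_page, "spam": spam_page}
ranking = sorted(pages, key=lambda name: u_score(pages[name]), reverse=True)
print(ranking)  # ['spam', 'recipe'] -- the proxy U is maximized by the spam page
```

In the natural ecology of hand-written pages the keyword count tracks actual recipes reasonably well; the spam page is exactly the kind of point in possibility space where the proxy breaks upward.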
We can also see issues in computer security from a similar perspective: Regularities that are observed in narrow possibility spaces often break down in wider regions of the possibility space that can be searched by intelligent optimization. Consider how weird a buffer overflow attack would look, relative to a more “natural” ecology of program execution traces produced by non-malicious actors. Not only does the buffer overflow attack involve an unnaturally huge input, it’s a huge input that overwrites the stack return address in a way that improbably happens to go to one of the most effectual possible destinations. A buffer overflow that results in root privilege escalation might not happen by accident inside a vulnerable system even once before the end of the universe. But an intelligent attacker doesn’t search the space of only things that have already happened, they use their intelligence to search the much wider region of things that they can imagine happening. It says very little about the security of a computer system to say that, on average over the lifetime of the universe, it will never once yield up protected data in response to random inputs or in response to inputs typical of the previously observed distribution.
And the smarter the attacker, the wider the space of system execution traces it can effectively search. Very sophisticated attacks can look like “magic” in the sense that they exploit regularities you didn’t realize existed! As an example in computer security, consider the Rowhammer attack, where repeatedly writing to unprotected RAM causes a nearby protected bit to flip. This violates what you might have thought were the basic axioms governing the computer’s transistors. If you didn’t know the trick behind Rowhammer, somebody could show you the code for the attack, and you just wouldn’t see any advance reason why that code would succeed. You would not predict in advance that this weird code would successfully get root privileges, given all the phenomena inside the computer that you currently know about. This is “magic” in the same way that an air conditioner is magic in 1000 AD. It’s not just that the medieval scholar hasn’t yet imagined the air conditioner. Even if you showed them the blueprint for the air conditioner, they wouldn’t see any advance reason to predict that the weird contraption would output cold air. The blueprint is exploiting regularities like the pressure-temperature relationship that they haven’t yet figured out.
To rephrase back into terms of Goodhart’s Law as originally said by Goodhart—“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes”—statistical regularities that previously didn’t break down in the face of lesser control pressures, can break down in the face of stronger control pressures that effectively search a wider range of possibilities, including possibilities that obey rules you didn’t know were rules. This is more likely to happen the more complicated and rich and poorly understood the system is...
### end of quote
...which is how we can be nearly certain, even in advance of knowing the exact method, that a sufficiently strong search against a rating output by a complicated rich poorly-understood human brain will break that brain in ways that we can’t even understand.
Even if everything goes exactly as planned on an internal level inside the AGI, which in real life is at least 90% of the difficulty, the outer control structure of the High-Rated Sentence Producer is something that, on the face of it, learns to break the operator. The fact that it’s producing sentences more highly rated than a human inside the same box, the very fact that makes the High-Rated Sentence Producer possibly useful in the first place, implies that it’s searching harder against the rating criterion than a human does. Human ratings are imperfect proxies for validity, accuracy, estimated expectation of true value produced by a policy, etcetera. Human brains are rich and complicated and poorly understood. Such integrity as they possess is probably nearly in balance with the ecological expectation of encountering persuasive invalid arguments produced by other human-level intelligences. We should expect with very high probability that if the HRSP searches hard enough against the rating, it will break the brain behind it.
I think I agree with all of this. In fact, this argument is one reason why I think Debate could be valuable, because it will hopefully increase the maximum complexity of arguments that humans can reliably evaluate.
This eventually fails at some point, but hopefully it fails after the point at which we can use Debate to solve alignment in a more scalable way. (I don’t have particularly strong intuitions about whether this hope is justified, though.)
If the High-Rated Sentence Producer was restricted to output only single steps of a mathematical proof and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn’t expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don’t need to increase in complexity.
As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently (as in the recent OpenAI debate work), such that the debate AI can tackle more complicated questions with steps that are restricted to the complexity that humans can currently work with. Hard to know if this is possible, but still seems worth trying to work on.
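As a rough sketch of what “evaluated independently” could look like mechanically (the `Step` structure and `human_judge` callable below are hypothetical, not taken from the OpenAI debate work):

```python
# A rough sketch of independent single-step evaluation. The judge is a
# callable that sees only one step at a time, never the rest of the argument.

from dataclasses import dataclass

@dataclass
class Step:
    premise: str      # the claim this step starts from
    conclusion: str   # the claim this step asserts follows

def evaluate_argument(steps: list[Step], human_judge) -> bool:
    # Each step is judged in isolation; the judge never sees previous steps,
    # so no single query exceeds the complexity a human can handle.
    return all(human_judge(step.premise, step.conclusion) for step in steps)
```

The design intent is that no single call to the judge ever depends on more context than one step, so the complexity of any individual judgment stays bounded even as the overall argument grows.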
One might argue:
We don’t need the model to use that much optimization power, to the point where it breaks the operator. We just need it to perform roughly at human-level, and then we can just deploy many instances of the trained model and accomplish very useful things (e.g. via factored cognition).
So I think it’s important to also note that getting a neural network to “perform roughly at human-level in an aligned manner” may be a much harder task than getting a neural network to achieve maximal rating by breaking the operator. The former may be a much narrower target. This point is closely related to what you wrote here in the context of amplification:
Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like “being smart” and “being a good person” and “still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy”, is a pretty huge ask.
It seems to me obvious, though this is the sort of point where I’ve been surprised about what other people don’t consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku’s Go play so well that a scholar couldn’t tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator’s abilities in addition to your own.
Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. [...]
If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn’t be any reason for “the kind of arguments humans can be swayed by” to work to build a spaceship. We’d just end up with some arbitrary set of rules fixed in place.
I agree with this. My position is not that explicit reasoning is arbitrary, but that it developed via an adversarial process where arguers would try to convince listeners of things, and then listeners would try to distinguish between more and less correct arguments. This is in contrast with theories of reason which focus on the helpfulness of reason in allowing individuals to discover the truth by themselves, or theories which focus on its use in collaboration.
Here’s how Sperber and Mercier describe their argument:
Reason is not geared to solitary use, to arriving at better beliefs and decisions on our own. What reason does, rather, is help us justify our beliefs and actions to others, convince them through argumentation, and evaluate the justifications and arguments that others address to us.
I can see how my summary might give a misleading impression; I’ll add an edit to clarify. Does this resolve the disagreement?
To check if I understand correctly, you’re arguing that the selection pressure to use argument in order to win requires the ability to be swayed by arguments, and the latter already requires explicit reasoning?
That seems convincing as a counter-argument to “explicit reasoning in humans primarily evolved not in order to help us find out about the world, but rather in order to win arguments”, but I’m not knowledgeable enough about the work quoted to check whether they have a more subtle position.