I find this essay interesting as a case study in discourse and argumentation norms. Particularly as a case study of issues with discourse around AI risk.
When I first skimmed this essay when it came out, I thought it was ok, but mostly uninteresting or obvious. Then, on reading the comments and looking back at the body, I thought it did some pretty bad strawmanning.
I reread the essay yesterday and now I feel quite differently. Parts (i), (ii), and (iv) which don’t directly talk about AI are actually great and many of the more subtle points are pretty well executed. The connection to AI risk in part (iii) is quite bad and notably degrades the essay as a whole. I think a well-executed connection to AI risk would have been good. Part (iii) seems likely to contribute to AI risk being problematically politicized and negatively polarized (e.g. low quality dunks and animosity). Further, I think this is characteristic of problems I have with the current AI risk discourse.
In parts (i), (ii), and (iv), it is mostly clear that the Spokesperson is an exaggerated straw person who doesn’t correspond to any particular side of an issue. This seems like a reasonable rhetorical move to better explain a point. However, part (iii) has big issues in how it connects the argument to AI risk. Eliezer ends up defeating a specific and weak argument against AI risk. This is an argument that actually does get made, but unfortunately, he both associates this argument with the entire view of AI risk skepticism (in the essay, the “AI-permitting faction”) and he fails to explain that the debunking doesn’t apply to many common arguments which sound similar but are actually reasonable. Correspondingly, the section suffers from an ethnic tension style issue: in practice, it attacks an entire view by associating it with a bad argument for that view (but of course, reversed stupidity is not intelligence). This issue is made worse because the argument Eliezer attacks is also very similar to various more reasonable arguments that can’t be debunked in the same way and Eliezer doesn’t clearly call this out. Thus, these more reasonable arguments are attacked in association. It seems reasonable for Eliezer to connect the discussion to AI risk directly, but I think the execution was poor.
I think my concerns are notably similar to the issues people had with “The Sun is big, but superintelligences will not spare Earth a little sunlight”, and I’ve encountered similar issues in many things written by Eliezer and Nate.
How could Eliezer avoid these issues? I think when debunking or arguing against bad arguments, you should explain that you’re attacking bad arguments and that there exist other commonly made arguments which are better or at least harder to debunk. It also helps to disassociate the specific bad arguments from a general cause or view as much as possible. This essay seems to associate the bad “Empiricism!” argument with AI risk skepticism nearly as much as possible. Whenever you push back against an argument which is similar or similar-sounding to other arguments, but the pushback doesn’t apply in the same way to those other arguments, it’s useful to explicitly spell out the limitations of the pushback.[1] One possible bar to aim for: people who disagree strongly on the topic due to other, more reasonable arguments should feel happy to endorse the pushback against those specific bad arguments.
There is perhaps a bit of a motte-and-bailey with this essay, where Eliezer can strongly defend debunking a specific bad argument (the motte), but there is an implication that the argument also pushes against more complex and less clearly bad arguments (the bailey). (I’m not saying that Eliezer actively engages in motte-and-bailey in the essay, just that the essay probably has this property to some extent in practice.) That said, there is also perhaps a motte-and-bailey for many of the arguments that Eliezer argues against, where the motte is a more complex and narrow argument and the bailey is “Empiricism! is why AI risk is fake”.
A thoughtful reader can recognize and avoid their own cognitive biases and correctly think through the exact implications of the arguments made here. But I wish Eliezer had done this work for the reader to reduce negative polarization.
When debunking bad arguments, it’s useful to be clear about what you aren’t covering or implying.
Part (iv) helps to explain the scope of the broader point, but doesn’t explain the limitations specifically in the AI case.
What are the close-by arguments that are actually reasonable? Here is a list of close-by arguments (not necessarily endorsed by me!):
1. On empirical updates from current systems: If current AI systems are broadly pretty easy to steer and there is good generalization of this steering, that should serve as some evidence that future more powerful AI systems will also be relatively easier to steer. This will help prevent concerns like scheming from arising in the first place or make these issues easier to remove.
This argument holds to some extent regardless of whether current AIs are smart enough to think through and successfully execute scheming strategies. For instance, imagine we were in a world where steering current AIs was clearly extremely hard: AIs would quickly overfit and goodhart training processes, RLHF was finicky and had terrible sample efficiency, and AIs were much worse at sample-efficiently updating on questions about human deontological constraints relative to questions about how to successfully accomplish other tasks. In such a world, I think we should justifiably be more worried about future systems.
And in fact, people do argue about how hard it is to steer current systems and what this implies. For an example of a version of an argument like this, see here, though note that I disagree with various things.
It’s pretty unclear what predictions Eliezer made about the steerability of future AI systems, and he should lose some credit for not making clear predictions. Further, my sense is that his implied predictions don’t look great. (Paul’s predictions as of around 2016 seem pretty good from my understanding, though they weren’t that clearly laid out, and they are consistent with his threat models.)
2. On unfalsifiability: It should be possible to empirically produce evidence for or against scheming prior to it being too late. The fact that MIRI-style doom views often don’t discuss predictions about experimental evidence and also don’t make reasonably convincing arguments that it will be very hard to produce evidence in test beds is concerning. It’s a bad sign if advocates for a view don’t try hard to make it falsifiable prior to that view implying aggressive action.
My view is that we’ll probably get a moderate amount of evidence on scheming prior to catastrophe (perhaps 3x update either way), with some chance that scheming will basically be confirmed in an earlier model. And, it is in principle possible to obtain certainty either way about scheming using experiments, though this might be very tricky and a huge amount of work for various reasons.
3. On empiricism: There isn’t good empirical evidence for scheming and instead the case for scheming depends on dubious arguments. Conceptual arguments have a bad track record, so to estimate the probability of scheming we should mostly guess based on the most basic and simple conceptual arguments and weight more complex arguments very little. If you do this, scheming looks unlikely.
I roughly half agree with this argument, but I’d note that you also have to discount conceptual arguments against scheming in the same way and that the basic and simple conceptual arguments seem to indicate that scheming isn’t that unlikely. (I’d say around 25%.)
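To make the “3x update either way” and the ~25% number above concrete, here’s a rough back-of-the-envelope odds calculation (an illustrative sketch, not a precise claim):

$$
\text{prior odds} = \frac{0.25}{1 - 0.25} = \frac{1}{3}, \qquad \frac{1}{3} \times 3 = 1 \;\Rightarrow\; P(\text{scheming}) = 50\%, \qquad \frac{1}{3} \times \frac{1}{3} = \frac{1}{9} \;\Rightarrow\; P(\text{scheming}) = 10\%.
$$

So “a moderate amount of evidence” cashes out to something like moving from ~25% to roughly 50% or 10%, depending on which way the evidence points.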
I basically endorse argument 1, and one important update you haven’t mentioned is that human values turn out to be less complicated, less fragile, and more generalizable than people thought (human values data is likely only a small part of GPT-4’s training data, and yet it can correctly answer a lot of morality questions; I think LLMs are genuinely learning new regularities here, so they can generalize from their training data).
Implications for AI risk of course abound.
Another way to put this is that posts should often discuss their limitations, particularly when debunking bad arguments that are similar to more reasonable arguments.
I think discussing limitations clearly is a reasonable norm for scientific papers that reduces the extent to which people intentionally or unintentionally get away with implying their results prove more than they do.