What are the close-by arguments that are actually reasonable? Here is a list (not necessarily endorsed by me!):
1. On empirical updates from current systems: If current AI systems are broadly pretty easy to steer and this steering generalizes well, that should serve as some evidence that future, more powerful AI systems will also be relatively easy to steer. This would help prevent concerns like scheming from arising in the first place, or make these issues easier to remove.
   - This argument holds to some extent regardless of whether current AIs are smart enough to think through and successfully execute scheming strategies. For instance, imagine we were in a world where steering current AIs was clearly extremely hard: AIs would quickly overfit and goodhart training processes, RLHF would be finicky and have terrible sample efficiency, and AIs would be much worse at sample-efficiently updating on questions about human deontological constraints than on questions about how to successfully accomplish other tasks. In such a world, I think we should justifiably be more worried about future systems.
   - And in fact, people do argue about how hard it is to steer current systems and what this implies. For one version of an argument like this, see here, though note that I disagree with various parts of it.
   - It's pretty unclear what predictions Eliezer made about the steerability of future AI systems, and he should lose some credit for not making clear predictions. Further, my sense is that his implied predictions don't look great. (Paul's predictions as of around 2016 seem pretty good from my understanding, though they aren't that clearly laid out, and they are consistent with his threat models.)
2. On unfalsifiability: It should be possible to empirically produce evidence for or against scheming before it is too late. It is concerning that MIRI-style doom views often neither discuss predictions about experimental evidence nor make reasonably convincing arguments that it will be very hard to produce such evidence in test beds. It's a bad sign if advocates for a view don't try hard to make it falsifiable before that view implies aggressive action.
   - My view is that we'll probably get a moderate amount of evidence on scheming prior to catastrophe (perhaps a 3x update in either direction; see the sketch after this list for what such an update would do to my estimate), with some chance that scheming will basically be confirmed in an earlier model. And it is, in principle, possible to obtain certainty either way about scheming using experiments, though for various reasons this might be very tricky and a huge amount of work.
3. On empiricism: There isn't good empirical evidence for scheming; instead, the case for scheming depends on dubious arguments. Conceptual arguments have a bad track record, so to estimate the probability of scheming we should mostly guess based on the most basic and simple conceptual arguments and give more complex arguments very little weight. If you do this, scheming looks unlikely.
   - I roughly half agree with this argument, but I'd note that you also have to discount conceptual arguments against scheming in the same way, and that the basic and simple conceptual arguments seem to indicate that scheming isn't that unlikely. (I'd say around 25%.)
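To make the numbers above concrete, here is a minimal sketch (my own illustration, not something stated in the original comment) of what a 3x likelihood-ratio update does to the roughly 25% estimate from point 3, using odds-form Bayes. The pairing of these two particular numbers and the `update` helper are assumptions for illustration only.

```python
# Illustrative only: odds-form Bayesian update. "3x update either way" is read
# as a likelihood ratio of 3 (or 1/3), applied to a ~25% prior on scheming.

def update(prior: float, likelihood_ratio: float) -> float:
    """Return the posterior probability after multiplying the prior odds by likelihood_ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

prior = 0.25                 # rough estimate from basic conceptual arguments
print(update(prior, 3))      # evidence pointing toward scheming  -> 0.5
print(update(prior, 1 / 3))  # evidence pointing against scheming -> 0.1
```

So, under these assumptions, a 3x update either way would move a 25% estimate to roughly 50% or roughly 10%: a meaningful but not decisive shift.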
I basically endorse argument 1. One other important update you haven't mentioned is that human values turn out to be less complicated and fragile, and more generalizable, than people thought: data about human values is likely only a small part of GPT-4's training data, yet GPT-4 can correctly answer a lot of morality questions, and I think LLMs are genuinely learning new regularities here, so they can generalize from their training data.
Implications for AI risk of course abound.