Thanks for elaborating. There seem to be two different ideas:
1) that it is a promising strategy to try to constrain early AGI capabilities and knowledge;
2) that even without such constraints, a paperclipper entails a smaller risk of worst-case outcomes with large amounts of disvalue, compared to a near miss. (Brian Tomasik has also written about this.)
1) is very plausible, perhaps even obvious, though as you say it’s not clear how feasible this will be. I’m not convinced of 2), even though I’ve heard / read many people expressing this idea. I think it’s unclear which would result in more disvalue in expectation. For instance, a paperclipper would have no qualms about threatening other actors (with something that we would consider disvalue), while a near-miss might still have such qualms, depending on what exactly the failure mode is. In terms of incidental suffering, it’s true that a near-miss is more likely to do something involving human minds, but again it’s also possible that the system is, despite the failure, still compassionate enough to refrain from this, or to use digital anesthesia. (It all depends on what plausible failure modes look like, and that’s very hard to say.)
I agree with you that the “stereotyped image of AI catastrophe” is not what failure will most likely look like, and it’s great to see more discussion of alternative scenarios. But why exactly should we expect that the problems you describe will be exacerbated in a future with powerful AI, compared to the state of contemporary human societies? Humans also often optimise for what’s easy to measure, especially in organisations. Is the concern that current ML systems are unable to optimise hard-to-measure goals, or goals that are hard to represent in a computerised form? That is true, but I think of this as a limitation of contemporary ML approaches rather than a fundamental property of advanced AI. With general intelligence, it should also be possible to optimise goals that are hard to measure.
Similarly, humans / companies / organisations regularly exhibit influence-seeking behaviour. This can cause harm, but it’s also usually possible to keep it in check, at least to a certain degree.
So, while you point at things that can plausibly go wrong, I’d say that these are perennial issues that may become better or worse during and after the transition to advanced AI, and it’s hard to predict what will happen. Of course, this does not make a very appealing tale of doom – but maybe it would be best to dispense with tales of doom altogether.
I’m also not yet convinced that “these capture the most important dynamics of catastrophe.” Specifically, I think the following are also potentially serious issues:
- Unfortunate circumstances in future cooperation problems between AI systems (and / or humans) result in widespread defection, leading to poor outcomes for everyone.
- Conflicts between key future actors (AI or human) result in large quantities of disvalue (agential s-risks).
- New technology leads to radical value drift of a form that we wouldn’t endorse.