Thanks for elaborating. There seem to be two different ideas:
1), that it is a promising strategy to try to constrain early AGI capabilities and knowledge
2), that even without such constraints, a paperclipper entails a smaller risk of worst-case outcomes with large amounts of disvalue, compared to a near miss. (Brian Tomasik has also written about this.)
1) is very plausible, perhaps even obvious, though as you say it’s not clear how feasible this will be. I’m not convinced of 2), even though I’ve heard / read many people expressing this idea. I think it’s unclear which would result in more disvalue in expectation. For instance, a paperclipper would have no qualms about threatening other actors (with something that we would consider disvalue), while a near miss might still have some, depending on what exactly the failure mode is. In terms of incidental suffering, it’s true that a near miss is more likely to do something involving human minds, but again it’s also possible that the system, despite the failure, is still compassionate enough to refrain from this, or to use digital anesthesia. (It all depends on what plausible failure modes look like, and that’s very hard to say.)