Another risk from bugs comes not from the AGI system caring incorrectly about our values, but from having inadequate security. If our values are accurately encoded in an AGI system that cares about satisfying them, they become a target for threats from other actors who can gain from manipulating the first system.
I agree that this is a serious risk, but I wouldn’t categorise it as a “risk from bugs”. Every actor with goals faces the possibility that other actors may attempt to gain bargaining leverage by threatening to deliberately thwart those goals. So this does not require bugs; rather, the problem arises by default for any actor (human or AI), and I think there’s no obvious solution. (I’ve written about surrogate goals as a possible solution for at least some parts of the problem.)
the very worst outcomes seem more likely if the system was trained using human modelling because these worst outcomes depend on the information in human models.
What about the possibility that the AGI system threatens others, rather than being threatened itself? Prima facie, that might also lead to worst-case outcomes. Do you envision a system that’s not trained using human modelling and therefore just wouldn’t know enough about human minds to make any effective threats? I’m not sure how an AI system can meaningfully be said to have “human-level general intelligence” and yet be completely inept in this regard. (Also, if you have such fine-grained control over what your system does or does not know about, or if you can have it do very powerful things without possessing dangerous kinds of knowledge and abilities, then I think many commonly discussed AI safety problems become non-issues anyway, as you can just constrain the system accordingly.)
What about the possibility that the AGI system threatens others, rather than being threatened itself? Prima facie, that might also lead to worst-case outcomes.
I think a good intuition pump for this idea is to contrast an arbitrarily powerful paperclip maximizer with an arbitrarily powerful something-like-happiness maximizer.
A paperclip maximizer might resort to threats to get what it wants; and in the long run, it will want to convert all resources into paperclips and infrastructure, to the exclusion of everything humans want. But the “normal” failure modes here tend to look like human extinction.
In contrast, a lot of “normal” failure modes for a something-like-happiness maximizer might look like torture, because the system is trying to optimize something about human brains, rather than just trying to remove humans from the picture so it can do its own thing.
Do you envision a system that’s not trained using human modelling and therefore just wouldn’t know enough about human minds to make any effective threats? I’m not sure how an AI system can meaningfully be said to have “human-level general intelligence” and yet be completely inept in this regard.
I don’t know specifically what Ramana and Scott have in mind, but I’m guessing it’s a combination of:
If the system isn’t trained using human-related data, its “goals” (or the closest things to goals it has) are more likely to look like the paperclip maximizer above, and less likely to look like the something-like-happiness maximizer. This greatly reduces downside risk if the system becomes more capable than we intended.
When AI developers build the first AGI systems, the right move will probably be to keep their capabilities to a bare minimum — the minimum often stated in this context is “make your system just capable enough to help make sure the world’s AI doesn’t cause an existential catastrophe in the near future”. If that minimal goal doesn’t require fluency with certain high-risk domains, then developers should just avoid letting their AGI systems learn about those domains, at least until they’ve gotten a lot of experience with alignment.
The first developers are in an especially tough position, because they have to act under more time pressure and they’ll have very little experience with working AGI systems. As such, it makes sense to try to make their task as easy as possible. Alignment isn’t all-or-nothing, and being able to align a system with one set of capabilities doesn’t mean you can do so for a system with stronger or more varied capabilities.
If you want to say that such a system isn’t technically a “human-level general intelligence”, that’s fine; the important question is about impact rather than definitions. To be clear, when I say “AGI” I mean something like “a system that’s doing qualitatively the right kind of reasoning to match human performance in arbitrary domains, in large enough quantities to be competitive in domains like software engineering and theoretical physics”, not “a system that can in fact match human performance in arbitrary domains”.
(Also, if you have such fine-grained control over what your system does or does not know about, or if you can have it do very powerful things without possessing dangerous kinds of knowledge and abilities, then I think many commonly discussed AI safety problems become non-issues anyway, as you can just constrain the system [accordingly].)
Yes, this is one of the main appeals of designing systems that (a) make it easy to blacklist or whitelist certain topics, (b) make it easy to verify that the system really is or isn’t thinking about a particular domain, and (c) make it easy to blacklist human modeling in particular. It’s a very big deal if you can just sidestep a lot of the core difficulties in AI safety (in your earliest AGI systems). E.g., operator manipulation, deception, mind crime, and some aspects of the fuzziness and complexity of human value.
We don’t currently know how to formalize ideas like ‘whitelisting cognitive domains’, however, and even given a solution to those problems, we don’t know in principle how to align an AGI system for much more modest tasks.
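Purely as a toy illustration of the crudest, data-level version of “blacklist human modeling” (and emphatically not a formalization of whitelisting cognitive domains, which is the part we don’t know how to do), a sketch along the following lines gestures at the shape of the idea; every name in it (HUMAN_RELATED_TERMS, is_human_related, filter_corpus) is hypothetical:

```python
# Toy sketch only: a naive keyword blacklist applied to a training corpus.
# This illustrates the crude data-level version of "don't let the system
# learn about humans"; it is not a formalization of whitelisting cognitive
# domains, and all names here are hypothetical.

HUMAN_RELATED_TERMS = {"person", "people", "human", "emotion", "belief", "society"}

def is_human_related(document: str) -> bool:
    """Flag a document that mentions any blacklisted human-related term."""
    words = set(document.lower().split())
    return bool(words & HUMAN_RELATED_TERMS)

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that pass the crude blacklist check."""
    return [doc for doc in documents if not is_human_related(doc)]

if __name__ == "__main__":
    corpus = [
        "A proof of the fundamental theorem of algebra.",
        "People often form beliefs about other people's intentions.",
        "Gradient descent minimizes a differentiable loss function.",
    ]
    print(filter_corpus(corpus))  # the human-related document is dropped
```

The gap between filtering training text by keywords and actually verifying what a trained system is reasoning about is, of course, exactly where the hard, unformalized part lives.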
Thanks for elaborating. There seem to be two different ideas:
1) that it is a promising strategy to try to constrain the capabilities and knowledge of early AGI systems;
2) that even without such constraints, a paperclipper carries a smaller risk of worst-case outcomes involving large amounts of disvalue than a near miss does. (Brian Tomasik has also written about this.)
1) is very plausible, perhaps even obvious, though as you say it’s not clear how feasible this will be. I’m not convinced of 2), even though I’ve heard or read many people express this idea. I think it’s unclear which would result in more disvalue in expectation. For instance, a paperclipper would have no qualms about threatening other actors (with something that we would consider disvalue), while a near miss might still have such qualms, depending on what exactly the failure mode is. In terms of incidental suffering, it’s true that a near miss is more likely to do something involving human minds, but again it’s also possible that the system, despite the failure, is still compassionate enough to refrain from this, or to use digital anesthesia. (It all depends on what plausible failure modes look like, and that’s very hard to say.)