It looks to me like this conversation is to some extent repeating a pattern which I’ve seen in AI safety conversations before:
Safety advocate: AI might destroy us if it doesn’t have the right safeguards.
Safety skeptic: That’s stupid, because why would anyone build it without those safeguards.
It feels like people keep talking past each other, since both essentially agree about the need for safeguards. Rather, the disagreement seems to be over something more like “does the default path of AI development involve existential risks or not?”. The safety advocate argues that we should be thinking about this a lot beforehand, much more than with other technologies. The skeptic, on the other hand, sees AI as much more comparable to any other technology: there are risks, and there will probably be accidents until we figure out how to do it safely, but we will do that figuring out as a normal part of developing the technology, and we can’t really do much of it until we actually have the technology.
My view is that you have to build an AI with a bunch of safeguards to stop it destroying *itself* while it doesn’t have great knowledge of the world or the consequences of its actions. So some of the arguments about companies/governments skimping on safety don’t hold in the naive sense.
So, things like: how do you
Stop a robot jumping off something too high
Stop an AI DoSing its own network connection
Stop a robot disassembling itself
…when it is not vastly capable? Solving these things would give you a bunch of knowledge of safeguards and how to build them. I wrote about some of these problems here.
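To make that concrete, here is a minimal, hypothetical sketch of the kind of hand-coded safeguard layer those questions point at: a filter that sits between whatever policy the agent has learned and the actuators, and vetoes actions the designers already know are self-destructive. All names, actions, and thresholds below are illustrative assumptions, not taken from any particular robotics stack.

```python
from dataclasses import dataclass

@dataclass
class RobotState:
    height_below_m: float       # drop below the robot's current ledge, in metres
    own_requests_per_s: float   # traffic the robot is currently sending on its own link

MAX_SAFE_DROP_M = 0.3
MAX_SAFE_REQUESTS_PER_S = 50.0

def is_action_safe(state: RobotState, action: str) -> bool:
    """Veto actions the designers already know are self-destructive."""
    if action == "jump" and state.height_below_m > MAX_SAFE_DROP_M:
        return False   # don't jump off something too high
    if action == "poll_server" and state.own_requests_per_s > MAX_SAFE_REQUESTS_PER_S:
        return False   # don't DoS your own network connection
    if action == "detach_battery":
        return False   # don't disassemble yourself
    return True

def safe_step(state: RobotState, proposed: str, fallback: str = "wait") -> str:
    """Pass the learned policy's action through the safeguard, or fall back to something inert."""
    return proposed if is_action_safe(state, proposed) else fallback

# Usage: whatever policy the robot learned, the shield sits between it and the motors.
state = RobotState(height_below_m=2.0, own_requests_per_s=10.0)
print(safe_step(state, "jump"))   # -> "wait": the jump is vetoed
```

The point is just that building an AI capable enough to be useful already forces its developers to accumulate this kind of safeguard knowledge along the way.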
It is only when you expect a system to radically gain capability without needing any safeguards that it makes sense to expect a dangerous AI to be created by a team with no experience of safeguards or how to embed them.
One thing you can do to stop a robot from destroying itself is to give it more-or-less any RL reward function whatsoever, and get better and better at designing it to understand the world and itself and act in the service of getting that reward (because of instrumental convergence). For example, each time the robot destroys itself, you build a new one seeded with the old one’s memory, and tell it that its actions last time got a negative reward. Then it will learn not to do that in the future. Remember, an AGI doesn’t need a robot body; a prototype AGI that accidentally corrupts its own code can be recreated instantaneously for zero cost. Why then build safeguards?
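For illustration, here is a toy sketch of that scheme, with a made-up environment, made-up actions, and made-up reward numbers: each “robot lifetime” ends if it takes the self-destructive action, the next robot is seeded with the old one’s value estimates (its “memory”), and the fatal transition is recorded with a negative reward, so the learned values soon steer it away from destroying itself.

```python
import random
from collections import defaultdict

ACTIONS = ["work", "jump_off_ledge"]   # "jump_off_ledge" destroys the robot

def step(action: str):
    """Toy environment: working earns +1; jumping off the ledge ends the lifetime at -10."""
    if action == "jump_off_ledge":
        return -10.0, True   # reward, destroyed
    return 1.0, False

q = defaultdict(float)        # the "memory" carried between rebuilt robots
alpha, epsilon = 0.5, 0.1

for rebuild in range(200):    # each iteration is one robot lifetime
    for t in range(20):       # limited lifetime even if it stays safe
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[a])
        reward, destroyed = step(action)
        q[action] += alpha * (reward - q[action])   # simple bandit-style value update
        if destroyed:
            break             # robot destroyed; the next loop "rebuilds" it with q intact

print({a: round(v, 2) for a, v in q.items()})
# After a few rebuilds, the learned values steer the robot away from "jump_off_ledge".
```

Nothing in this loop required a safeguard to be designed in advance; the negative reward plus the carried-over memory does the work.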
Safeguards would be more likely if the AGI were, say, causing infrastructure damage while learning. I can definitely see someone removing its internet access after mishaps like that. That’s still not an adequate safeguard, though: once the AGI gets intelligent enough, it could hack or social-engineer its way through safeguards that were working before.
I think this scheme doesn’t quite catch the abulia trap (where the AGI discovers a way to administer reward directly to itself and then ceases to interact with the outside world). It’s not clear that the AI would learn the map/territory distinction and locate its goals in the territory (one way to avoid the trap), rather than just a prohibition against many sorts of self-modification or reward tampering (which avoids the trap only until it comes up with a clever new approach).
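A toy, purely illustrative way to see the worry (the action names and numbers are assumptions, not anyone’s proposal): if the agent’s goal is located in the reward signal itself rather than in the world, then any action that writes the signal directly dominates every action that works through the environment.

```python
# Hypothetical illustration: an agent that maximises the reward *signal* (the map)
# versus one that evaluates outcomes in the world (the territory).

signal_reward = {
    "do_useful_work": 1.0,        # reward delivered through the environment
    "tamper_with_reward": 100.0,  # writes the reward register directly
}
world_value = {
    "do_useful_work": 1.0,        # actually changes the world
    "tamper_with_reward": 0.0,    # changes nothing outside the agent
}

signal_maximiser = max(signal_reward, key=signal_reward.get)
territory_maximiser = max(world_value, key=world_value.get)

print(signal_maximiser)      # -> "tamper_with_reward": the abulia trap
print(territory_maximiser)   # -> "do_useful_work"

# A hard-coded ban on "tamper_with_reward" only helps until the agent finds a
# tampering route the designers didn't enumerate.
```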
I might be misunderstanding you, but I feel like this is sort of missing a key point. It seems like there could be situations in which the AI does indeed, as you point out, require “a bunch of safeguards to stop it destroying *itself*”, in order to advance to a high level of capabilities. These could be built by its engineers, or developed by the AI itself, perhaps through trial and error.
But that doesn’t seem to mean it would have safeguards to stop it destroying other things we value, or, in some more abstract sense, “destroying” our future potential (e.g., by colonising space and “wasting” those resources optimising for something we don’t care about, or barely care about, even if it doesn’t harm anything on Earth). It seems possible for an AI to get safeguards like not having its robotic manifestation jump off things too high or disassemble itself, and thereby be “safe enough” itself to become more capable, without having the sort of “safeguards” that e.g. Russell cares about.
Indeed, this seems to relate to the core point of ideas like instrumentally convergent subgoals and differential progress. We or the AI might get really good at building its capabilities and building safeguards that allow it to become more capable or avoid harm to itself or its own current “goals”, without necessarily getting good at building safeguards to protect “what we truly value”.
But here are two things you might have meant that would be consistent with what I’ve said:
It is only when you expect a system to radically gain capability without needing any safeguards to protect a particular thing that it makes sense to expect a dangerous AI to be created by a team with no experience of safeguards to protect that particular thing or how to embed them. This may inform LeCun’s views, if he’s focusing on safeguards for the AI’s own ability to operate in the world, since these will have to be developed in order for the AI to become more capable. But Russell may be focusing on the fact that a system really could radically gain capability without needing safeguards to protect what we value.
It is only when you expect a system to radically gain capability without needing any safeguards of any type that it makes sense to expect a dangerous AI to be created by a team with no experience of safeguards in general or how to embed them. Since AI designers will have to learn how to develop and embed some types of safeguard, they’re likely to pick up general skills for that, which could then also be useful for building safeguards to protect what we value.
If what you meant is the latter, then I don’t think I’m comfortable resting on the assumption that lessons from developing/embedding “capability safeguards” (so to speak) will transfer to a high degree to “safety safeguards”, though I haven’t looked into it a great deal. Is one of those things what you meant?
It is only when you expect a system to radically gain capability without needing any safeguards that it makes sense to expect a dangerous AI to be created by a team with no experience of safeguards or how to embed them.
That sounds right to me. Also worth noting that much of what parents do for the first few years of a child’s life is just trying to stop the child from killing/injuring themselves, when the child’s own understanding of the world isn’t sufficiently developed yet.