I might be misunderstanding you, but I feel like this is sort of missing a key point. It seems like there could be situations in which the AI does indeed, as you point out, require “a bunch of safeguards to stop it destroying *itself*”, in order to advance to a high level of capabilities. These could be built by its engineers, or developed by the AI itself, perhaps through trial and error.
But that doesn’t seem to mean it’d have safeguards against destroying other things we value, or against “destroying” our future potential in some more abstract sense (e.g., by colonising space and “wasting” the resources optimising for something we don’t/barely care about, even if it doesn’t harm anything on Earth). It seems possible for an AI to get safeguards like not having its robotic manifestation jump off things that are too high or disassemble itself, and thereby be “safe enough” itself to become more capable, yet lack the sort of “safeguards” that e.g. Russell cares about.
Indeed, this seems related to the core point of ideas like instrumentally convergent subgoals and differential progress. We or the AI might get really good at building its capabilities and building safeguards that allow it to become more capable or avoid harm to itself or its own current “goals”, without necessarily getting good at building safeguards to protect “what we truly value”.
But here are two things you might have meant that would be consistent with what I’ve said:
It is only when you expect a system to radically gain capability without needing any safeguards to protect a particular thing that it makes sense to expect a dangerous AI to be created by a team with no experience of safeguards to protect that particular thing, or of how to embed them. This may inform LeCun’s views, if he’s focusing on safeguards for the AI’s own ability to operate in the world, since those will have to be developed in order for the AI to become more capable. But Russell may be focusing on the fact that a system really could radically gain capability without needing safeguards to protect what we value.
It is only when you expect a system to radically gain capability without needing any safeguards of any type that it makes sense to expect a dangerous AI to be created by a team with no experience of safeguards in general or of how to embed them. Since AI designers will have to learn how to develop and embed some types of safeguard, they’re likely to pick up general skills for that, which could then also be useful for building safeguards to protect what we value.
If what you meant is the latter, then I don’t think I’m comfortable resting on the assumption that lessons from developing/embedding “capability safeguards” (so to speak) will transfer to a high degree to “safety safeguards”, though I haven’t looked into it a great deal.
Is one of those things what you meant?