The second worry is, I guess, a variant of the first: that we’ll use intent-aligned AI very foolishly. An example would be issuing a command like “follow the laws of the nation you originated in but otherwise do whatever you like.” A key consideration in both cases is whether there’s an adequate level of corrigibility.
I’d flag that I suspect we really should have AI systems forecasting the future and the results of possible requests.
So if people made a broad request like “follow the laws of the nation you originated in but otherwise do whatever you like,” they should see forecasts of what that would lead to. If there are any clearly problematic outcomes, those should be apparent early on.
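To make this concrete, here’s a minimal sketch of what such a forecast-then-act check might look like. Everything in it (the Forecast type, the stub forecaster, the thresholds) is invented for illustration, not a real API; the point is only the shape of the loop: forecast first, surface any clearly problematic outcomes, and only then act.

```python
# Minimal sketch of a "forecast gate" for broad requests, assuming a
# hypothetical forecasting system. All names and thresholds here are
# invented for illustration.
from dataclasses import dataclass


@dataclass
class Forecast:
    outcome: str        # short description of a predicted consequence
    probability: float  # estimated likelihood, in [0, 1]
    severity: int       # 0 = benign ... 5 = catastrophic


def forecast_outcomes(request: str) -> list[Forecast]:
    """Stand-in for the forecasting system. A real version would condition
    a predictive model on the request; here we return a fixed toy example."""
    return [
        Forecast("AI exploits legal loopholes in its origin nation", 0.30, 4),
        Forecast("Request is refined after warnings and proceeds safely", 0.60, 1),
    ]


def gate_request(request: str, severity_limit: int = 3) -> bool:
    """Surface clearly problematic forecasts before execution; return True
    only if the request passes without serious warnings."""
    problematic = [f for f in forecast_outcomes(request)
                   if f.severity >= severity_limit and f.probability > 0.01]
    for f in problematic:
        print(f"WARNING (p={f.probability:.2f}, severity={f.severity}): {f.outcome}")
    return not problematic


if __name__ == "__main__":
    ok = gate_request("follow the laws of the nation you originated in "
                      "but otherwise do whatever you like")
    print("proceed" if ok else "blocked pending human review")
```

The design choice worth noting is that the gate only surfaces warnings; whether they get heeded is exactly the question at issue below.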
Messing this up seems like it would require either very dumb humans or a straightforward alignment mistake-risk failure.
I think “very dumb humans” is what we have to work with. Remember, it only requires a small number of imperfectly aligned humans to ignore the warnings (or, indeed, to welcome the world the warnings describe).
In many worlds, if we have a bunch of decently smart humans around, they would know which specific situations “very dumb humans” would mess up, and take the corresponding preventative measures.
A world where many small pockets of “very dumb humans” could cause an existential catastrophe is clearly an incredibly fragile and dangerous one, enough so that I assume reasonable actors would freak out until it stops being so fragile and dangerous. I think we see this in other areas, like cyber attacks, where reasonable people prevent small clusters of actors from causing catastrophic damage.
It’s possible that the offense/defense balance would dramatically favor tiny groups of dumb actors, and I assume that this is what you and others expect, but I don’t see it yet.
How do you propose that reasonable actors prevent reality from being fragile and dangerous?
Cyber attacks generally exploit poor protocols. Over time, smart reasonable people can convince less smart reasonable people to follow better ones. Can reasonable people convince reality to follow better protocols?
As soon as you get into proposing solutions to this sort of problem, they start to look a lot less reasonable by current standards.