I am not saying that alignment is easy to solve, or that failing to solve it would not result in catastrophe. But all these arguments seem like universal arguments against any kind of solution at all, simply because every solution will eventually involve some sort of Godzilla. It is as if somebody were trying to make a plane that can fly safely and not fall from the sky, and somebody else kept repeating “well, if anything goes wrong in your safety scheme, then the plane will fall from the sky” or “I notice that your plane is going to fly in the sky, which means it can potentially fall from it”.
I am not saying that I have better ideas about checking whether any plan will work or not. They all inevitably involve Godzilla or the sky, and the slightest mistake might cost us our lives. But I don’t think that repeatedly pointing at the same scary thing, which will be present one way or another in every single plan, will get us anywhere.
I expect there are ways of dealing with Godzilla which are a lot less brittle.
If we had excellent, detailed knowledge of Godzilla’s internals and psychology, if we knew what sorts of things would drive Godzilla into a frenzy or slow him down or put him to sleep, if we knew how to get Godzilla to go in one direction rather than another, and if we knew when and how tests on small lizards would generalize to Godzilla… those would all be robustly useful things. If we had all those pieces plus more like them, it would start to look like a scenario where dealing with Godzilla is basically viable. There would be lots of fallback options, and many opportunities to recover from errors. It would not be a brittle situation which falls apart as soon as something goes wrong.
This seems to contradict what I interpreted as the message of your post; that message being that if someone gives you a “clever” strategy for dealing with Godzilla, the correct response is to just troll them, because Godzilla is inherently bad for property values. But what you’re saying now is that if the scheme to control Godzilla is clever in such and such ways, which you specifically warned against, then actually it might not be so brittle.
The key distinction is between clever methods for controlling something one does not understand, vs clever methods for controlling something one does understand. (The post didn’t go into that because it was short rather than thorough, but it did come up elsewhere in the comments.)
This suggests putting more weight on a plan to get AI research globally banned. I am skeptical that this will work (though if burning all GPUs would count as a pivotal act, the chances of success are significantly higher), but it seems very unlikely that there is a technical solution either.
In addition, at least some purported technical solutions to AI risk seem to meaningfully increase the risk to humanity. If someone is creating an AGI to exercise sufficient control over the world to execute a pivotal act, that enormously raises the stakes of being first, which incentivizes cutting corners. It also makes it more likely that the AGI will destroy humanity, and that it will do so more quickly.