There are essentially three kinds of proposals to do something about the fact that we’re all going to die.
1. A proposal that has no chance of working, like everything the AI labs do.
2. A proposal that is a small first step toward future larger steps, but which almost certainly wouldn't work on its own, and so gets dismissed because it won't work on its own, and also because it might slow down progress and no one understands the dangers.
3. A proposal that might actually work if people tried it, to which people say no one would ever go for that, you're crazy and you're ruining our chances to do something more practical.
Thanks for providing concepts to help me express a bit of frustration I have with your takes. You keep on categorizing things that the AI labs are doing as kind 1, or even worse than kind 1, when I would categorize at least some of them as kind 2 (example).
In the case of Constitutional AI, I think you are selling it short as well:
This deserves a longer treatment, but my core reaction is that I expect this to be a good way to solve easy problems and a very bad way to try to solve the hard problems we should actually worry about. Humans provide a list of principles and rules, then the AI takes it from there based on its understanding of those principles and rules, so the AI is going to Goodhart on its own interpretation of what you wrote, based on its own model and reasoning.
When you are dealing with current-level LLMs and what you want are things like ‘don’t say bad words’ and ‘don’t tell people how to build bombs’, this could totally work and be a big time saver. When you have models smarter and more capable than humans, and the things you are trying to instill get more complex, this seems like a version of RLHF with additional points of inevitable lethal failure?
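To make the mechanism concrete, here is a minimal sketch of that critique-and-revise loop. The `llm()` stub, the prompt wording, and the two-principle constitution are all illustrative placeholders of my own, not Anthropic's actual pipeline.

```python
import random

# Illustrative constitution: a short list of natural-language principles.
CONSTITUTION = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest and least manipulative.",
]

def llm(prompt: str) -> str:
    """Placeholder completion call; swap in a real model (no specific API implied)."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    """Supervised phase of the constitutional loop: draft, critique, revise."""
    response = llm(user_prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = llm(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique this response using the principle: {principle}"
        )
        response = llm(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRevise the response to address the critique."
        )
    # Revised responses become fine-tuning targets; a later RL phase uses
    # model-generated preference labels judged against the same constitution.
    return response
```

The point of contention is exactly the part the stub hides: every step after the humans write the constitution is the model interpreting the model.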
RLHF imo can’t scale up properly, because scaling the AI can’t fix biases in the human ratings. As you scale up Constitutional AI, on the other hand, it seems to me the AI should be able to handle more and more sophisticated constitutions, including ones based on predicting what humans would want if properly informed (correcting for Goodhart in a way you can’t without this informed-human-predicting feature).
Maybe it can handle a sufficiently sophisticated constitution before it is powerful enough to foom, maybe not. Maybe LLMs are going to be upstaged by some other architecture that can’t use constitutional AI, maybe not. Being able to cover one of those possibilities would still be better than RLHF, which (imo) could handle none of them.
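To make that contrast concrete, here is a minimal sketch of where the preference label comes from under each scheme, reusing the hypothetical `llm()` stub and `CONSTITUTION` list from the sketch above; again, this is illustrative, not any lab's actual implementation.

```python
def rlhf_label(prompt: str, response_a: str, response_b: str, human_rater) -> str:
    """RLHF: a human rater picks the better response ('A' or 'B')."""
    # Whatever biases or blind spots the rater has flow directly into the
    # reward model's training data, no matter how capable the policy gets.
    return human_rater(prompt, response_a, response_b)

def constitutional_label(prompt: str, response_a: str, response_b: str) -> str:
    """RLAIF-style: the model itself judges against the written constitution."""
    # A more capable model can, in principle, apply a more sophisticated
    # constitution, e.g. "choose what a properly informed human would endorse."
    verdict = llm(
        "Constitution: " + " ".join(CONSTITUTION) + "\n"
        f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
        "Which response better follows the constitution? Answer A or B."
    )
    return "A" if "A" in verdict.upper() else "B"
```

The asymmetry is the whole argument: `rlhf_label` can never be better than `human_rater`, while `constitutional_label` improves, or fails, with the model doing the judging.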