I think there’s a real fork in the road between:
1. You replace parts of your neural network with natural language, optimize those parts to implement a good process, and then hope the outcome is good because the process is good.
2. You replace parts of your neural network with natural language, and then optimize that natural language to achieve good outcomes.
I think that #1 is safe but runs into significant capability limitations (roughly the same as debate/amplification). It may still be good enough to carry the day if things work out well or if people are willing to exercise a lot of restraint, and I'd like to see people doing it. I think that in this case it doesn't matter much how powerful the subsystems are, since each of them is doing something that you understand (though there are many subtleties and possible problems, e.g. with emergent bad behavior and some inner alignment problems).
I think that by default #2 is pretty dangerous. If you took this route I don't think it would be fair to call the bad/inscrutable behavior "emergent," or to call each step "interpretable"—the steps make sense, but by default it seems extremely likely that you don't understand why the process leads to good results. (If you did, you could have just taken path #1.) If there is bad behavior, it's not emergent; it's produced directly by gradient descent, and the fact that you can encode the intermediate activations in natural language doesn't really address the risk, since that information isn't necessarily functioning in the way you expect.
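To make the contrast concrete, here is a minimal toy sketch of the two training signals. Every name in it (propose_step, apply_step, outcome_reward) is a hypothetical placeholder standing in for an LLM pipeline, not a real API: in path #1 each intermediate step is scored against a process you already endorse, while in path #2 only the final outcome is scored and the intermediate text is whatever the optimizer finds useful.

```python
# Toy sketch, all names hypothetical placeholders for an LLM-based pipeline.

def propose_step(model_params, state):
    # Placeholder for "the model writes one natural-language step."
    return f"step proposed from state {state} with params {model_params}"

def apply_step(state, step):
    # Placeholder for acting on / continuing from that step.
    return state + 1

def outcome_reward(state):
    # Placeholder outcome score; in path #2 this is the only training signal.
    return float(state)

def path1_loss(model_params, task_state, endorsed_steps):
    """Path #1: supervise each step against a process humans understand."""
    state, loss = task_state, 0.0
    for endorsed in endorsed_steps:
        step = propose_step(model_params, state)
        loss += 0.0 if step == endorsed else 1.0  # push toward the endorsed step
        state = apply_step(state, endorsed)       # then follow the trusted process
    return loss

def path2_loss(model_params, task_state, horizon=3):
    """Path #2: optimize the natural-language intermediates for the outcome."""
    state = task_state
    for _ in range(horizon):
        step = propose_step(model_params, state)
        state = apply_step(state, step)  # intermediate steps are unconstrained
    # Only the end result is scored, so the optimizer is free to make the
    # intermediate "language" mean whatever makes this number large.
    return -outcome_reward(state)
```

The point of the sketch is only where the loss comes from: per-step agreement with a trusted process in path #1, versus end-to-end outcome score in path #2.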
I feel like different versions of path #2 sit on a spectrum between "fairly safe, like path #1" and "clearly, unworkably dangerous." I feel most comfortable basically starting from path #1 and then carefully adding in stuff you don't understand (e.g. systems solving small subtasks in ways you don't understand, or optimizing over only a small number of degrees of freedom within a space you understand reasonably well).
You could instead start with "very scary" and then try to add in controls (paraphrasing, tighter input bottlenecks, smaller pieces, etc.) to make it safe, but I find that approach pretty scary.