It seems like recursive schemes can potentially scale arbitrarily far (at least up to the analog of “NEXP”, and probably farther); they are mostly just limited by the capability of the AI assistants / debaters / etc. So it’s kind of hard to distinguish mere capability costs from bounds on the ultimate capability.
We could exclude that kind of thing because we have no idea what the bound is (or because humans just never discover some facts that gradient descent discovers, or discover them in a way that runs into the same problem). I think in that case the problem is still open. For example, finding a solution that definitely runs in 2^n times more time than the unaligned benchmark looks hard; I’d guess it’s roughly as hard as finding a solution that definitely runs in 10 times more time than the unaligned benchmark.
The main reasons we decided not to emphasize this, and to focus as much as we do on competitiveness issues, are (i) the “do science” options do feel like they work if you have enough compute, and it seems like you need to emphasize the competitiveness issue to explain why we don’t like them (or else get into increasingly weird counterexamples); (ii) in the worst case we don’t expect a very slow solution to be much easier than a very fast solution, since most realistic kinds of slowdown can get arbitrarily bad in the worst case, and the plausible approaches we are aware of all seem pretty likely to be roughly competitive. So de-emphasizing competitiveness seems likely to send people off down weirder alleys (which would be good for someone to go down if lots of folks were working on the problem, but probably aren’t where you should start).
There’s a direction (which I imagine you and others have considered) where you replace some activations within your AI with natural language, so that, e.g., certain layers can (at least heuristically) only communicate with the next layer in natural language (NL).
Then you heavily regularize in various ways. You’d require the language to be fully understandable and transparent, perhaps requiring that counterfactual changes to inputs lead to sensible changes to outputs within subsystems, etc. You’d have humans verify that the language was relevant, meaningful, and concise; train AIs to do this verification at larger scale; do some adversarial training; etc. You could also train sub-human-level AIs to paraphrase the language that’s used and restate it between layers, to make it really hard for the whole system to ever pass hidden coded messages.
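To make the shape of that pipeline concrete, here is a minimal sketch in Python. Everything in it is an illustrative stand-in rather than anything proposed above: `Subsystem`, `paraphrase`, and `verify` are hypothetical placeholders for trained models, and the point is just that every hop between subsystems is forced through a paraphrased, verified natural-language message.

```python
from dataclasses import dataclass
from typing import Callable, List

Message = str  # every cross-layer signal is a natural-language string


@dataclass
class Subsystem:
    """A component that may only read and emit natural-language messages."""
    name: str
    run: Callable[[List[Message]], Message]


def paraphrase(msg: Message) -> Message:
    # Stand-in for a sub-human-level paraphraser that restates each message,
    # making it harder to smuggle hidden coded content across layers.
    return msg  # identity here; a real system would call a separate model


def verify(msg: Message) -> bool:
    # Stand-in for human / AI verification that a message is relevant,
    # meaningful, and concise (with adversarial training layered on top).
    return 0 < len(msg) < 2000


def forward(subsystems: List[Subsystem], task: Message) -> Message:
    """Force every layer-to-layer hop through NL, paraphrasing, and checks."""
    message = task
    for sub in subsystems:
        message = paraphrase(sub.run([message]))
        if not verify(message):
            raise ValueError(f"{sub.name} produced an unacceptable message")
    return message


# Toy usage: two trivial "layers" that just restate the incoming message.
echo = Subsystem("echo", lambda msgs: f"restating: {msgs[-1]}")
print(forward([echo, echo], "summarize the input document"))
```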
This seems like it lives under a slogan like “enforce interpretability at any cost”. It would almost certainly incur a big efficiency/capabilities hit; maybe an enormous one. Though it actually seems plausible that the hit would be much smaller for extremely capable systems than for the AI models of today.
A crucial question will then be “how powerful are the subsystems that talk to each other via natural language allowed to get?”, where in the most conservative limit each subsystem is human-level or even significantly below, and in the riskiest limit you just have a single NL layer that cuts the system in half.
There’s a worry along the lines of “maybe the whole system is so big and complex that it has emergent, inscrutable bad behavior even though every step is interpretable and makes sense.” Or, in the same vein, “the answers to the simple big-picture questions we care about don’t live anywhere specific, so this doesn’t help us ensure the model can transparently address them, even if its operation itself can be broken down into transparent pieces.” That said, I think we’re in a better position wrt these issues, since we can now talk about training models that automate the extraction of big-picture information from the NL activations in this giant beast.
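As a sketch of what that automated extraction could look like, assuming the same NL-bottlenecked pipeline as above (`extractor_model` is a hypothetical stand-in for a trained question-answering model, not a real API):

```python
from typing import Callable, List


def extract_big_picture(
    transcript: List[str],                   # all NL messages recorded during a run
    question: str,                           # a high-level question we care about
    extractor_model: Callable[[str], str],   # hypothetical QA model over text
) -> str:
    """Pose a big-picture question against the recorded NL activations."""
    context = "\n".join(f"[step {i}] {m}" for i, m in enumerate(transcript))
    return extractor_model(f"{context}\n\nQuestion: {question}\nAnswer:")
```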
I think there’s a real fork in the road between:
1. You replace parts of your neural network with natural language, optimize those parts to implement a good process, and then hope the outcome is good because the process is good.
2. You replace parts of your neural network with natural language, and then optimize that natural language to achieve good outcomes. (A toy contrast between the two is sketched below.)
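Here is that toy contrast in code; the scoring functions are hypothetical placeholders, not anything proposed above. In path #1 each natural-language step is supervised against a process you already understand and endorse, while in path #2 only the final outcome is scored and gradient descent is free to repurpose the intermediate language.

```python
from typing import Callable, List

Step = str  # each intermediate "activation" is a natural-language step


def loss_path_1(steps: List[Step],
                endorsed_steps: List[Step],
                step_distance: Callable[[Step, Step], float]) -> float:
    """Path #1: supervise every NL step against a decomposition you understand
    and endorse; you trust the outcome because you trust the process."""
    return sum(step_distance(s, e) for s, e in zip(steps, endorsed_steps))


def loss_path_2(steps: List[Step],
                outcome_score: Callable[[List[Step]], float]) -> float:
    """Path #2: score only the final outcome; the intermediate language is
    optimized end-to-end and need not function the way it reads."""
    return -outcome_score(steps)
```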
I think that #1 is safe but runs into significant capability limitations (roughly the same as debate/amplification). It may still be good enough to carry the day if things work out well or if people are willing to exercise a lot of restraint, and I’d like to see people doing it. I think that in this case it doesn’t matter that much how powerful the subsystems are, since each of them is doing something that you understand (though there are many subtleties and possible problems, e.g. with emergent bad behavior and some inner alignment problems).
I think that by default #2 is pretty dangerous. If you took this route, I don’t think it would be fair to call the bad/inscrutable behavior “emergent,” or to call each step “interpretable”: the steps make sense, but by default it seems extremely likely that you don’t understand why the process leads to good results. (If you did, you could have just taken path #1.) If there is bad behavior, it’s not emergent; it’s produced directly by gradient descent, and the fact that you can encode the intermediate activations in natural language doesn’t really address the risk, since that information isn’t necessarily functioning in the way you expect.
I feel like different versions of path #2 sit on a spectrum between “fairly safe, like path #1” and “clearly, unworkably dangerous.” I feel most comfortable basically starting from path #1 and then carefully adding in stuff you don’t understand (e.g. systems solving small subtasks in ways you don’t understand, or optimizing only a small number of degrees of freedom within a space you understand reasonably well).
You could instead start with “very scary” and then try to add in controls (like paraphrasing, tighter input bottlenecks, smaller pieces, etc.) to make it safe, but I find that approach pretty scary.