I agree with the sentiment but want to disagree on a minor point: I think we need more galaxy-brained proofs, not fewer.
> But his research now (“heuristic arguments”) is roughly “trying to solve alignment via galaxy-brained math proofs.” As much as I respect and appreciate Paul, I’m really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.
Epistemically, there’s a solid chance the following argument is my brain working backwards from what I already believe.
To me, the relationship between capabilities research and alignment research looks fundamentally asymmetric. Success in deep learning can come from finding just one way to improve a system’s performance; successful alignment requires you to find all the ways things might go wrong.
I agree with your characterisation of deep learning. The field is famously akin to “alchemy”, with advances running far ahead of understanding. This works for capabilities research because:
1. You can afford to try a variety of experiments and get immediate feedback on whether something works or not.
2. You can afford to break things and start again.
3. You are in a position where performing well on a set of metrics is success in itself, even if those metrics might not be perfect.
Contrast this with alignment research. If you’re trying to solve alignment in 2023:
1. Experiments cannot tell you whether your alignment techniques will work for models built in the future under new paradigms.
2. By the time there are experiments on AGIs that pose an existential threat, there will be no room to break things and start again.
3. While there are metrics for how well a language model might appear to be aligned, there are (currently) no known metrics that, if met, guarantee an AGI is aligned.