So you need a pretty strong argument that interp in particular is good for capabilities, which isn’t borne out empirically and also doesn’t seem that strong.
I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large capabilities externalities. This is the level of interpretability I aim for, and the level I think we need to make any serious progress on alignment.
If your interpretability tools cannot do things even remotely like this, I expect they are quite safe. But then I also don’t think they help much at all with alignment. What I’m saying is that there’s a roughly proportional relationship between your understanding of the network and your ability both to align it and to make it more capable. I doubt there are many deep insights to be had that further alignment without also furthering capabilities. Maybe some insights further one a bit more than the other, but I doubt you’d be able to figure out which ones those are in advance. Often, I expect you’d only know years after the insight has been published and the field has figured out all of what can be done with it.
I think it’s all one tech tree, is what I’m saying. I don’t think neural network theory neatly decomposes into a “make strong AGI architecture” branch and an “aim AGI optimisation at a specific target” branch. Just like quantum mechanics doesn’t neatly decompose into a “make a nuclear bomb” branch and a “make a nuclear reactor” branch. In fact, in the case of NNs, I expect aiming strong optimisation is probably just straight up harder than creating strong optimisation.
By default, I think if anyone succeeds at solving alignment, they probably figured out most of what goes into making strong AGI along the way. Even just by accident. Because it’s lower in the tech tree.