Thanks for the post. I’ll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:
“We must grow interpretability and AI safety in the real world.”
Strong +1 to working on more real-world-relevant approaches to interpretability.
“Regulation is coming – let’s use it.”
Strong +1 as well. Incorporating interpretability into regulatory frameworks seems neglected in practice by the AI safety interpretability community. It is not the focus of current work on internal eval strategies, yet AI safety seems unlikely to have a once-and-for-all solution, so governance seems to matter a lot in the likely case of a future with highly prolific TAI. And because governance moves slowly, work now to establish concern, offices, precedent, case law, etc. seems uniquely valuable.
“Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets.”
I do not see the reasoning or motivation for this, and it seems possibly harmful.
First, developing basic insights is clearly not just an AI safety goal. It’s an alignment/capabilities goal, and as such the effects of this kind of work are not robustly good. They are heavy-tailed in both directions. This seems like possible safety washing. But to be fair, this is a critique I have of a ton of AI alignment work, including some of my own.
Second, I don’t know of any examples of gaining particularly useful domain knowledge from interpretability-related work in deep learning, other than maybe the predictiveness of non-robust features. Another possible example could be using deep learning to find new algorithms for things like matrix multiplication, but this isn’t really “interpretability”. Do you have other examples in mind? Progress over the last 6 years on reverse-engineering nontrivial systems has seemed tenuous at best.
So I’d be interested in hearing more about whether/how you expect this one type of work to be robustly good and what is meant by “Interpretability is the backbone of knowledge discovery with deep learning.”
Thanks for the comment! I’ll respond to the last part:
“First, developing basic insights is clearly not just an AI safety goal. It’s an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good.”
I think this could certainly be the case if we were trying to build state-of-the-art broad-domain systems in order to use interpretability tools with them for knowledge discovery – but we’re explicitly interested in using interpretability with narrow-domain systems.
“Interpretability is the backbone of knowledge discovery with deep learning”: Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren’t able to parse. If we can use interpretability to extract these patterns in a human-parsable way, then, in a (very Olah-ish) sense, we can reframe deep learning models as lenses through which to view the world and make sense of data that would otherwise be opaque to us.
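To make that “lens” framing a bit more concrete, here is a minimal sketch of the kind of workflow I have in mind (my illustration, not from the original post: it assumes scikit-learn, a synthetic dataset, and permutation importance standing in for a richer interpretability method):

```python
# Toy sketch of "model as lens": fit a model on data with hidden structure,
# then use a simple interpretability tool (permutation importance) to surface
# which inputs actually drive the learned predictions.
# Assumes scikit-learn; the dataset and feature names are invented for illustration.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "measurements": only three of the five variables matter,
# two of them only through an interaction.
n = 2000
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=n)
feature_names = ["dose", "temp", "ph", "noise_a", "noise_b"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Ask the fitted model which inputs its predictions actually depend on.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, mean_imp in sorted(zip(feature_names, result.importances_mean),
                             key=lambda t: -t[1]):
    print(f"{name:8s} {mean_imp:.3f}")
# The noise features should score near zero: the model, read through an
# interpretability tool, points us back at the structure in the data.
```

The point is just that the trained model is not the end product; queried through an interpretability tool, it becomes a way of reading structure out of the data.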
Here are a few examples:
https://www.mdpi.com/2072-6694/14/23/5957
https://www.deepmind.com/blog/exploring-the-beauty-of-pure-mathematics-in-novel-ways
https://www.nature.com/articles/s41598-021-90285-5
Are you concerned about AI risk from narrow systems of this kind?
Thanks.
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion came largely from my initial assumption that this specific point was directly about existential AI safety, and from the word choice of “backbone”, which I would not have used. I think we’re on the same page.