Thanks for the comment! I’ll respond to the last part:
“First, developing basic insights is clearly not just an AI safety goal. It’s an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good.”
I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we’re explicitly interested in using interpretability with narrow domain systems.
“Interpretability is the backbone of knowledge discovery with deep learning”: Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren’t able to parse. If we can use interpretability to extract these patterns in a human-parsable way, in a (very Olah-ish) sense we can reframe deep learning models as lenses through which to view the world, and to make sense of data that would otherwise be opaque to us.
Are you concerned about AI risk from narrow systems of this kind?
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of “backbone” which I would not have used. I think we’re on the same page.
Thanks for the comment! I’ll respond to the last part:
“First, developing basic insights is clearly not just an AI safety goal. It’s an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good.”
I think this could certainly be the case if we were trying to build state of the art broad domain systems, in order to use interpretability tools with them for knowledge discovery – but we’re explicitly interested in using interpretability with narrow domain systems.
“Interpretability is the backbone of knowledge discovery with deep learning”: Deep learning models are really good at learning complex patterns and correlations in huge datasets that humans aren’t able to parse. If we can use interpretability to extract these patterns in a human-parsable way, in a (very Olah-ish) sense we can reframe deep learning models as lenses through which to view the world, and to make sense of data that would otherwise be opaque to us.
Here are a couple of examples:
https://www.mdpi.com/2072-6694/14/23/5957
https://www.deepmind.com/blog/exploring-the-beauty-of-pure-mathematics-in-novel-ways
https://www.nature.com/articles/s41598-021-90285-5
Are you concerned about AI risk from narrow systems of this kind?
Thanks.
No. Am I concerned about risks from methods that work for this in narrow AI? Maybe.
This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of “backbone” which I would not have used. I think we’re on the same page.