Sure. In fact, if you just want to make ML models as interpretable as humans, almost all of what I labeled “reservations” are actually reasons for optimism. My take is colored by thinking about using this for AI alignment, which is a pretty hard problem. (To restate: you’re right, and the rest of this comment is about why I’m still pessimistic about a related thing.)
It’s kinda like realizability vs. non-realizability in ML. If you have 10 humans and 1 is aligned with your values and 9 are bad in normal human ways, you can probably use human-interpretability tools to figure out who is who (give them the Voight-Kampff test). If you have 10 humans and they’re all unaligned, the feedback from interpretability tools will only sometimes help you modify their brains so that they’re aligned with you, even if we imagine a neurosurgeon who could move individual neurons around. And if you try, I expect you’d end up with humans that are all bad but in weird ways.
What this means for AI alignment depends on the story you imagine us acting out. In my mainline story we’re not going to be in a situation where we need to build AIs that have some moderate chance of turning out to be aligned (and we then need to interpret their motivations to tell). We’re either going to be building AIs with a high chance of alignment or a very low chance, and we don’t get much of a mulligan.