I like your reservations (which I think are well-written, sensible, and comprehensive) more than your optimism. The wireheading example is a good illustration of a place where we can find something coarse-grained in the brain that almost matches a natural human concept, but not quite, and the caveats there would be disastrous to ignore if trying to put load-bearing trust in interpreting superhuman AI.
The point of the wireheading example is that, in order for investigators in the 1960s to succeed that much, the brain must be MUCH more interpretable than current artificial neural networks. We should be able to make networks that are even more interpretable than the brain, and we should have much better interpretability techniques than neuroscientists from the 1960s.
My argument isn’t “we can effectively adapt neuroscience interpretability to current ML systems”. It’s “the brain’s high-level interpretability suggests we can greatly improve the interpretability of current ML systems”.
Sure. In fact, if you just want to make ML models as interpretable as humans, almost all of what I labeled “reservations” are in fact reasons for optimism. My take is colored by thinking about using this for AI alignment, which is a pretty hard problem. (To restate: you’re right, and the rest of this comment is about why I’m still pessimistic about a related thing.)
It’s kinda like realizability vs. non-realizability in ML. If you have 10 humans and 1 is aligned with your values and 9 are bad in normal human ways, you can probably use human-interpretability tools to figure out who is who (give them the Voight-Kampff test). If you have 10 humans and they’re all unaligned, the feedback from interpretability tools will only sometimes help you modify their brains so that they’re aligned with you, even if we imagine a neurosurgeon who could move individual neurons around. And if you try, I expect you’d end up with humans that are all bad but in weird ways.
What this means for AI alignment depends on the story you imagine us acting out. In my mainline story, we’re not going to be in a situation where we need to build AIs that have some moderate chance of turning out to be aligned (and then need to interpret their motivations to tell). We’re either going to be building AIs with a high chance of alignment or a very low chance, and we don’t get much of a mulligan.