Yep, I think that’s a correct summary of the final point.
The main counterpoint that comes to mind is a possible world where “opaque AIs” just can’t ever achieve general intelligence, but moderately well-thought-out AI designs can bridge the gap to “general intelligence/agency” without being reliable enough to be aligned.
Well, we know it’s possible to achieve general intelligence via dumb black box search—evolution did it—and we’ve got lots of evidence for current black box approaches being quite powerful. So it seems unlikely to me that we “just can’t ever achieve general intelligence” with black box approaches, though it could be that doing so is much more difficult than if you have more of an understanding.
Also, ease of aligning a particular AI design is a relative property, not an absolute one. When you say transparent approaches might not be “reliable enough to be aligned” you could mean that they’ll be just as likely as black box approaches to be aligned, less likely, or that they won’t be able to meet some benchmark threshold probability of safety. I would guess that transparency will increase the probability of alignment relative to not having it, though I would say that it’s currently unclear by how much.
The way I generally like to think about this is that there are many possible roads we can take to get to AGI, with some being more alignable and some less, and some being shorter and some longer. Then, the argument here is that transparency research opens up additional avenues which are more alignable, but which may be shorter or longer. Even if they’re shorter, however, they’re more alignable, so the idea is that even if we end up taking the fastest path without regard to safety, making the fastest path available to us a safer one is a win.
One thing I’d add, in addition to Evan’s comments, is that the present ML paradigm and Neural Architecture Search are formidable competitors. It feels like there’s a big gap in effectiveness, where we’d need to make lots of progress for “principled model design” to be competitive with them in a serious way. The gap causes me to believe that we’ll have (and already have had) significant returns on interpretability before we see capabilities acceleration. If it felt like interpretability was accelerating capabilities on the present margin, I’d be a bit more cautious about this type of argumentation.
(To date, I think the best candidate for a capabilities success case from this approach is Deconvolution and Checkerboard Artifacts. I think it’s striking that the success was less about improving a traditional benchmark, and more about getting models to do what we intend.)
What if we think about it the following way? ML researchers range from _theorists_ (who try to produce theories that describe how ML/AI/intelligence works at the deep level and how to build it) to _experimenters_ (who put things together using some theory and lots of trial and error and try to make it perform well on the benchmarks). Most people will be somewhere in between on this spectrum but people focusing on interpretability will be further towards theorists than most of the field.
Now let’s say we boost the theorists and they produce a lot of explanations that make better sense of the state of the art that experimenters have been playing with. The immediate impact of this will be improved understanding of our best models, and this is good for safety. However, when the experimenters read these papers, their search space (of architectures, hyperparameters, training regimes, etc.) is reduced and they are now able to search more efficiently. Standing on the shoulders of the new theories, they produce even better-performing models (though they still incorporate a lot of trial and error, because this is what experimenters do).
So what we have achieved is a better understanding of the current state-of-the-art models, combined with a new, improved state of the art that we still don’t quite understand. It’s not immediately clear whether we’re better off this way. Or is this model too coarse to see what’s going on?