Olah’s comment indicates that this is indeed a good summary of his views.
I think the first three listed benefits are indeed good reasons to work on transparency/interpretability. I am intrigued but less convinced by the prospect of ‘microscope AI’.
The ‘catching problems with auditing’ section describes an ‘auditing game’ and suggests that progress in this game could demonstrate progress in using interpretability for alignment. It would be good to learn how much success the auditors have had in this game since the post was published.
One test of ‘microscope AI’: the go community has now had a couple of years in the computer era, during which open-source go programs stronger than AlphaGo have been released. This has indeed changed how humans think about go: seeing the corner variations that AIs tend to play has changed our views on which variations are good for which player, and seeing AI win probabilities conditioned on various moves, along with the AI-recommended continuations, has made it easier to review games. Yet, to my knowledge, no new go knowledge has been generated by looking at the internals of these systems, despite some visualization research having been done (https://arxiv.org/pdf/1901.02184.pdf, https://link.springer.com/chapter/10.1007/978-3-319-97304-3_20). As far as I’m aware, we do not even know whether these systems understand the combinatorial game theory of the late endgame, the one part of go that has been satisfactorily mathematized (and therefore unusually amenable to checking whether a program implements it). It’s not clear to me whether this is for lack of trying, but it does seem like a setting where microscope AI would be useful if it were promising.
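To give a flavour of why the late endgame is so checkable, here is a deliberately simplified sketch. In the toy model below (my illustration, not from the post), each remaining endgame area is an independent gote ‘switch’ worth a known point swing to whoever plays there first; for plain switches of this kind, greedy alternating play (always take the largest remaining swing) is optimal, and the resulting margin is an alternating sum. Real go endgames also involve sente, ko, and infinitesimals, which the full combinatorial game theory handles, but even this toy version shows the kind of precise prediction one could test a network’s internals against.

```python
# Toy model of the mathematized go endgame: independent areas, each a
# simple gote "switch" worth +p points to whoever plays there first.
# For plain switches, greedy play (take the largest remaining swing)
# is optimal, and the net margin is the alternating sum of the sorted
# swings. This is an illustrative simplification, not full CGT.

def greedy_endgame_margin(swings):
    """Net score for the first player under optimal (greedy) play."""
    margin = 0
    for i, p in enumerate(sorted(swings, reverse=True)):
        # First player takes moves 0, 2, 4, ...; second player the rest.
        margin += p if i % 2 == 0 else -p
    return margin

# Three areas worth 5, 3, and 2 points: first player nets 5 - 3 + 2 = 4.
print(greedy_endgame_margin([5, 3, 2]))  # → 4
```

Because the optimal move order and final margin are exactly computable here, one could in principle check whether a go engine’s internals encode anything like this calculation.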
The post mostly focuses on the benefits of transparency/interpretability for AI alignment. However, as far as I’m aware, the strongest argument against work in this direction has been, since before this post was published, the problem of tractability: can we actually succeed at understanding powerful AI systems?
The one argument the post gives is the claim that moderately-smarter-than-human AIs will have relatively crisp, human-like abstractions that humans will be able to understand. The evidence offered is that more powerful image classifiers seem to have closer-to-human concepts, but this doesn’t seem strong to me: these classifiers appear to use close-to-human concepts, but without a mechanistic understanding of how those ‘concepts’ are actually constructed in the network, it’s difficult to reason about them with confidence. Furthermore, we know that visualization techniques that search for images maximizing the activation of an internal neuron produce bizarre images that don’t look like anything, given a wide enough search space, suggesting that we in fact don’t understand what these neurons are representing.
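The activation-maximization idea can be sketched in a few lines. The toy ‘neuron’ below is just a fixed linear filter over a flattened 8x8 input (my stand-in for a real vision model, not anything from the post); gradient ascent on the input maximizes its activation. With no constraint on the search space, the optimized input grows without bound in the filter’s direction, giving an extreme, unnatural ‘image’; a crude L2 penalty (standing in for the natural-image priors used in real feature-visualization work) keeps it bounded. This is the dynamic behind the bizarre images mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "neuron": a fixed linear filter over a flattened 8x8 "image".
w = rng.normal(size=64)

def maximize_activation(steps=200, lr=0.1, l2=0.0):
    """Gradient ascent on the input to maximize the neuron's activation.

    With no regularization (l2=0) the input grows without bound in the
    direction of w: an extreme, unnatural "image". An L2 penalty (a crude
    stand-in for the image priors used in feature visualization) keeps
    the optimum bounded.
    """
    x = rng.normal(size=64) * 0.01
    for _ in range(steps):
        grad = w - 2 * l2 * x  # gradient of (w @ x - l2 * ||x||^2)
        x = x + lr * grad
    return x

unconstrained = maximize_activation(l2=0.0)
regularized = maximize_activation(l2=0.5)

# The unconstrained optimum has a far larger norm than the regularized one.
print(np.linalg.norm(unconstrained) > 10 * np.linalg.norm(regularized))
```

The point of the sketch: what the visualization shows depends heavily on how the search space is constrained, which is part of why such images are weak evidence about what a neuron ‘really’ represents.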
This relates to a problem with the argument: just because an ML system uses easy-to-understand concepts does not mean that those concepts will be easy to recover at the level of analysis that humans can access. For example, humans have concepts that are easy for humans to understand, but it is difficult for neuroscientists to recover the details of all of these concepts purely from neural wirings and chemical levels.
Since this post was published, Olah’s team has published work that does say something about the internals of AI systems (https://distill.pub/2020/circuits/), but this has involved manually inspecting individual weights and neurons, an approach that I do not expect to scale.
All in all, I do think there are promising research directions here, but I don’t think that there is enough publicly-available evidence for a reader of the post to be convinced of that without relying on deference to the researchers who are relatively optimistic about this work.
The post has some interesting thoughts about how to re-orient the field (or at least a sub-section of the field) around interpreting ML models.
I’m most intrigued by the argument that as state-of-the-art models get larger, this work will be comparatively more accessible to smaller groups.
I think Distill is an interesting step toward making ML interpretation more academically rewarding, but at present the effort-to-reward ratio still seems unfavourably high relative to other work enterprising ML researchers can pursue.
At any rate, I’m interested in this re-orientation, and am unsure whether there are promising ways to make it materialize. I hope that more work is done to make it happen, and that we will be able to see the results five or so years from now.
Follow-up work that I’d be interested to see:
- More academic research on extracting knowledge from the internals of ML systems.
- LW posts arguing for the tractability of transparency and interpretability research as AI systems scale.
- A more detailed post and model of what would succeed at re-orienting the field around interpreting powerful AI systems. Bonus points if the interventions only work if such a reorientation is actually a good idea.