Here’s a relevant statement that Rohin made (I think it’s from a few years ago, though, so it might be outdated):
> I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us.
Chris provided additional clarification of his views on interpretability here. TL;DR: Evan’s AGI safety post was just pointing to Chris’s intuitions at the time, formed while doing interpretability work on CNN vision models, and he’s not certain those views are correct or that they transfer to the LLM regime.
Just a small tangent with respect to:
> Chris provided additional clarification of his views on interpretability here. TL;DR: Evan’s AGI safety post was just pointing to Chris’s intuitions at the time, formed while doing interpretability work on CNN vision models, and he’s not certain those views are correct or that they transfer to the LLM regime.