Here’s a relevant statement that Rohin made (I think it’s from a few years ago, though, so it might be outdated):
> I would guess that AI systems will become more interpretable in the future, as they start using the features / concepts / abstractions that humans are using. Eventually, sufficiently intelligent AI systems will probably find even better concepts that are alien to us, but if we only consider AI systems that are (say) 10x more intelligent than us, they will probably still be using human-understandable concepts. This should make alignment and oversight of these systems significantly easier. For significantly stronger systems, we should be delegating the problem to the AI systems that are 10x more intelligent than us.
Chris provided additional clarification of his views on interpretability here. TL;DR: Evan’s AGI safety post was just pointing to Chris’s intuitions at the time, formed while doing interpretability work on CNN vision models, and he’s not certain those views are correct or that they transfer to the LLM regime.
Just a small tangent with respect to:
> Chris provided additional clarification of his views on interpretability here. TL;DR: Evan’s AGI safety post was just pointing to Chris’s intuitions at the time, formed while doing interpretability work on CNN vision models, and he’s not certain those views are correct or that they transfer to the LLM regime.