The learned algorithms will not always be simple enough to be interpretable, but I agree we should try to interpret as much as we can. What we are ultimately trying to predict is the behavior of future, more powerful models. I think toy models can sometimes exhibit characteristics that are absent from current language models, yet those same characteristics may be integrated into, or emerge from, the more advanced systems we build.