This series of posts attempts to answer the following question from Holden Karnofsky's Important, actionable research questions for the most important century (which also inspired the name of this sequence):
"What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?"
As one answer to Holden's question, I explore the argument that interpretability research is one such high-leverage activity in the AI alignment research space.
Featured image created with DALL·E.