This post is among the most concrete, actionable, and valuable posts I read from 2021. Earlier this year, when I was trying to get a handle on the current state of AI, this post transformed my opinion of interpretability research from “man, this seems important, but it looks so daunting and I can’t imagine interpretability providing enough value in time” to “okay, I actually see a research framework I could expect to be scalable.”
I’m not a technical researcher, so I have trouble comparing this post to other conceptual work on alignment. But my impression, from seeing this concept discussed by groups of established AI Alignment researchers, is that this is at least a prima facie legitimate research paradigm.
I also really like the level of detail here. There are a number of obvious traps and confusions a reader might fall into when reading the high-level summary. But the post gives lots of concrete implementation details that look sound to me. (I recall that at one discussion, Eliezer had two objections to the core idea, and then Evan said “yup, I address that in the post,” and Eliezer was like “oh. Huh. That doesn’t usually happen to me.”)
One clarification I’ve heard Evan say when elaborating on the post:
An idea that seemed obvious to me and some others, but which at least wasn’t what Evan meant here, was “create short tournaments, the way you have Red Team / Blue Team games at, say, DefCon.” Evan said the point here was more of a framework that a long-term research team would use on the timescale of years. Upon reflection, I think this makes sense. In most short-term games, the Blue Team would just fail in a boring way, and the Red Team would be incentivized to do fairly boring things that don’t actually improve on the state of the art.