I found this post extremely valuable. This is a novel approach to safety research which I hadn’t come across before, and which I likely would not have come across without Evan putting in the effort to write this post (and Chris putting in the work to come up with the ideas!).
I personally find interpretability to be a fascinating problem that I might want to research someday. This post updated me a lot towards thinking that it’s a valuable and important problem for achieving alignment.
Further, I am very excited to see more posts like this in general—I think it’s extremely good and healthy to bring in more perspectives on the alignment problem, and different paths to success.