I think this is really exciting and I’m very interested to see how it goes. I think the current set of problems and methodologies is solid enough that participants have a reasonable shot at making meaningful progress within a month. I also expect this to be a useful way to learn about language models and to generally be in a better position to think about alignment.
I think we’re still a long way from understanding model behavior well enough that we could e.g. rule out deceptive alignment, but it feels to me like recent work on LM interpretability is making real progress towards that goal, and I can imagine having large teams studying frontier models closely enough to robustly notice deceptive alignment well before we have transformative AI.