Will this be an alignment effort that can work, an alignment effort that cannot possibly work, an initial doomed effort capable of pivoting, or a capabilities push in alignment clothing?
As written, and assuming the description is accurate and means what I think it means, this seems like a very substantial advance in marginal safety. I’d say it is aimed at a difficulty level of 5 to 7 on my table:
https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty#Table
That is, experimentation on dangerous systems and interpretability play some role, but the main thrust is automating alignment research and oversight, so maybe I’d unscientifically call it a 6.5. That is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.
As ever, though, if it doesn’t work, it doesn’t merely not work; it makes it look like the problem has been solved while advancing capabilities.