I recently made an inside-view argument that deceptive alignment is unlikely. It doesn’t cover other failure modes, but it makes detailed arguments against a core AI x-risk story. I’d love to hear what you think of it!
If “you” is referring to me, I’m not an alignment researcher; my knowledge of the field comes just from reading random LessWrong articles once in a while, so I’m not in a position to evaluate it, sorry.