My favorite for AI researchers is Ajeya’s “Without specific countermeasures”, because I think it does a really good job of being concrete about a training setup leading to deceptive alignment. It is also sufficiently non-technical that a motivated person not familiar with AI could understand the key points.
Forgot to include this. It’s sort of a more opinionated and ML-focused version of Carlsmith’s report and has a corresponding video/talk (as does Carlsmith’s report).