I tentatively think that it’s good to distinguish at least between the following three model classes: active planners, sleeper agents and opportunists.
Looking back at this, I find this categorization pretty confusing, and not quite carving the model space at the right places. I discuss what I now think is a better frame here.
Briefly, the idea is to distinguish models by how often they are thinking about their plans against you. In this post’s terminology, this basically draws a line between opportunists on one side and active planners and sleeper agents on the other. The latter two roughly correspond to different strategies, namely active sabotage vs. “lie and wait”.
But I think the descriptions I gave of “active planners” etc. in the post are overly specific and may be sneaking in false inferences. So I think those labels could be harmful, and I’ve mostly stopped using those terms in my own thinking.