I agree current models sometimes trick their supervisors ~intentionally and it’s certainly easy to train/prompt them to do so.
I don’t think current models are deceptively aligned, and I think deceptive alignment would pose substantial additional risk.
I personally like Joe’s definition and it feels like a natural category in my head, but I can see why you don’t like it. You should consider tabooing the word “scheming” or saying something more specific, since many people use it to mean something narrower that differs from what you mean.
Yeah, that makes sense. I’ve noticed miscommunications around the word “scheming” a few times, so am in favor of tabooing it more. “Engage in deception for instrumental reasons” seems like an obvious extension that captures a lot of what I care about.