Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced “deception” with “deceptive alignment” in both posts. Thanks for pointing that out!
I’m intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven’t thought about them nearly as much, and I don’t yet have a strong intuition for how likely they are, so I’m choosing to stay focused on deceptive alignment for this sequence.