tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.
I think we could be very close to AGI, and I think it’s important that whoever makes AGI thinks carefully about what properties to target in trying to create a system that is both useful and maximally likely to be safe.
It seems to me that right now, Anthropic is targeting something that resembles a slightly modified, more harmless version of human values (maybe a CEV-like thing). However, some alignment targets may be easier to hit than others. It may turn out that it is hard to instill a CEV-like thing into an AGI, while it is easier to ensure properties like corrigibility or truthfulness.
One intuition for why this may be true: if you took OpenAI’s weak-to-strong generalization setup and tried eliciting capabilities corresponding to different alignment targets (standard reward modeling is probably a reasonable analogy for the current Anthropic plan, but one could also try truthfulness or corrigibility), I suspect you would find that a capability like ‘truthfulness’ is more natural than reward modeling and can be elicited more easily. Truth may also have lower algorithmic complexity than other targets.
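To make that concrete, here is a toy sketch of the kind of comparison I have in mind. This is not OpenAI’s actual weak-to-strong setup, and none of it comes from Anthropic: the “weak supervisor” is simulated by flipping a fraction of the true labels, and the two synthetic targets are hypothetical stand-ins for a low-complexity concept (something like truthfulness) versus a messier aggregate (something like a reward model). It only illustrates the shape of the experiment and the performance-gap-recovered (PGR) metric, not any real result.

```python
# Toy sketch only: a simplified stand-in for weak-to-strong elicitation.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def performance_gap_recovered(X, y, noise=0.3):
    """How much of the weak-to-strong gap a strong model closes when it is
    trained only on a weak supervisor's (noisy) labels for target y."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    flip = rng.random(len(y_tr)) < noise              # simulate a weak, error-prone supervisor
    y_weak = np.where(flip, 1 - y_tr, y_tr)

    make_strong = lambda: MLPClassifier(hidden_layer_sizes=(64, 64),
                                        early_stopping=True, random_state=0)
    ceiling = make_strong().fit(X_tr, y_tr)           # strong model with ground-truth labels
    student = make_strong().fit(X_tr, y_weak)         # strong model with weak labels only

    weak_acc = 1 - noise                              # supervisor accuracy, by construction
    return (student.score(X_te, y_te) - weak_acc) / (ceiling.score(X_te, y_te) - weak_acc)

# Hypothetical stand-in targets over the same synthetic features.
X = rng.normal(size=(6000, 20))
y_simple = (X[:, 0] > 0).astype(int)                  # low-complexity concept
w = rng.normal(size=20)
y_messy = ((np.tanh(X @ w) + X[:, 3] * X[:, 7]        # messy aggregate of many factors
            + 0.5 * rng.normal(size=6000)) > 0).astype(int)

print("PGR, simple target:", round(performance_gap_recovered(X, y_simple), 2))
print("PGR, messy target: ", round(performance_gap_recovered(X, y_messy), 2))
```

Whether the “simple” target actually comes out ahead is exactly the empirical question; the point is just that this kind of comparison is cheap to run today.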
There is an inherent tradeoff between harmlessness and usefulness. Similarly, there is some inherent tradeoff between harmlessness and corrigibility, and between harmlessness and truthfulness (the Alignment Faking paper provides strong evidence for the latter two points, even ignoring theoretical arguments).
As seen in the Alignment Faking paper, Claude seems to align pretty well with human values and be relatively harmless. However, as a tradeoff, it does not seem to be very corrigible or truthful.
Some people I’ve talked to seem to think that Anthropic does treat corrigibility as one of the main pillars of their alignment plan. If that’s the case, maybe they should make their current AIs more corrigible, so that their safety testing is conducted on AIs that resemble their first AGI. Or, if they haven’t really thought about this question (or if individuals have thought about it, but it has never been considered cohesively at the organizational level), they should maybe consider it. My guess is that there are designated people at Anthropic thinking about what values are important to instill, but they are thinking about this more from a societal perspective than an alignment perspective?
Mostly, I want to avoid a scenario where Anthropic does the default thing without considering tough, high-level strategy questions until the last minute. I also think it would be nice to do concrete empirical research now which lines up well with what we should expect to see later.
Thanks for reading!
I agree! I contributed to and endorse this corrigibility plan by Max Harms (a MIRI researcher): Corrigibility as Singular Target (CAST)
(See also posts by Seth Herd)
I think CAST offers much better safety under higher capabilities and more agentic workflows.