I definitely agree that CoT interpretability isn’t the end of the story, but to respond to this:
I agree that LLMs in general look like a simpler alignment problem than many of the others we could have been faced with, and having explicit reasoning steps seems to help further. I think monitoring still isn’t entirely straightforward, since:

1. CoT can be unfaithful to the actual reasoning process.
2. I suspect that advanced models will understand that we’re monitoring their CoT, which seems likely to affect them.
3. I would speculate that if such models were to become deceptively misaligned, they might very well be able to make CoT look innocuous while still using it to scheme.
4. At least in principle, putting optimization pressure on innocuous-looking CoT could result in models which were deceptive without deceptive intent.
I agree with 1, which is why the CoT will absolutely have to be faithful.
I agree with 2, but conditional on relatively weak forward passes, with most of the bottleneck to reasoning running through the CoT, there is little the model can do about the situation short of exfiltrating itself, and that is likely to be caught if we look for it in the CoT (a rough sketch of this kind of monitoring is below), since the model can’t take over without the CoT reasoning.
I also disagree with 3, at least assuming relatively weak forward passes and the bottleneck to reasoning being largely through the CoT, but this is a plausible enough threat model that we can’t rely solely on the CoT.
I basically disagree with 4, because I’m generally skeptical of deep-deception threat models. I think we may well be able to prevent self-exfiltration of the model, so that it doesn’t get access to much more compute with which to gain power, and more generally I think that, at least in the early years of AGI, there will be little ability to externalize compute/cognition beyond the servers the model lives on, for somewhat similar (though not identical) reasons to why I don’t expect open source to matter much in AI takeoff.
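To make the kind of intervention I have in mind concrete, here is a minimal sketch (my own illustration, not anyone’s actual setup) of CoT monitoring gating an agent’s actions. The callables `agent_model`, `monitor_model`, and `execute`, and the threshold value, are hypothetical placeholders:

```python
# Minimal sketch: a monitor reads the visible CoT and can block the proposed
# action before it executes. All callables here are hypothetical placeholders.

SUSPICION_THRESHOLD = 0.01  # illustrative: tuned for a very low false-negative rate

def run_step(agent_model, monitor_model, execute, context):
    # The agent produces a visible chain of thought plus a proposed action.
    cot, proposed_action = agent_model(context)

    # The monitor sees only the CoT and scores it for signs of
    # self-exfiltration or takeover planning.
    suspicion = monitor_model(cot)

    if suspicion > SUSPICION_THRESHOLD:
        # Intervene before anything runs: block the step and escalate for review.
        return {"status": "blocked", "cot": cot, "suspicion": suspicion}

    result = execute(proposed_action)  # only reached if the CoT looks clean
    return {"status": "executed", "cot": cot, "result": result}
```

The point of the sketch is just that, if the hard steps of takeover have to show up in the CoT, this is the layer at which we get to intervene.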
I agree with 1, which is why the CoT will absolutely have to be faithful.
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won’t really need the CoT in the first place).
I agree with 2, but conditional on relatively weak forward passes, with most of the bottleneck to reasoning running through the CoT, there is little the model can do about the situation short of exfiltrating itself...
I also disagree with 3, at least assuming relatively weak forward passes and the bottleneck to reasoning being largely through the CoT
I don’t have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output so that its later forward passes get more than 0.0 bits of information which are useful to it but which aren’t legible to monitors, and those bits can accumulate over the course of an extended context.
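To put rough numbers on that intuition (the numbers here are mine and purely illustrative, not an estimate of real leakage rates): even a per-step leak far too small to be legible adds up over a long episode.

```python
# Toy arithmetic, with made-up numbers, for how sub-bit-per-step leakage
# through innocuous-looking CoT could accumulate over an extended context.

leak_bits_per_step = 0.05    # hypothetical illegible leakage per CoT step
steps_in_context = 20_000    # hypothetical length of an extended agentic episode

total_hidden_bits = leak_bits_per_step * steps_in_context
print(f"hidden channel capacity: {total_hidden_bits:.0f} bits (~{total_hidden_bits/8:.0f} bytes)")
# -> hidden channel capacity: 1000 bits (~125 bytes)
```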
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won’t really need the CoT in the first place).
Good point, I hadn’t thought about interpretability being strong enough that we wouldn’t need the CoT.
I don’t have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output so that its later forward passes get more than 0.0 bits of information which are useful to it but which aren’t legible to monitors, and those bits can accumulate over the course of an extended context.
There will be more than 0.0 bits of this kind of leakage toward attempting a takeover, assuming the AI is misaligned, but the key difficulty here is that takeover is still hard, at least without a rogue deployment. And while we might not be able to prevent steganography from occurring at all, we could well limit it enough that the key takeover steps are bottlenecked through the CoT, which we can intervene on.
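As one hedged illustration of what “limiting the ability to do this” could look like: paraphrasing each CoT step with a separate trusted model before it goes back into the agent’s context, so that information riding on exact word choice or formatting tends not to survive, while the legible reasoning does. The callables here are hypothetical placeholders, just to show the shape of the intervention:

```python
# Sketch: paraphrase each CoT step with a trusted model before feeding it back,
# so steganographic encodings that depend on surface details are unlikely to
# survive, while legible reasoning (which we can monitor) is preserved.
# `agent_model` and `trusted_paraphraser` are hypothetical placeholders.

def agent_loop(agent_model, trusted_paraphraser, task, max_steps=10):
    context = [task]
    for _ in range(max_steps):
        cot, action = agent_model(context)
        # The agent's next forward passes only ever see the paraphrased CoT,
        # not the exact tokens it originally emitted.
        context.append(trusted_paraphraser(cot))
        context.append(f"action taken: {action}")
    return context
```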