I agree with 1, which is why the CoT will absolutely have to be faithful.
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won’t really need the CoT in the first place).
I agree with 2, but conditional on relatively weak forward passes, and most of the bottleneck to reasoning running through the CoT, there is little the model can do about the situation short of exfiltrating itself...
I also disagree with 3, at least assuming relatively weak forward passes and the bottleneck to reasoning running largely through the CoT.
I don’t have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output in ways that provide its later forward passes with more than 0.0 bits of information which are useful to itself but which aren’t legible to monitors, and those can accumulate over the course of an extended context.
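A minimal back-of-envelope sketch of that intuition (Python, with made-up numbers; the choice-point model is just an assumption for illustration):

```python
import math

# Toy model (assumed numbers): at each "choice point" the model picks among k
# semantically equivalent phrasings, so each choice can carry up to log2(k)
# bits that a meaning-level monitor wouldn't flag.
k = 4              # interchangeable phrasings per choice point (assumed)
choice_points = 3  # choice points per message (assumed)
messages = 500     # messages in an extended context (assumed)

bits_per_message = choice_points * math.log2(k)
total_bits = bits_per_message * messages
print(f"{bits_per_message:.1f} bits/message, ~{total_bits:.0f} bits over the context")
# -> 6.0 bits/message, ~3000 bits over the context
```

The real usable capacity would be much lower once the model also has to reliably decode its own encoding in later forward passes, but the point that small per-message leaks compound over a long context still goes through.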
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won’t really need the CoT in the first place).
Good point, I hadn’t thought about interpretability being strong enough that we wouldn’t need the CoT.
I don’t have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output in ways that provide its later forward passes with more than 0.0 bits of information which are useful to itself but which aren’t legible to monitors, and those can accumulate over the course of an extended context.
There will be more than 0.0 bits available for attempting a takeover, assuming the AI is misaligned, but the key difficulty here is that takeover is still hard, at least without a rogue deployment, and while we might not be able to prevent steganography entirely, we could well limit it enough that the key takeover steps are bottlenecked through the CoT, which we can intervene on.
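One concrete (hypothetical) way to limit the bandwidth without preventing it entirely is to paraphrase the CoT with a trusted model before it re-enters the context, so that any bits not carried by the plain meaning get destroyed. A rough sketch, where `untrusted_model`, `trusted_paraphraser`, and `monitor` are made-up stand-ins rather than a real API:

```python
# Sketch of a pipeline where the untrusted model's only channel to its own
# later forward passes is monitored, paraphrased text.

def run_step(untrusted_model, trusted_paraphraser, monitor, context, task):
    # Untrusted model produces a reasoning step.
    cot_step = untrusted_model.generate(context + [task])

    # Monitor reads the step; flagged steps get escalated instead of used.
    if monitor.flags(cot_step):
        raise RuntimeError("suspicious CoT step, escalate for review")

    # A trusted model rewrites the step, keeping the plain meaning but
    # destroying word-choice/formatting channels that could smuggle bits
    # to later forward passes.
    sanitized = trusted_paraphraser.paraphrase(cot_step)

    # Only the sanitized version enters the context for future steps.
    return context + [sanitized]
```

This doesn't make steganography impossible (encodings in the meaning itself survive paraphrasing), but it cuts the bandwidth enough that the heavy lifting for anything like a takeover plausibly has to happen in the legible CoT.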