Putting what I think is the most important part of my reply first: I think research into mitigations is premature; demonstrating and measuring the phenomenon should take priority instead.
With that caveat, I agree that these are all possible mitigations to the phenomenon, in particular (rephrasing your points):
- Reducing the channel capacity, and thus the amount of information that can be transmitted
- Constraining the outputs, and thus reducing the degrees of freedom of the null space
- Adding noise (via dropout or otherwise), which also reduces the channel capacity and probably has some desirable side effects (like increasing the diversity or creativity of reasoning)
- Introducing “barriers” past which optimization for hidden information becomes difficult (in your example, you use different models; I think “stop gradients” in end-to-end training have a similar, if weaker, effect); see the toy sketch after this list
- If you can detect/localize the phenomenon, you might be able to use interpretability-like techniques to limit or constrain it. I’m less certain about this, but possibly future interpretability research will unlock that ability.
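To make a couple of these concrete, here is a minimal toy sketch (my own illustration, not anything from the original proposal), assuming a continuous vector stands in for the intermediate chain-of-thought text and that `writer`/`reader` are hypothetical stand-ins for the models producing and consuming it:

```python
# Toy sketch of two mitigations from the list above: (1) dropout noise on the
# intermediate "message" shrinks the capacity of any hidden side-channel, and
# (2) a stop-gradient "barrier" keeps the reader's loss from optimizing the
# writer end-to-end, removing the main pressure toward steganographic encodings.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
writer = nn.Linear(dim, dim)   # hypothetical stand-in: produces the intermediate reasoning
reader = nn.Linear(dim, 1)     # hypothetical stand-in: consumes it to produce the answer
noise = nn.Dropout(p=0.5)      # mitigation 1: noise reduces channel capacity

x = torch.randn(8, dim)        # a batch of toy inputs
target = torch.randn(8, 1)

message = noise(writer(x))     # any hidden information must survive the dropout mask

# Mitigation 2: the stop-gradient "barrier". The reader still learns from the
# message, but no gradient flows back into the writer through it.
prediction = reader(message.detach())

loss = F.mse_loss(prediction, target)
loss.backward()                # reader.weight.grad is populated; writer.weight.grad stays None
```

In a real setup the intermediate reasoning is discrete text rather than a vector, so the noise would have to act on token embeddings or via paraphrasing, but the capacity-reduction and gradient-blocking logic is the same.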
Also, I think an important point here is that we can expect to see this phenomenon with fairly simple techniques: merely including sampled/generated data in the training set is sufficient, as opposed to requiring a complicated reinforcement learning algorithm like MuZero.
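As a rough illustration of how little machinery is needed (my own sketch; `sample_reasoning`, `grade`, and `finetune` are hypothetical placeholders, not any particular API), the selection pressure already appears in a plain self-training loop:

```python
# Hypothetical self-training loop: no MuZero-style search or explicit RL needed.
# Sampled reasoning that leads to a well-graded answer re-enters the training set,
# so whatever that reasoning happens to encode (including hidden information that
# helps earn good grades) is reinforced on the next round.
def self_training_round(model, prompts, dataset):
    for prompt in prompts:
        reasoning, answer = model.sample_reasoning(prompt)  # sampled/generated data
        if grade(prompt, answer):                           # keep only what scores well
            dataset.append((prompt, reasoning, answer))
    model.finetune(dataset)  # gradient step on the kept samples
    return model
```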
Did you publish your proposal? I’d be interested in reading it.
Yeah, I agree that demonstrating and examining it is very important, especially in chain-of-thought, where the existence of hidden information isn’t a certainty.
They’re a bit out of date, but the original proposal is ELK Sub—Note-taking in internal rollouts, and the multiple-model protocol is in Note Taking Without Hidden Messages.
I haven’t done any proper testing yet, but it’s high up on my agenda; I’d be interested in working out which tests would be best, and will elaborate later.