I think it would be due to the LM in question using lots of language-neutral circuitry? See this paper.
RLHF mostly updates abstract/conceptual circuits, which (I assume) tend to be language-neutral; the language-specific circuits then just keep translating to and from the updated circuits. A crude way to probe this is sketched below.
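One very rough sanity check (not anything from the linked paper, just a sketch): compare per-layer weight deltas between a base checkpoint and its RLHF-tuned variant. If RLHF mostly touches abstract circuitry, one crude prediction is that updates concentrate in the middle layers, which interpretability work often associates with more language-neutral features. Weight deltas are a weak proxy for "which circuits changed" (activation-level analysis would be more direct), and the model names here are placeholders for whatever base/RLHF pair you have.

```python
# Sketch, assuming a matched base/RLHF checkpoint pair (placeholder names).
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model")         # hypothetical name
tuned = AutoModelForCausalLM.from_pretrained("rlhf-tuned-model")  # hypothetical name

# Relative update size per parameter tensor; architectures must match,
# which holds for a model and its own fine-tune.
deltas = {}
for (name, p_base), (_, p_tuned) in zip(
    base.named_parameters(), tuned.named_parameters()
):
    with torch.no_grad():
        deltas[name] = ((p_tuned - p_base).norm() / (p_base.norm() + 1e-8)).item()

# Aggregate by transformer block index (assumes Llama-style names
# like "model.layers.12.self_attn.q_proj.weight").
per_layer = {}
for name, d in deltas.items():
    parts = name.split(".")
    if "layers" in parts:
        idx = int(parts[parts.index("layers") + 1])
        per_layer.setdefault(idx, []).append(d)

# If the hypothesis is right, middle layers should show the largest deltas.
for idx in sorted(per_layer):
    print(idx, sum(per_layer[idx]) / len(per_layer[idx]))
```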
Is it an established fact that RLHF updates abstract circuits? And if it does, why would it suffer from mode collapse?