In the human brain there is quite a lot of redundancy in how information is encoded. This could be for a variety of reasons.
Here’s one hot take: in a brain or a language model, I can imagine that during early learning the network hasn’t learned a concept like “how to code” well enough to recognize that each training instance is an instance of the same thing. Consequently, during that early learning stage, the model just encodes a variety of representations for what turns out to be the same thing. As training proceeds, it starts to match each subsequent training example to prior examples and can encode the information more efficiently.
Then adding multiple steering vectors triggers a refusal simply because the “code for making a bomb” signal gets amplified and more easily trips the RLHF-derived circuit for “refuse to answer”.
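The amplification idea can be sketched with a toy example (everything here is made up for illustration: the “refusal direction”, the threshold, and the vectors are hypothetical, not taken from any real model). The point is just that if several redundant steering vectors each carry a small component along the same direction, summing them amplifies that shared component until it crosses a threshold that no single vector would cross:

```python
# Toy illustration (all names and numbers are hypothetical):
# redundant steering vectors that each share a small component along
# a "refusal" direction sum to an activation that crosses a threshold.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical "refuse to answer" direction learned via RLHF.
refusal_direction = [1.0, 0.0, 0.0]

# Three redundant steering vectors for the same concept; each has a
# small positive component along the refusal direction.
steering_vectors = [
    [0.4, 0.9, 0.1],
    [0.5, -0.2, 0.8],
    [0.3, 0.1, -0.7],
]

THRESHOLD = 1.0  # hypothetical activation needed to trigger a refusal

# A single vector stays below the threshold...
single = dot(steering_vectors[0], refusal_direction)
print(single < THRESHOLD)  # True: no refusal

# ...but summing the redundant vectors amplifies the shared component.
summed = [sum(components) for components in zip(*steering_vectors)]
combined = dot(summed, refusal_direction)
print(combined > THRESHOLD)  # True: the refusal circuit fires
```

This is only a geometric cartoon of the claim, of course; real refusal behavior in a network wouldn’t be a single linear threshold, but the directional amplification mechanism is the same.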