Interestingly, after a certain layer, the first principal component becomes identical to the mean difference between harmful and harmless activations.
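(To make the claim concrete, here is a minimal sketch of the check I have in mind: compare the first principal component of the pooled activations against the difference-in-means direction via cosine similarity. The array names and file paths are hypothetical placeholders, not from the post.)

```python
import numpy as np

# Hypothetical residual-stream activations at one layer, shape
# (n_prompts, d_model); placeholder files, not from the post.
harmful_acts = np.load("harmful_acts.npy")
harmless_acts = np.load("harmless_acts.npy")

# Difference-in-means ("refusal") direction, normalized.
diff_means = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
diff_means /= np.linalg.norm(diff_means)

# First principal component of the pooled, centered activations (via SVD).
acts = np.concatenate([harmful_acts, harmless_acts], axis=0)
acts -= acts.mean(axis=0)
_, _, vt = np.linalg.svd(acts, full_matrices=False)
pc1 = vt[0]

# |cos| near 1.0 means PC1 matches the mean-difference direction
# (the sign of a principal component is arbitrary).
print(abs(pc1 @ diff_means))
```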
Do you think this can be interpreted as the model having its focus entirely on “refusing to answer” from layer 15 onwards? And, if so, can it be interpreted as the model not coherently evaluating other potential moves/choices over these layers? The idea is that it could still evaluate other moves within a single layer (after layer 15), but not over several layers, since the residual stream is not updated significantly.
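(One hedged way to operationalize “the residual stream is not updated significantly”: measure the relative size of each layer's update to the stream. A sketch below, assuming per-layer residual snapshots are available; the file name is a placeholder.)

```python
import numpy as np

# Hypothetical residual-stream snapshots for one prompt at the final token,
# shape (n_layers + 1, d_model); placeholder file, not from the post.
resid = np.load("resid_stream.npy")

# Relative update size per layer: ||x_{l+1} - x_l|| / ||x_l||.
# Small values from layer 15 onward would support the reading above.
deltas = np.linalg.norm(np.diff(resid, axis=0), axis=1)
norms = np.linalg.norm(resid[:-1], axis=1)
for layer, rel in enumerate(deltas / norms):
    print(f"layer {layer:2d}: relative update {rel:.3f}")
```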
In particular, can we interpret this as the model not thinking coherently over several layers about other policies it could choose (e.g., deceptive policies like defecting from the policy of “refusing to answer”)? I wonder whether we would observe something different if the model were trained to defect from this policy conditional on some hard-to-predict trigger (e.g., whether the model is in training or deployment).