Potentially relevant: the theoretical results from "On Effects of Steering Latent Representation for Large Language Model Unlearning". From Claude chat: 'Point 2 refers to the theoretical analysis provided in the paper on three key aspects of the Representation Misdirection for Unlearning (RMU) method:
Effect on token confidence:
The authors hypothesized that RMU causes randomness in the logits of generated tokens.
They proved that the logit of a token generated by an RMU model follows a Normal distribution.
This randomness in logits is interpreted as low confidence, leading to nonsensical or incorrect responses.
The analysis suggests that the variance of the logit distribution is influenced by the coefficient value and properties of the model layers.
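A toy numerical illustration of the confidence claim (not the paper's actual derivation; the vocabulary size and logit variance below are made-up values): if a token's logits are pure Normal noise, the probability assigned to any fixed "correct" token collapses to chance level, which is the low confidence behind nonsensical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000        # toy vocabulary size (illustrative)
sigma = 1.0     # logit standard deviation (assumed value)
trials = 2000

p_correct = []  # probability the model puts on the "correct" token
hits = 0        # how often greedy decoding picks that token
for _ in range(trials):
    logits = rng.normal(0.0, sigma, size=V)   # noise-like logits
    p = np.exp(logits - logits.max())         # stable softmax
    p /= p.sum()
    p_correct.append(p[0])                    # token 0 plays the correct token
    hits += int(logits.argmax() == 0)

mean_p = float(np.mean(p_correct))   # close to chance level 1/V
hit_rate = hits / trials             # greedy decoding is also chance level
```

By symmetry the expected probability of the correct token is exactly 1/V, no matter which token is correct: the logits carry no information about the input.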
Impact of the coefficient value:
The coefficient c in RMU affects how well the forget-sample representations align with the random vector.
Theoretical analysis showed a positive correlation between the coefficient and this alignment: larger coefficient values lead to stronger alignment between the forget-sample representations and the random vector.
However, the optimal coefficient value varies depending on the layer of the model being unlearned.
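The coefficient's effect can be sketched with a toy gradient descent on the RMU forget term ||h − c·u||² (the dimension, learning rate, and step count below are arbitrary choices, and the paper's layer dependence is not modeled): with a fixed optimization budget, the leftover noise from the initial representation is the same for every c, so a larger c leaves the representation relatively closer to the direction of u.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                        # toy hidden dimension
u = rng.normal(size=d)
u /= np.linalg.norm(u)        # random unit steering vector, as in RMU

def alignment_after_unlearning(c, steps=20, lr=0.05):
    """Cosine similarity between a forget representation and u after a
    fixed number of gradient steps on the forget loss ||h - c*u||^2."""
    h = rng.normal(size=d)                 # initial forget representation
    for _ in range(steps):
        h -= lr * 2.0 * (h - c * u)        # gradient of the forget loss
    return float(h @ u / np.linalg.norm(h))  # ||u|| = 1

a_small, a_mid, a_large = (alignment_after_unlearning(c) for c in (1.0, 10.0, 100.0))
```

Under the same step budget, alignment increases monotonically with c, matching the stated positive correlation.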
Robustness against adversarial jailbreak attacks:
The authors analyzed RMU’s effectiveness against white-box jailbreak attacks from an attack-defense game perspective.
They showed that RMU causes the attacker to receive unreliable and uninformative gradient signals.
This makes it difficult for the attacker to identify effective adversarial token substitutions.
The analysis explains why RMU models demonstrate strong robustness against methods like Greedy Coordinate Gradient (GCG) attacks.
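The gradient-signal claim can be caricatured as follows (a deliberately crude stand-in, not the actual GCG objective or an RMU model): GCG-style attacks rank candidate token swaps by a first-order Taylor estimate of the loss change, and that estimate only helps when the loss surface is smooth. Here a rapidly oscillating function plays the role of noise-like logits, and the first-order estimate stops predicting the true effect of a swap.

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 32, 200                      # toy embedding dim and vocab size
E = rng.normal(size=(V, d))         # token embedding table (illustrative)
e = rng.normal(size=d)              # current adversarial-token embedding
t = np.ones(d) / np.sqrt(d)         # target direction for the smooth loss

def loss_smooth(x):
    """Well-behaved objective: gradients are informative."""
    return -float(x @ t)

def loss_noisy(x):
    """Rapidly oscillating stand-in for noise-like RMU logits."""
    return float(np.sin(1e3 * x).sum())

grad_smooth = -t                    # exact gradient of loss_smooth at e
grad_noisy = 1e3 * np.cos(1e3 * e)  # exact gradient of loss_noisy at e

def swap_rank_quality(loss, g):
    """Correlation between the first-order (GCG-style) estimate of each
    token swap's effect and the true change in loss."""
    predicted = (E - e) @ g
    actual = np.array([loss(E[j]) for j in range(V)]) - loss(e)
    return float(np.corrcoef(predicted, actual)[0, 1])

q_smooth = swap_rank_quality(loss_smooth, grad_smooth)  # 1.0: exact predictor
q_noisy = swap_rank_quality(loss_noisy, grad_noisy)     # near 0: uninformative
```

In the noisy case the gradient exists and is cheap to compute, but its ranking of candidate swaps is uncorrelated with their true effect, which is the sense in which the attacker's signal is unreliable and uninformative.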
The theoretical analysis provides a foundation for understanding why RMU works and how its parameters affect its performance. This understanding led to the development of the improved Adaptive RMU method proposed in the paper.'