Potentially relevant: the theoretical results from "On Effects of Steering Latent Representation for Large Language Model Unlearning". From Claude chat: 'Point 2 refers to the theoretical analysis provided in the paper on three key aspects of the Representation Misdirection for Unlearning (RMU) method:
Effect on token confidence:
The authors hypothesized that RMU causes randomness in the logits of generated tokens.
They proved that the logit of a token generated by an RMU model follows a Normal distribution.
This randomness in logits is interpreted as low confidence, leading to nonsensical or incorrect responses.
The analysis suggests that the variance of the logit distribution is influenced by the coefficient value and properties of the model layers.
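A toy numerical illustration of the confidence claim (not the paper's actual derivation; the vocabulary size and logit variance below are made-up values): if a token's logits are pure Normal noise, the probability assigned to any fixed "correct" token collapses to chance level, which is the low confidence behind nonsensical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000        # toy vocabulary size (illustrative)
sigma = 1.0     # logit standard deviation (assumed value)
trials = 2000

p_correct = []  # probability the model puts on the "correct" token
hits = 0        # how often greedy decoding picks that token
for _ in range(trials):
    logits = rng.normal(0.0, sigma, size=V)   # noise-like logits
    p = np.exp(logits - logits.max())         # stable softmax
    p /= p.sum()
    p_correct.append(p[0])                    # token 0 plays the correct token
    hits += int(logits.argmax() == 0)

mean_p = float(np.mean(p_correct))   # close to chance level 1/V
hit_rate = hits / trials             # greedy decoding is also chance level
```

By symmetry the expected probability of the correct token is exactly 1/V, no matter which token is correct: the logits carry no information about the input.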
Impact of the coefficient value:
The coefficient c in RMU affects how well the forget-sample representations align with the random vector.
Theoretical analysis showed a positive correlation between the coefficient and this alignment: larger coefficient values lead to stronger alignment between the forget-sample representations and the random vector.
However, the optimal coefficient value varies depending on the layer of the model being unlearned.
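The coefficient's effect can be sketched with a toy gradient descent on the RMU forget term ||h − c·u||² (the dimension, learning rate, and step count below are arbitrary choices, and the paper's layer dependence is not modeled): with a fixed optimization budget, the leftover noise from the initial representation is the same for every c, so a larger c leaves the representation relatively closer to the direction of u.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                        # toy hidden dimension
u = rng.normal(size=d)
u /= np.linalg.norm(u)        # random unit steering vector, as in RMU

def alignment_after_unlearning(c, steps=20, lr=0.05):
    """Cosine similarity between a forget representation and u after a
    fixed number of gradient steps on the forget loss ||h - c*u||^2."""
    h = rng.normal(size=d)                 # initial forget representation
    for _ in range(steps):
        h -= lr * 2.0 * (h - c * u)        # gradient of the forget loss
    return float(h @ u / np.linalg.norm(h))  # ||u|| = 1

a_small, a_mid, a_large = (alignment_after_unlearning(c) for c in (1.0, 10.0, 100.0))
```

Under the same step budget, alignment increases monotonically with c, matching the stated positive correlation.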
Robustness against adversarial jailbreak attacks:
The authors analyzed RMU’s effectiveness against white-box jailbreak attacks from an attack-defense game perspective.
They showed that RMU causes the attacker to receive unreliable and uninformative gradient signals.
This makes it difficult for the attacker to identify effective adversarial token substitutions.
The analysis explains why RMU models demonstrate strong robustness against methods like Greedy Coordinate Gradient (GCG) attacks.
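The gradient-signal claim can be caricatured as follows (a deliberately crude stand-in, not the actual GCG objective or an RMU model): GCG-style attacks rank candidate token swaps by a first-order Taylor estimate of the loss change, and that estimate only helps when the loss surface is smooth. Here a rapidly oscillating function plays the role of noise-like logits, and the first-order estimate stops predicting the true effect of a swap.

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 32, 200                      # toy embedding dim and vocab size
E = rng.normal(size=(V, d))         # token embedding table (illustrative)
e = rng.normal(size=d)              # current adversarial-token embedding
t = np.ones(d) / np.sqrt(d)         # target direction for the smooth loss

def loss_smooth(x):
    """Well-behaved objective: gradients are informative."""
    return -float(x @ t)

def loss_noisy(x):
    """Rapidly oscillating stand-in for noise-like RMU logits."""
    return float(np.sin(1e3 * x).sum())

grad_smooth = -t                    # exact gradient of loss_smooth at e
grad_noisy = 1e3 * np.cos(1e3 * e)  # exact gradient of loss_noisy at e

def swap_rank_quality(loss, g):
    """Correlation between the first-order (GCG-style) estimate of each
    token swap's effect and the true change in loss."""
    predicted = (E - e) @ g
    actual = np.array([loss(E[j]) for j in range(V)]) - loss(e)
    return float(np.corrcoef(predicted, actual)[0, 1])

q_smooth = swap_rank_quality(loss_smooth, grad_smooth)  # 1.0: exact predictor
q_noisy = swap_rank_quality(loss_noisy, grad_noisy)     # near 0: uninformative
```

In the noisy case the gradient exists and is cheap to compute, but its ranking of candidate swaps is uncorrelated with their true effect, which is the sense in which the attacker's signal is unreliable and uninformative.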
The theoretical analysis provides a foundation for understanding why RMU works and how its parameters affect its performance. This understanding led to the development of the improved Adaptive RMU method proposed in the paper.'