Any thoughts on how helpful, or how much more robust, unlearning approaches like RMU, or even Tamper-Resistant Safeguards for Open-Weight LLMs, might be against this threat model, especially in the context of WMDP-related misuse?