Any thoughts on how helpful, or how much more robust, unlearning approaches like RMU, or even Tamper-Resistant Safeguards for Open-Weight LLMs, might be against this threat model, especially in the context of WMDP-related misuse?