This would seem like a great benchmark/dataset/eval to apply automated research to. Do you have any thoughts or recommendations on that? For example, how worried should one be about risks from Goodharting?
Later edit: I guess something like this has already been tried, e.g. by Tamper-Resistant Safeguards for Open-Weight LLMs and other approaches combining unlearning with meta-learning, though not necessarily with exactly the same motivation.
Later later edit: Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization seems to show robustness against relearning using a methodology similar to the one in this post.