Side note: skills vs facts. I framed some of the approaches as teaching false facts / removing some facts, and concrete raw facts are often the easiest thing to evaluate. But I think the same approach can be used to remove the information that lets LLMs use most skills. For example, writing Python code requires knowledge of Python syntax, and writing recursive functions requires knowledge of certain patterns (at least for current LLMs, which can't rederive recursion from scratch, especially without a scratchpad); both of these could be unlearned just like things that are more clearly "facts".
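To make concrete what "certain patterns" means here, a minimal illustration (my own example, not part of any unlearning setup): the base-case / recursive-step template that a model has to have internalized in order to produce even trivial recursive code.

```python
# The memorized "pattern" behind recursive code: a base case that stops the
# recursion, plus a recursive step that shrinks the problem. A model that had
# this pattern unlearned would struggle to write even this simple function.
def factorial(n: int) -> int:
    if n <= 1:                        # base case: stop recursing
        return 1
    return n * factorial(n - 1)       # recursive step: reduce the problem

assert factorial(5) == 120
```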
If you read e.g. “Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level”, current thinking in mechanistic interpretability is that simple memorization of otherwise random facts (i.e. ones where there are no helpful “rules of thumb” to get many cases right) uses different kinds of learned neural circuits than learning of skills does. If this is in fact the case, then unlearning of facts and skills might have different characteristics. In particular, for learned skills, if mechanistic interpretability can locate and interpret the learned circuit implementing a skill, then we can edit it directly (as has been done successfully in a few cases). However, we’ve so far had no luck interpreting neural circuitry for large collections of basically-random facts, and there are some theoretical arguments suggesting that such circuitry may be inherently hard to interpret, at least with current techniques.