I think something that might be quite illuminating for this factual recall question (as well as having potential safety uses) is editing the facts. Suppose, for this case, you take just the layer 2-6 MLPs with frozen attention, warm-start from the existing model, add a loss term that penalizes changes to the model away from that initialization (perhaps using a combination of L1 and L2 norms), and train on a dataset consisting of your full set of athlete name embeddings and their sports (in a frequency ratio matching the original training data, so with more copies of data about more famous athletes), but with one factual edit changing the sport of a single athlete. Train until it has memorized this slightly edited set of facts reasonably well, then look at the pattern of weights and neurons that changed significantly. Repeat several times for the same athlete to see how consistent/stochastic the results are, and repeat for many athlete/sport edit combinations. This should give you a lot of information on how widely distributed the representation of the data on a specific athlete is, and ought to give you enough information to distinguish not only between your single-step vs. hashing hypotheses, but also things like bagging and combinations of different algorithms. Even more interesting would be to do this not just for an athlete's sport, but also for some independent fact about them, like their number.
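A minimal sketch of what this might look like, just to make the setup concrete (assuming a PyTorch-style model whose layer-i MLP parameters have names like `blocks.{i}.mlp...`, and a hypothetical `edited_dataset` of (prompt, sport-token) pairs with the one fact changed; all the names and coefficients here are illustrative, not from the original post):

```python
import torch

# `model` and `edited_dataset` are assumed to exist: the dataset yields
# (prompt_tokens, target_sport_token) pairs for the full athlete set,
# sampled at the original frequency ratios, with one fact edited.
init_state = {n: p.detach().clone() for n, p in model.named_parameters()}

# Train only the layer 2-6 MLP parameters; freeze everything else
# (attention, embeddings, all other layers).
trainable = []
for name, p in model.named_parameters():
    in_target_mlp = any(f"blocks.{i}.mlp" in name for i in range(2, 7))
    p.requires_grad_(in_target_mlp)
    if in_target_mlp:
        trainable.append((name, p))

opt = torch.optim.Adam([p for _, p in trainable], lr=1e-4)
lambda_l1, lambda_l2 = 1e-4, 1e-3   # strength of the "stay near init" penalty

for prompts, targets in edited_dataset:
    logits = model(prompts)[:, -1, :]            # predict the sport token
    task_loss = torch.nn.functional.cross_entropy(logits, targets)

    # Penalize movement away from the warm-start weights (L1 + L2).
    reg = 0.0
    for name, p in trainable:
        delta = p - init_state[name]
        reg = reg + lambda_l1 * delta.abs().sum() + lambda_l2 * delta.pow(2).sum()

    loss = task_loss + reg
    opt.zero_grad()
    loss.backward()
    opt.step()

# Afterwards, the per-parameter deltas show where the edit "landed".
deltas = {name: (p.detach() - init_state[name]).abs() for name, p in trainable}
```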
Of course, this is computationally somewhat expensive, and you do need to find metaparameters that don't make it too expensive, while encouraging the resulting change to be as localized as possible, consistent with getting a good score on the altered fact.
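Continuing the sketch above, one cheap way to compare repeated runs and metaparameter settings would be some concentration metric over the weight deltas, e.g. the fraction of the total change captured by the largest few individual deltas (again, just an illustration, not a claim about the right metric):

```python
# `deltas` is the dict of per-parameter |change| tensors from the sketch above.
all_deltas = torch.cat([d.flatten() for d in deltas.values()])
k = 1000  # hypothetical cutoff; must be <= all_deltas.numel()
topk_mass = all_deltas.topk(k).values.sum()
localization = (topk_mass / all_deltas.sum()).item()
print(f"fraction of total change in the top {k} weights: {localization:.3f}")
```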
This is somewhat similar to the approach of the ROME paper, which has been shown to not actually do fact editing, just inserting louder new facts that drown out (and maybe suppress) the old ones.
In general, the problem with optimising model behavior as a localisation technique is that you can't distinguish between something that truly edits the fact, and something that adds, in another layer, a new fact that cancels out the old one.