Neel Nanda comments on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda 21 Oct 2024 14:40 UTC
This is somewhat similar to the approach of the ROME paper, which has been shown not to actually edit facts: it just inserts louder facts that drown out the old ones, and maybe suppresses the old ones.

In general, the problem with optimising model behavior as a localisation technique is that you can’t distinguish between something that truly edits the fact where it is stored, and something that leaves the original fact in place and adds a new component in another layer that cancels it out and writes in something new.
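
As a rough illustration of the point, here is a toy sketch. It is not ROME and not a real transformer: it assumes a hypothetical two-layer linear model in which both layers read the subject embedding and write into a shared residual stream, and all names (`rank_one`, `forward`, `W1`, `W2`) are made up for the example. One edit overwrites the fact where it is stored; the other leaves that layer untouched and adds a cancelling write elsewhere. The outputs are identical, so behavior alone cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy assumption: the subject and the facts are random vectors.
subject = rng.normal(size=d)
old_fact = rng.normal(size=d)
new_fact = rng.normal(size=d)


def rank_one(key, value):
    """Matrix that maps `key` exactly to `value` (rank-one associative write)."""
    return np.outer(value, key) / (key @ key)


def forward(x, W1, W2):
    """Toy residual model: both layers read x and add their write to the stream."""
    return x + W1 @ x + W2 @ x


# Original model: the fact lives in layer 1, layer 2 does nothing.
W1 = rank_one(subject, old_fact)
W2 = np.zeros((d, d))

# Edit A ("true" edit): overwrite the fact where it is stored.
W1_a, W2_a = rank_one(subject, new_fact), W2

# Edit B (override): leave layer 1 alone; layer 2 cancels the old fact
# and writes the new one on top.
W1_b, W2_b = W1, rank_one(subject, new_fact - old_fact)

out_a = forward(subject, W1_a, W2_a)
out_b = forward(subject, W1_b, W2_b)

# Same behavior on the edited subject...
print(np.allclose(out_a, out_b))
# ...but under edit B the old fact is still stored in layer 1.
print(np.allclose(W1_b @ subject, old_fact))
```

In this toy setup, an optimiser that only scores the output has no reason to prefer edit A over edit B, so where it ends up placing the change says little about where the fact was originally stored.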