A couple of notes regarding the Reversal Curse paper.

I may not have emphasized it enough in the post, but part of the point of my post on ROME was that many AI researchers seemed to assume that nothing about how transformers are trained prevents them from understanding that A is B = B is A.
As I discussed in the comment above,
What is stored in the MLP is not the relationship between the facts. This is obvious because the relationship comes after the subject tokens. In other words, as we said before, the MLPs retrieve a bunch of factual knowledge, and then the attention modules pick out the correct fact (forgive the handwavy description) given what was retrieved and the relationship being asked about.
This means that the A token will ‘write’ some information into the residual stream, while the B token will ‘write’ other information into the residual stream. Some of that information may be the same, but not all. And so, if it’s different enough, the attention heads just won’t be able to pick up on the relevant information needed to know that B is A. However, if you include the A token in the prompt, the necessary information will be added to the residual stream, and it will be much more likely for the model to predict that B is A (as well as A is B).
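To make that picture concrete, here is a toy sketch in plain Python (not actual transformer internals; the facts, relations, and function names are all made up for illustration). The MLP stage is keyed only on the subject tokens that actually appear in the prompt, and the attention stage can only select among features that were already written:

```python
# Toy illustration of the story above: MLPs act as a subject-keyed store that
# "writes" features into the residual stream; attention then selects the one
# matching the relation in the prompt. All facts here are invented examples.
facts = {
    "Eiffel Tower": {"is located in": "Paris", "is made of": "iron"},
    "Paris": {"is the capital of": "France"},
}

def mlp_write(subject_tokens):
    """MLP stage: dump the stored features of every subject token that
    appears in the prompt into the 'residual stream'."""
    residual = {}
    for tok in subject_tokens:
        residual.update(facts.get(tok, {}))
    return residual

def attention_read(residual, relation):
    """Attention stage: pick out the feature matching the relation being
    asked about, if the MLPs wrote it; otherwise there is nothing to pick."""
    return residual.get(relation)

# Forward direction: the subject ("Eiffel Tower") is in the prompt.
print(attention_read(mlp_write(["Eiffel Tower"]), "is located in"))  # Paris

# Reverse-style prompt keyed on the object ("Paris"): the Eiffel Tower's
# features were never written, so the relevant information isn't there.
print(attention_read(mlp_write(["Paris"]), "is located in"))         # None
```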
From what I remember in the case of ROME, as soon as I added the edited token A to the prompt (or made the next predicted token A), the model could essentially predict B is A.
I describe what this means in the context of ROME below (found here in the post):
So, part of the story here is that the transformer stores the key for one entity (Eiffel Tower) separately from another (Rome). And so you’d need a second edit to say, “the tower in Rome is called the Eiffel Tower.”
Intuitively, as a human, if I told you that the Eiffel Tower is in Rome, you’d immediately be able to understand both of these things at once. For the ROME method, though, it’s as if these are two separate facts. For this reason, you can’t really equate ROME with how a human would naturally update on a fact. You could maybe imagine ROME as more like doing brain surgery on someone to change a fact.
The directional nature of transformers could mean that facts are stored somewhat differently than we’d infer from our experience with humans. What we see as one fact may be multiple facts for a transformer. Maybe bidirectional models are different. That said, ROME could be seen as brain surgery, which might mess things up internally and cause inconsistencies.
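Continuing the toy picture from the sketch above (still purely illustrative, not ROME’s actual API): because each stored fact is keyed on its subject string, covering both directions of the edited fact means writing two separate entries, i.e., two separate edits.

```python
# Purely illustrative, extending the toy subject-keyed store from earlier:
# an edit keyed on "Eiffel Tower" says nothing about what is stored under
# "Rome", so expressing the fact in both directions takes two separate edits.
edited_facts = {
    # Edit 1: "The Eiffel Tower is located in Rome."
    "Eiffel Tower": {"is located in": "Rome"},
    # Edit 2, needed separately: "The famous tower in Rome is the Eiffel Tower."
    "Rome": {"famous tower": "Eiffel Tower"},
}
```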
Regarding human intuition, @Neel Nanda says (link):

It looks like the model is representing its factual knowledge in a complex/distributed way, and that intervening on just one node does not propagate the change to the rest of the knowledge graph.
Why is this surprising at all then? My guess is that symmetry is intuitive to us, and we’re used to LLMs being capable of surprising and impressive things, so it’s weird to see something seemingly basic missing.
I actually have a bit of an updated (evolving) opinion on this:
Upon further reflection, it’s not obvious to me that humans and decoder-only transformers are that different. Could be that we both store info unidirectionally, but humans only see B->A as obvious because our internal loop is so optimized that we don’t notice the effort it takes.
Like, we just have a better system message than LLMs and that system message makes it super quick to identify relationships. LLMs would probably be fine doing the examples in the paper if you just adjusted their system message a little instead of leaving it essentially blank.
@cfoster0 asks:

How do you imagine the system message helping? If the information is stored hetero-associatively (K → V) like it is in a hash map, is there a way to recall in the reverse direction (V → K) other than with a giant scan?
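To make the hash-map analogy concrete before the response below (a minimal sketch; the stored fact is just an example): forward recall is a direct lookup, while reverse recall requires either a full scan or a second, reversed index built up front.

```python
# K -> V storage: forward recall is O(1), reverse recall needs a scan
# unless you maintain a separate reversed index.
fact_store = {"Eiffel Tower": "Paris"}          # subject -> object

print(fact_store["Eiffel Tower"])               # forward (K -> V): direct lookup

def reverse_lookup(store, value):
    """Reverse (V -> K) recall with no reverse index: scan every entry."""
    for k, v in store.items():
        if v == value:
            return k
    return None

print(reverse_lookup(fact_store, "Paris"))      # works, but requires a full scan

# The alternative is to build a second index ahead of time, which is roughly
# what training on reversed prompts amounts to.
reverse_index = {v: k for k, v in fact_store.items()}
print(reverse_index["Paris"])
```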
My response:
Yeah, I’d have to think about it, but I imagined something like, “Given the prompt, quickly outline related info to help yourself get the correct answer.” The model can probably output tokens that quickly surface the useful facts for itself as it does the forward pass.
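Concretely, the kind of system message I have in mind would look something like this (the wording and the example question are illustrative, not a tested prompt):

```python
# Illustrative chat-style prompt: a system message that asks the model to
# surface related facts before answering, so those tokens end up in context.
messages = [
    {
        "role": "system",
        "content": (
            "Before answering, briefly list facts you know about the entities "
            "mentioned in the question, then give your answer."
        ),
    },
    {"role": "user", "content": "The famous tower in Rome is called the..."},
]
print(messages)
```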
Now that I think about it, in the context of the paper this becomes nearly impossible unless you can somehow retrieve the specific relevant tokens used in the training set. Not sure how to prompt those out.
When I updated models with new facts using ROME, it wasn’t possible to get the updated fact out unless the edited token appeared somewhere in the prompt. As soon as it does appear, the model retrieves the new information from where the model was edited.
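Here is a minimal sketch of that check, assuming you already have a ROME-edited GPT-2-style model saved locally (the path and the edited fact are hypothetical; this is not the ROME repo’s own evaluation code):

```python
# Compare forward vs. reverse prompts against a (hypothetical) ROME-edited model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/rome-edited-gpt2"   # hypothetical local checkpoint
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

def complete(prompt, max_new_tokens=8):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Forward: the edited subject ("Eiffel Tower") is in the prompt, so the edited
# MLP weights are hit and the new fact ("Rome") should come out.
print(complete("The Eiffel Tower is located in the city of"))

# Reverse: the edited subject never appears, so nothing triggers the edit and
# the model typically fails to produce "Eiffel Tower".
print(complete("The famous tower located in Rome is called the"))
```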
Diversifying your dataset with reverse prompts, so that it contains the correct information in whichever direction it might be asked, feels so unsatisfying to me... it feels like there’s something missing.
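For what it’s worth, this is roughly what that kind of diversification looks like (the templates and the example fact are illustrative, in the spirit of the paper’s celebrity-parent examples):

```python
# Emit both orderings of each fact at training time, so the reverse direction
# is seen during training rather than left to (non-existent) generalization.
facts = [("Tom Cruise", "mother", "Mary Lee Pfeiffer")]

def augment(subject, relation, obj):
    forward = f"{subject}'s {relation} is {obj}."
    reverse = f"{obj} is {subject}'s {relation}."
    return [forward, reverse]

training_texts = [text for fact in facts for text in augment(*fact)]
print(training_texts)
# ["Tom Cruise's mother is Mary Lee Pfeiffer.",
#  "Mary Lee Pfeiffer is Tom Cruise's mother."]
```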
As I said, this is a bit of an evolving opinion. Still need time to think about this, especially regarding the differences between decoder-only transformers and humans.
Finally, from @Nora Belrose, this is worth pondering: