Detail about the ROME paper I’ve been thinking about
In the ROME paper, when you prompt the language model with “The Eiffel Tower is located in Paris”, you have the following:
Subject token(s): The Eiffel Tower
Relationship: is located in
Object: Paris
Once the model has seen the subject token(s) (e.g. Eiffel Tower), it will retrieve a whole bunch of factual knowledge (not just one thing, since it doesn’t yet know you will ask for something like location after the subject tokens) from the MLPs and ‘write’ it into the residual stream, so that the attention modules at the final token can look at the context, aggregate it, and retrieve the correct information.
In other words, if we take “The Eiffel Tower is located in”, the model will write various pieces of information about the Eiffel Tower into the residual stream once it gets to the layers with “factual” information (the early-middle layers). At this point, the model hasn’t seen “is located in”, so it doesn’t actually know that you are going to ask for the location. For this reason, it will write more than just the location of the Eiffel Tower into the residual stream. Once it is at the point of predicting the location (at the final token, “in”), the model will aggregate the surrounding context and pull out the location information that was ‘written’ into the residual stream by the MLPs with the most causal effect.
What is stored in the MLP is not the relationship part of the fact. This is clear because the relationship comes after the subject tokens. In other words, as we said before, the MLPs retrieve a bunch of factual knowledge, and then the attention modules pick out the correct (forgive the handwavy description) fact given what was retrieved and the relationship being asked about.
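To make this picture concrete, here’s a minimal sketch of caching activations and looking at what each layer’s MLP adds to the residual stream at the last subject token. I’m assuming the TransformerLens library and GPT-2 XL (one of the models used in ROME); the layer-by-layer norms are only a crude proxy for where things get written, not the paper’s causal tracing.

```python
# Hedged sketch: cache activations and look at what each layer's MLP adds to the
# residual stream at the last subject token. Assumes transformer_lens is installed.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
prompt = "The Eiffel Tower is located in"
tokens = model.to_tokens(prompt)

_, cache = model.run_with_cache(tokens)  # forward pass, caching all activations

# Position of the last subject token ("Tower"); computed from the subject prefix
# so we don't hard-code a token index.
subject_pos = len(model.to_str_tokens("The Eiffel Tower")) - 1

for layer in range(model.cfg.n_layers):
    mlp_out = cache["mlp_out", layer][0, subject_pos]    # what this layer's MLP 'writes'
    resid = cache["resid_post", layer][0, subject_pos]   # residual stream after the layer
    print(f"layer {layer:2d}  mlp_out_norm={mlp_out.norm().item():.1f}  "
          f"resid_norm={resid.norm().item():.1f}")
```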
My guess is that you could probably take what is being ‘written’ into the residual stream and directly predict properties of the subject from the output of the layers with the most causal effect on predicting a fact.
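As a rough illustration of that guess, here’s a toy linear-probe sketch. The layer, the prompt template, and the tiny hand-written (subject, country) set are all my assumptions; a real probe would need many more subjects and a held-out evaluation.

```python
# Toy probe: is a subject property (country) linearly decodable from the residual
# stream at the subject's last token, at an assumed early-middle layer?
import torch
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-xl")
LAYER = 20  # assumed; ROME-style causal tracing would tell you which layers matter most

subjects = ["The Eiffel Tower", "The Colosseum", "Big Ben", "The Brandenburg Gate"]
countries = ["France", "Italy", "UK", "Germany"]  # tiny hand-written label set

def subject_resid(subject: str) -> torch.Tensor:
    """Residual stream at the last subject token, after LAYER."""
    _, cache = model.run_with_cache(model.to_tokens(subject + " is located in"))
    pos = len(model.to_str_tokens(subject)) - 1  # last subject token
    return cache["resid_post", LAYER][0, pos]

X = torch.stack([subject_resid(s) for s in subjects]).detach().cpu().numpy()
probe = LogisticRegression(max_iter=1000).fit(X, countries)
print(probe.predict(X))  # sanity check on the training set itself; not a real evaluation
```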
Thoughts and corrections are welcome.
A couple of notes regarding the Reversal Curse paper.
I’m not sure I emphasized this enough in the post, but part of the point of my post on ROME was that many AI researchers seemed to assume that transformers are not trained in a way that prevents them from understanding that “A is B” implies “B is A”. As I discussed in the comment above:
What is stored in the MLP is not the relationship part of the fact. This is clear because the relationship comes after the subject tokens. In other words, as we said before, the MLPs retrieve a bunch of factual knowledge, and then the attention modules pick out the correct (forgive the handwavy description) fact given what was retrieved and the relationship being asked about.
This means that the A token will ‘write’ some information into the residual stream, while the B token will ‘write’ other information into the residual stream. Some of that information may be the same, but not all. And so, if it’s different enough, the attention heads just won’t be able to pick up on the relevant information needed to know that B is A. However, if you include the A token in the prompt, the necessary information will be added to the residual stream, and it becomes much more likely that the model predicts that B is A (as well as A is B).
From what I remember in the case of ROME, as soon as I added the edited token A to the prompt (or made the next predicted token be A), the model could essentially predict that B is A.
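If you want to check this kind of directionality yourself, something like the crude sketch below works. The prompts, the reverse phrasing, and the log-probability measure are my own choices (not from ROME or the Reversal Curse paper), and for an unedited model both directions may score high; the interesting comparison is before vs. after an edit.

```python
# Rough directionality check: total log-probability of a continuation after a
# "forward" prompt vs. a hand-written "reverse" prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probs of the continuation's tokens given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits[0], dim=-1)
    total = 0.0
    for i, token_id in enumerate(cont_ids[0]):
        pos = prompt_ids.shape[1] - 1 + i  # logits at pos predict the token at pos + 1
        total += logprobs[pos, token_id].item()
    return total

# Forward direction (A is B) vs. an assumed reverse phrasing (B is A).
print(continuation_logprob("The Eiffel Tower is located in", " Paris"))
print(continuation_logprob("The famous tower located in Paris is called", " the Eiffel Tower"))
```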
I describe what this means in the context of ROME below (found here in the post):
So, part of the story here is that the transformer stores the key for one entity (Eiffel Tower) separately from another (Rome). And so you’d need a second edit to say, “the tower in Rome is called the Eiffel Tower.”
Intuitively, as a human, if I told you that the Eiffel Tower is in Rome, you’d immediately be able to understand both of these things at once, while for the ROME method it’s as if they are two separate facts. For this reason, you can’t really equate ROME with how a human would naturally update on a fact. You could maybe imagine ROME more like doing brain surgery on someone to change a fact.
The directional nature of transformers could mean that facts are stored somewhat differently than what we’d infer from our experience with humans. What we see as one fact may be multiple facts for a transformer. Maybe bidirectional models are different. That said, ROME could be seen as brain surgery that might mess things up internally and cause inconsistencies.
Regarding human intuition, @Neel Nanda says (link):
It looks like the model is representing its factual knowledge in a complex/distributed way, and that intervening on just one node does not propagate the change to the rest of the knowledge graph.
Why is this surprising at all then? My guess is that symmetry is intuitive to us, and we’re used to LLMs being capable of surprising and impressive things, so it’s weird to see something seemingly basic missing.
I actually have a bit of an updated (evolving) opinion on this:
Upon further reflection, it’s not obvious to me that humans and decoder-only transformers are that different. Could be that we both store info unidirectionally, but humans only see B->A as obvious because our internal loop is so optimized that we don’t notice the effort it takes.
Like, we just have a better system message than LLMs and that system message makes it super quick to identify relationships. LLMs would probably be fine doing the examples in the paper if you just adjusted their system message a little instead of leaving it essentially blank.
@cfoster0 asks:
How do you imagine the system message helping? If the information is stored hetero-associatively (K → V) like how it is in a hash map, is there a way to recall in the reverse direction (V → K) other than with a giant scan?
My response:
Yeah, I’d have to think about it, but I imagined something like, “Given the prompt, quickly outline related info to help yourself get the correct answer.” You can probably have the model output tokens that quickly surface the useful facts as it is doing the forward pass.
In the context of the paper, now that I think about it, I think it becomes nearly impossible unless you can somehow retrieve the specific relevant tokens used for the training set. Not sure how to prompt those out.
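To make the hash-map framing concrete, here’s a toy illustration (mine, not from the paper or the thread) of why hetero-associative K → V storage gives cheap forward recall but forces either a full scan or a separately stored inverse map for the reverse direction, which is roughly the ‘two separate facts’ situation with ROME edits.

```python
# Toy hetero-associative store: "A is B" is saved only in the forward direction.
facts = {"Eiffel Tower": "Paris"}

# Forward recall (K -> V) is a direct lookup.
print(facts["Eiffel Tower"])  # -> "Paris"

# Reverse recall (V -> K) has no direct route: either scan every entry...
landmark = next(k for k, v in facts.items() if v == "Paris")
print(landmark)  # -> "Eiffel Tower"

# ...or explicitly store the reverse mapping as its own "fact".
inverse = {v: k for k, v in facts.items()}
print(inverse["Paris"])  # -> "Eiffel Tower"
```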
When I updated the model to a new fact using ROME, it wasn’t possible to get the updated fact out unless the edited token was somewhere in the prompt. As soon as it appears in the prompt, the model retrieves the new information from where it was edited.
Diversifying your dataset with reverse prompts so that the model has the correct information in whichever direction you ask for it feels so unsatisfying to me... it feels like something is missing.
As I said, this is a bit of an evolving opinion. Still need time to think about this, especially regarding the differences between decoder-only transformers and humans.
Finally, from @Nora Belrose, this is worth pondering: