Huh, I am surprised models fail this badly. That said, I think
We argue that there are certain properties of language that our current large language models (LLMs) don’t learn.
is too strong a claim based on your experiments. For instance, these models definitely have representations for uppercase letters:
In my own experiments I have found it hard to get models to answer multiple-choice questions. It seems like there may be a disconnect between what a model has in fact learned and what prompting can elicit from it.
Here is the code to reproduce the plot if you want to look at some of your other tasks:
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
model = HookedTransformer.from_pretrained(
    model_name,
    device='cpu',
    torch_dtype=torch.float32,
    fold_ln=True,
    center_writing_weights=False,
    center_unembed=True,
)

def is_cap(s):
    # Strip the SentencePiece leading-space marker; may need to change for other tokenizers.
    s = s.replace('▁', '')
    return len(s) > 0 and s[0].isupper()

# Map token ids back to strings and label each vocab entry as upper- or lowercase.
decoded_vocab = pd.Series({v: k for k, v in model.tokenizer.vocab.items()}).sort_index()
uppercase_labels = np.array([is_cap(s) for s in decoded_vocab.values])

# Difference of mean embeddings gives an "uppercase direction"; project every token onto it.
W_E = model.W_E.detach().numpy()
uppercase_dir = W_E[uppercase_labels].mean(axis=0) - W_E[~uppercase_labels].mean(axis=0)
uppercase_proj = W_E @ uppercase_dir

uc_range = (uppercase_proj.min(), uppercase_proj.max())
plt.hist(uppercase_proj[uppercase_labels], bins=100, alpha=0.5, label='uppercase', range=uc_range);
plt.hist(uppercase_proj[~uppercase_labels], bins=100, alpha=0.5, label='lowercase', range=uc_range);
plt.legend(title='token')
plt.ylabel('vocab count')
plt.xlabel('projection onto uppercase direction of W_E Mistral-7b')
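On the multiple-choice disconnect I mentioned above: one thing that may help separate "has the information" from "can emit the answer in the requested format" is scoring the answer options directly by log-probability instead of parsing generated text. Here is a minimal sketch of what I mean (the prompt format and the assumption that " A", " B", ... are single tokens are mine, not from the paper):

import torch

def score_choices(model, question, choices):
    # Score each answer letter by the log-probability the model assigns to it
    # right after "Answer:", rather than reading free-form generations.
    letters = "ABCD"[:len(choices)]
    prompt = question + "\n" + "\n".join(
        f"{l}. {c}" for l, c in zip(letters, choices)
    ) + "\nAnswer:"
    with torch.no_grad():
        logits = model(model.to_tokens(prompt))
    log_probs = logits[0, -1].log_softmax(dim=-1)
    # to_single_token raises if " A" etc. is not a single token for this tokenizer.
    return {l: log_probs[model.to_single_token(" " + l)].item() for l in letters}

# e.g. score_choices(model, "Which option is written in uppercase?", ["apple", "APPLE"])

If scoring the options this way still tracks chance on H-Test-style questions, that would be some evidence the problem is not just answer formatting.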
What’s the difference between “having a representation” for uppercase/lowercase and using that representation to solve an MCQ or A/B test? From your investigations, do you have intuitions as to what the mechanism of the disconnect might be? I’m interested in seeing what might cause these models to perform poorly despite having representations that seem relevant, at least to us humans, for solving the task.
Considering that the tokenizer for Mistral-7B probably uses a case-sensitive vocabulary (https://discuss.huggingface.co/t/case-sensitivity-in-mistralai-mistral-7b-v0-1/70031), the presence of distinct representations for uppercase and lowercase characters might not be as relevant to the task for the model as one would assume. It seems plausible that these representations do not substantially influence the model’s ability to perform H-Test tasks, such as answering the multiple-choice questions. Perhaps one should probe for a different representation, such as a circuit for “eliciting information”.
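One rough way to make the "has a representation vs. uses it" distinction concrete would be to probe the residual stream at a later layer for the uppercase feature on the same inputs used behaviorally, and compare probe accuracy to task accuracy. A minimal sketch, where the layer choice, the use of the last-token residual stream, and the logistic-regression probe are all my own assumptions:

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def probe_uppercase(model, words, labels, layer=16):
    # Cache the residual stream at the final token of each word and fit a
    # linear probe for "starts with an uppercase letter".
    acts = []
    for w in words:
        with torch.no_grad():
            _, cache = model.run_with_cache(model.to_tokens(w))
        acts.append(cache["resid_post", layer][0, -1].numpy())
    X, y = np.stack(acts), np.array(labels)
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return probe.score(X, y)  # in-sample accuracy; use a held-out split in practice

If a probe like this is near-perfect while the model's own A/B answers sit near chance, that would point at the disconnect being in eliciting the information rather than in representing it.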
I also want to point you to this (https://arxiv.org/abs/2402.11349, Appendix I, Figure 7, Last Page, “Blueberry?: From Reddit u/AwkwardIllustrator47, r/mkbhd: Was listening to the podcast. Can anyone explain why Chat GPT doesn’t know if R is in the word Blueberry?”). Failures of large models on these task types have been widely observed, but without empirical investigation.
The really cool ones are where the BPE errors compound and lead to strange, unpredictable downstream bugs. I’ve talked a lot about how I think ChatGPT’s rhyming poetry is an example of this, but here’s a simpler one: a famous GPT problem last year was about African countries whose name starts with ‘K’: https://news.ycombinator.com/item?id=37145312 GPT-3 made a spelling error of the usual BPE sort, which then got quoted in a blog spam & HN comment, which then (perhaps due to lack of explicit statement that the quotes were false) got read by search engine bots* and provided as snippet-answers to search queries! One could imagine this going further: extracted snippets would be useful for automatically creating Q&A datasets...
And then somewhere downstream, a human is reading the false statement ‘While there are 54 recognized countries in Africa, none of them begin with the letter “K”. The closest is Kenya, which starts with a “K” sound, but is actually spelled with a “K” sound.’† being generated by an AI model—possibly one that is character-tokenized at this point & so the error would appear to provably have nothing to do with BPEs—and is baffled at where this delusion came from.
* Google has discussed the processing of web page text (most detailed recent writeup) and it’s usually a small BERT-like model, so that means WordPiece, IIRC, and so it’s also unable to ‘see’ the error even if it were smart enough to.
† This still replicates if you ask ChatGPT for a list of African countries which start with ‘K’. If you reroll, it makes a remarkable number of different errors and confabulations. I particularly like the one where it just prints ‘1. Kenya’ repeatedly—when you go on vacation to Kenya, be sure you go to Kenya, and not one of the other Kenyas, god have mercy on you if you land in Kenya, or worse yet, Kenya!
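As a concrete illustration of the BPE blindness being discussed (my own quick check, not something from the thread): the tokenizer hands the model whole subword pieces, so character-level questions have to be answered from memorized associations rather than by inspecting letters.

from transformers import AutoTokenizer

# Using the same Mistral tokenizer as above; any subword tokenizer shows a similar effect.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
for word in ["Kenya", "Blueberry", "strawberry"]:
    print(word, "->", tok.tokenize(word))
# Each word typically comes back as one or two subword pieces, so the model
# never 'sees' the individual letters it is being asked about.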
I appreciate this analysis. I’ll take more time to look into this and then get back to write a better reply.