Cool work!
There are 2 findings that I found surprising and that I’d be interested in seeing explored through other methods:
1. Some LLMs compute the last digit in addition mostly in base 10 (even when all the numbers are single tokens?)
2. LLMs trained not to hallucinate sometimes decide whether or not to give an answer based on the familiarity of the entity, rather than based on the familiarity of the actual answer (they don’t first try to recover the answer and then refuse to answer if they only recall a low-confidence guess?)
The second one may imply that LLMs are less able to reason about what they are about to say than I thought.
I also find it cool that you measured how good the explanations for your new features are. I find it slightly concerning how bad the numbers are. In particular, I would have expected a sort eval error much below 2% (which is the sort eval error you would get if you perfectly assigned each dataset example to one of 5 balanced categories of features [Edit: My math was wrong. 2% is what you get with 25 categories]), but you find a sort eval error around 10%. Some of that is probably Claude being dumb, but I guess you would also struggle to get below 2% with human labels?
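For reference, here is the back-of-the-envelope calculation behind those numbers as a quick simulation. This is my own sketch, assuming the sort eval asks which of two dataset examples activates the feature more strongly, that the categories are balanced and correctly ordered, and that within-category ties are resolved by a coin flip, so the error is roughly 1/(2K) for K categories:

```python
import random

def sort_eval_error(n_examples=10_000, n_categories=5, n_pairs=100_000):
    """Pairwise sort-eval error if each example is *perfectly* assigned to one
    of n_categories equally sized, correctly ordered activation buckets, and
    ties within a bucket are broken by a coin flip."""
    # True activation strengths (any continuous distribution works here).
    acts = [random.random() for _ in range(n_examples)]
    # Perfect bucketing: the category is just the quantile of the true activation.
    ranked = sorted(range(n_examples), key=lambda i: acts[i])
    category = {idx: (pos * n_categories) // n_examples
                for pos, idx in enumerate(ranked)}

    errors = 0
    for _ in range(n_pairs):
        i, j = random.sample(range(n_examples), 2)
        if category[i] == category[j]:
            errors += random.random() < 0.5  # tie: 50% chance of guessing wrong
        # Pairs from different categories always sort correctly, so no error.
    return errors / n_pairs

for k in (5, 25):
    print(k, sort_eval_error(n_categories=k))  # ~0.10 for k=5, ~0.02 for k=25
```

Under these assumptions, 5 categories gives ~10% and 25 categories gives ~2%, which is where the corrected number in the edit above comes from.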
But I also see how very predictive feature explanations might not be necessary. I am looking forward to seeing how Circuit Tracing performs in cases where there are more sources of external validation (e.g. hard auditing games)!
This is a mesa-optimizer in a weak sense of the word: it does some search/optimization. I think the model studied in the paper is weakly mesa-optimizing, maybe more so than base models generating random pieces of sports news, and maybe roughly as much as a model trying to follow weird and detailed instructions, except that here it follows memorized “instructions” rather than in-context ones.