Interesting post! I’m pretty curious about these.
A great resource for answering these questions is a set of model runs put out by the Stanford Center for Research on Foundation Models: they trained five runs each of GPT-2 small and GPT-2 medium, with 600 checkpoints per run and a different random seed for each, and released the weights. It seems like a good way to get some surface area on these questions with interesting real models. A few ideas that are somewhere on my maybe/someday research ideas list:
For each pair of models, feed in a bunch of text, record the log prob each model assigns to the correct next token, and make a scatter plot of the two sets of per-token log probs (see the sketch after this list). Does it look highly correlated? Poke at any outliers and see if there are any consistent patterns of things one model can do and the other cannot
Repeat this for a checkpoint halfway through training. If you find capabilities in one model and not in another, have they converged by the end of training?
Look at the PCA of these per-token losses across, say, 1M tokens of text, and see if you can find anything interesting about the components
Evaluate the models for a bunch of behaviours: ability to use punctuation correctly, to match open and close parentheses, to follow patterns in the syntax and structure of the data (capital letters at the start of a sentence, email addresses having an @ and a .com in them, continuing text in another language in that language, etc), and more specific behaviours like the ability to memorise particular phrases, complete acronyms, use induction-like behaviour, basic factual knowledge about the world, etc
The medium models will have more interesting + sophisticated behaviour, and are probably a better place to look for specific circuits
Look at the per-token losses for some text over training (esp for tokens with significant deviation between final models) and see whether it looks smooth or S-shaped—S-shaped would suggest higher path dependence to me
Look for induction head phase changes in each model during training, and compare when they happen.
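Below is a minimal sketch of the per-token log-prob comparison from the first idea above, written against plain HuggingFace transformers rather than any particular interpretability library. The model names are just stand-ins for two seeds or checkpoints of the same training run (the actual Stanford CRFM checkpoint names on the Hub will differ); swap in whichever two you want to compare.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Placeholder model names: stand-ins for two seeds/checkpoints of the same
# training run (e.g. two of the Stanford CRFM GPT-2 small seeds).
MODEL_A = "gpt2"
MODEL_B = "distilgpt2"

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog. " * 20
tokens = tokenizer(text, return_tensors="pt")["input_ids"]  # [1, seq]


def per_token_log_probs(model_name: str, tokens: torch.Tensor) -> torch.Tensor:
    """Log prob the model assigns to the actual next token at each position."""
    model = GPT2LMHeadModel.from_pretrained(model_name).eval()
    with torch.no_grad():
        logits = model(tokens).logits  # [1, seq, vocab]
    log_probs = logits.log_softmax(dim=-1)
    # The prediction at position i is for the token at position i + 1.
    return log_probs[0, :-1].gather(1, tokens[0, 1:, None]).squeeze(-1)


lp_a = per_token_log_probs(MODEL_A, tokens)
lp_b = per_token_log_probs(MODEL_B, tokens)

# Correlation of the two sets of per-token log probs; scatter-plot lp_a
# against lp_b to eyeball outliers (tokens one model predicts much better).
corr = torch.corrcoef(torch.stack([lp_a, lp_b]))[0, 1]
print(f"Pearson correlation of per-token log probs: {corr.item():.3f}")

# The five tokens where the two models disagree most.
for i in (lp_a - lp_b).abs().topk(5).indices.tolist():
    tok = tokenizer.decode(tokens[0, i + 1])
    print(f"{tok!r}: model A {lp_a[i].item():.2f}, model B {lp_b[i].item():.2f}")
```

The same function, run over a range of checkpoints instead of a range of seeds, gives the per-token loss curves over training mentioned above.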
I’m currently writing a library for mechanistic interpretability of LLMs, with support for loading these models + their checkpoints. If anyone might be interested in working on this, I'm happy to share ideas. This is a small subset of OpenWebText that seems useful for testing.
Unrelatedly, a mark against path dependence is the induction head bump result, where we found that models undergo a phase change in which they suddenly form induction heads, and that across a range of model sizes and architectures this happens consistently and at around the same point in training (though not every architecture was tested). Anecdotally, though, I’ve found that the time of formation is very sensitive to the exact positional embeddings used.
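To make the induction head measurements concrete, here is one rough way to score heads for induction-like behaviour, again with plain HuggingFace transformers and with "gpt2" as a placeholder for whichever seed/checkpoint you load. This is not the exact methodology from the induction heads work, just the standard trick of repeating a random token sequence and measuring how much attention each head pays to the token right after the previous occurrence of the current token. Running it over a sweep of checkpoints and plotting the top score over heads is one way to look for the phase change.

```python
import torch
from transformers import GPT2LMHeadModel

# Placeholder: swap in whichever seed/checkpoint you want to score.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
vocab_size = model.config.vocab_size

L = 50  # length of the random prefix
prefix = torch.randint(0, vocab_size, (1, L))
tokens = torch.cat([prefix, prefix], dim=1)  # the prefix, repeated once

with torch.no_grad():
    # attentions: tuple over layers of [batch, n_heads, seq, seq]
    attentions = model(tokens, output_attentions=True).attentions

scores = []
for layer, attn in enumerate(attentions):
    # For queries in the second copy of the sequence, the "induction" key
    # position is the matching position in the first copy, offset by +1
    # (the token that followed the previous occurrence of the current token).
    query_pos = torch.arange(L, 2 * L)
    key_pos = query_pos - L + 1
    induction_attn = attn[0, :, query_pos, key_pos]  # [n_heads, L]
    per_head = induction_attn.mean(dim=-1)  # average over query positions
    for head, score in enumerate(per_head):
        scores.append((score.item(), layer, head))

# Heads with high scores here behave like induction heads on this prompt.
for score, layer, head in sorted(scores, reverse=True)[:5]:
    print(f"layer {layer} head {head}: induction score {score:.3f}")
```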
This seems quite similar to the experiments done in this paper.