Great question! My concrete suggestion is to look for interesting neurons in Neuroscope, as I discuss more in the final post. This is a website I made that shows the text that most activates each neuron in the model (for a ton of open source models), and by looking for interesting neurons, you can hopefully find some hook—find a specific task the model can consistently-ish do, analogous to IOI (with a predictable structure you can generate prompts for, ideally with a somewhat algorithmic flavour—something you could write code to solve). And then do the kinds of things in the IOI notebook. Though note that for a 1L model, you can actually mechanistically look at the weights and break down what the model is doing!
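To make the "look at the weights of a 1L model" point concrete, here's a minimal sketch of what that can look like in TransformerLens. The checkpoint name `gelu-1l` is an assumption (swap in whichever 1-layer model you're actually studying), and reading neuron output weights straight into the unembedding ignores the final LayerNorm scale, so treat it as a rough first pass rather than the definitive method:

```python
import torch
from transformer_lens import HookedTransformer, utils

# Assumed checkpoint name; substitute the 1L model you're studying.
model = HookedTransformer.from_pretrained("gelu-1l")

prompt = "The quick brown fox jumps over the lazy"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Neuron activations after the MLP nonlinearity: [batch, seq_pos, d_mlp]
neuron_acts = cache[utils.get_act_name("post", 0)]

# Which neurons fire most on the final token?
top_acts, top_neurons = neuron_acts[0, -1].topk(10)
print(list(zip(top_neurons.tolist(), top_acts.tolist())))

# In a 1-layer model a neuron's output weights feed straight into the
# unembedding, so you can read off which tokens it boosts directly
# (approximate: this ignores the final LayerNorm scale).
neuron_idx = top_neurons[0].item()
logit_effect = model.W_out[0, neuron_idx] @ model.W_U  # [d_vocab]
print(model.to_str_tokens(logit_effect.topk(10).indices))
```

If a neuron you found in Neuroscope boosts a coherent family of tokens here, that's exactly the kind of hook worth turning into a small, generatable task.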
On a meta level, the strategy you want to follow in a situation like this is what I call maximising surface area. You want to explore things and try to get exposed to as many random details about the model behaviour as you can, so that you can then serendipitously notice something interesting and dig into it. The meta-lesson is that when you feel stuck and meandering, you want to pick some purpose to strive for, but that purpose can just be "put yourself in a situation where you have so much data and context that you can spontaneously stumble across something interesting, and cast a really wide net". Concretely, you want to look for some kind of task/capability that the model is capable of, so you can then try to reverse-engineer it. And a good way to do this is just to run the model on a bunch of dataset examples and look at what it's good at, and see if you can find any consistent patterns to dig into. To better explore this, I made a tool to visualise the top 10 tokens predicted for each token in the text in Alan Cooney's CircuitsVis library. You can filter for interesting text by eg looking for tokens where the model's log prob for the correct next token is significantly higher than attn-only-1l's, to cut things down to where the MLPs matter (I'd cut off the log prob at −6 though, so you don't just notice when attn-only-1l is really incorrect lol).
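Here's a rough sketch of that filtering idea: per-token log probs for the correct next token under both models, with both clamped at −6 before taking the difference so the gap doesn't just flag places where attn-only-1l is catastrophically wrong. The checkpoint names and the exact clamping choice are assumptions, and this assumes both models share a tokenizer:

```python
import torch
from transformer_lens import HookedTransformer

# Assumed checkpoint names; substitute the pair you're actually comparing.
# This also assumes both models use the same tokenizer.
mlp_model = HookedTransformer.from_pretrained("gelu-1l")
attn_model = HookedTransformer.from_pretrained("attn-only-1l")

def per_token_log_probs(model, tokens):
    """Log prob assigned to the correct next token at each position."""
    logits = model(tokens)  # [batch, seq, d_vocab]
    log_probs = logits.log_softmax(dim=-1)
    # Position i predicts token i+1
    return log_probs[0, :-1].gather(-1, tokens[0, 1:, None]).squeeze(-1)

text = "for i in range(10):\n    print(i)"  # any dataset example
tokens = mlp_model.to_tokens(text)

lp_mlp = per_token_log_probs(mlp_model, tokens)
lp_attn = per_token_log_probs(attn_model, tokens)

# Clamp at -6 so the gap isn't dominated by attn-only-1l being wildly wrong.
gap = lp_mlp.clamp(min=-6) - lp_attn.clamp(min=-6)

str_tokens = mlp_model.to_str_tokens(text)[1:]  # the tokens being predicted
for tok, g in sorted(zip(str_tokens, gap.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{tok!r}: MLP model beats attn-only by {g:.2f} nats")
```

Run this over a decent chunk of dataset text and the tokens with the biggest gap are a natural shortlist of places where the MLP layer is doing something the attention-only model can't.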
Thank you for this response!