Perhaps a smaller model would be forced to learn features in superposition—or even nontrivial algorithms—in a way more representative of a typical LLM.
I’m unsure whether it’s possible at all to express a “more algorithmic” solution to the problem in a transformer of OthelloGPT’s approximate size/shape (much less a smaller transformer). I asked myself “what kinds of systematic/algorithmic calculations could I ‘program into’ OthelloGPT if I were writing the weights myself?”, and I wasn’t able to come up with much, though I didn’t think very hard about it.
Curious if you have thoughts on this?
Roughly, this is hard because, for any given internal representation that a transformer computes, it can’t access that representation at earlier positions when computing the same representation at later ones. For example, if there’s some layer L at which the board state is basically “finalized,” OthelloGPT can’t use these finalized representations of past states to compute the current state. (By the time it is able to look back at past finalized states, it’s too late: the current state is already finalized.)
So, when we use info about the past to compute the current board state, this info can’t just be the board state, it has to be “only partway computed” in some sense. This rules out some things that feel intuitive and “algorithmic,” like having a modular component that is fully (and only) responsible for deciding which pieces get flipped by a given move. (In order to know which pieces get flipped now, you need to know which pieces are mine/yours, and that depends on which ones got flipped in the past. So a “flip calculator” would need to use its own past outputs as current inputs, and transformers can’t be recurrent like this.)
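To make that constraint concrete, here’s a minimal sketch of the information flow in a decoder-only transformer (the shapes and the block internals are placeholder assumptions, not OthelloGPT’s actual architecture): each layer only reads the previous layer’s activations at positions up to the current one, so nothing like a “flip calculator” can ever consume its own earlier outputs.

```python
import numpy as np

# Minimal sketch of decoder-only information flow (hypothetical shapes, not
# OthelloGPT's real config). Structural point: layer L at position t reads
# layer L-1 at positions <= t; it never sees layer L's own outputs at
# earlier positions.

T, D = 8, 16          # sequence length, residual-stream width
N_LAYERS = 4

def block(resid_prev):
    """Stand-in for one transformer block (attention/MLP details omitted)."""
    causal_mask = np.tril(np.ones((T, T)))                        # position t attends to <= t
    mixed = (causal_mask / causal_mask.sum(1, keepdims=True)) @ resid_prev
    return np.tanh(mixed)                                         # placeholder nonlinearity

resid = np.random.randn(T, D)      # layer-0 residual stream (embeddings)
for layer in range(N_LAYERS):
    # The only input available here is the previous layer's residual stream.
    # Feeding this layer's output at earlier positions back in when computing
    # the current position would require recurrence, which this structure forbids.
    resid = resid + block(resid)
```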
It looks like for OthelloGPT, the relevant sense of “only partway computed” is “probably true based on heuristics, but not necessarily true.” I guess one can imagine other ways of organizing this computation inside a transformer, like computing the state of the upper half of the board first, then the lower half (or the upper-left corner first, or whatever). That seems like a bad idea, since it would leave capacity unused whenever the game is confined to only a subset of these regions. Maybe there are other alternatives I haven’t thought of? In any case, it’s not obvious to me that there’s a feasible “more algorithmic” approach here.
I am also unsure how feasible it is to solve the problem using rules that generalize across positions. More precisely, I’m not sure if such an approach would really be simpler than what OthelloGPT does.
Either way, it has to (in some sense) iterate over all positions and check separately whether each one is legal. In practice it does this with distinct neurons/circuits mapped to distinct absolute positions—but if it instead used a computation like “check for a certain pattern in relative-position space,” it would still need to repeat that same check many times within the same forward pass, to handle all the copies of the pattern that might appear at various places on the board. And it’s not obvious that (some number of distinct rules) × (some other number of potential applications per rule) is cheaper than just repeating the full conditions that license a move once for each absolute position on the board.
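To put rough numbers on that trade-off, here’s a back-of-envelope sketch; the counts are illustrative assumptions rather than anything measured from OthelloGPT, since the real cost depends on exactly how you factor the rules, which is the point.

```python
# Checks needed per forward pass under the two schemes above
# (illustrative numbers, not measurements of OthelloGPT).

BOARD_SQUARES = 64      # 8x8 board
DIRECTIONS = 8          # rays along which a move can flip pieces

# Scheme A: one legality condition hard-wired per absolute square.  Each
# square's circuit can specialize (edge squares have fewer rays, etc.).
checks_absolute = BOARD_SQUARES                    # 64 position-specific checks

# Scheme B: a single relative-position rule ("run of opponent pieces along a
# ray, capped by one of my pieces"), reused everywhere.  The rule is simpler,
# but it must be instantiated once per (square, direction) within the same
# forward pass, because a transformer layer can't loop.
checks_relative = BOARD_SQUARES * DIRECTIONS       # 512 instantiations of the rule

print(checks_absolute, checks_relative)            # 64 vs 512
```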
I think it’s pretty plausible that this is true, and that OthelloGPT is already doing something that’s somewhat close to optimal within the constraints of its architecture. I have also spent time thinking about the optimal algorithm for next-move prediction under those constraints, and “a bag of heuristics that promote/suppress information, with attention used to aggregate information across moves” seems like a very reasonable approach.
Seems like the natural next step would be to investigate grokking, as this appears analogous: you have a model which has memorized or learned a grab-bag of heuristics & regularities, but as far as you can tell, the algorithmic core is eluding the model despite what seems like ample parameterization & data, perhaps because it is a wide shallow model. So one could try training a skinny net, and maybe aggressively subsample the training data down to a maximally diverse subset. If it groks, then one should be able to read off much more interpretable algorithmic sub-models.
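A rough sketch of what that experiment could look like (the subsampling heuristic, the “skinny” shape, and all names below are hypothetical illustrations, not anything from the OthelloGPT work):

```python
import random

def diverse_subsample(games, k, signature=lambda g: tuple(g[:10])):
    """Greedy diversity filter: keep at most one game per opening signature
    until k games are collected.  `signature` is a placeholder for whatever
    notion of diversity you actually care about."""
    games = list(games)            # don't mutate the caller's list
    random.shuffle(games)
    seen, kept = set(), []
    for g in games:
        s = signature(g)
        if s not in seen:
            seen.add(s)
            kept.append(g)
        if len(kept) == k:
            break
    return kept

# Hypothetical shapes: a wide/shallow config in the ballpark of published
# OthelloGPT vs. a skinnier, deeper (and smaller) variant, trained for a
# long horizon with weight decay while watching held-out legal-move
# accuracy for a late generalization jump.
wide_shallow = dict(n_layers=8,  d_model=512, n_heads=8)
skinny       = dict(n_layers=16, d_model=128, n_heads=4)
```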
Interesting stuff!