And the top-right vector also transfers across mazes? Why isn’t it maze-specific?
This makes a lot of sense if the top-right vector is being used to do something like “choose between circuits” or “decide how to weight various heuristics” instead of (or in addition to) actually computing any heuristic itself. There is an interesting question of how capable the model architecture is of doing things like that, which may be worth thinking about.[1]
This could be either the type of thinking that looks like “try to find examples of this in the model by intelligently throwing in illuminating inputs” or the type that looks like “try to hand-write some parameters that implement ‘two subcircuits with a third circuit assigning the relative weighting between the two’, starting with smaller (but architecturally representative) toy models.”
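As a rough illustration of the second kind of thinking, here is a minimal hand-written toy. Everything in it is hypothetical and is not the maze model's architecture: two linear "heuristic" subcircuits (`W_A`, `W_B`) produce action logits, and a third gating circuit (`W_GATE`, `B_GATE`) reads a single input feature and outputs a sigmoid weight that mixes the two. Adding a hypothetical steering offset to the gate's input feature flips the chosen action without changing either subcircuit, which is the "choose between circuits" behavior described above.

```python
import math

# Hypothetical toy policy: 4 actions, 3 input features.
ACTIONS = ["up", "down", "left", "right"]


def matvec(W, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Subcircuit A (hypothetical heuristic): favors 'up', and 'right' less, scaled by feature 0.
W_A = [[1.0, 0.0, 0.0],   # up
       [0.0, 0.0, 0.0],   # down
       [0.0, 0.0, 0.0],   # left
       [0.5, 0.0, 0.0]]   # right

# Subcircuit B (hypothetical heuristic): favors 'down', scaled by feature 1.
W_B = [[0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]

# Gating circuit: reads only feature 2 and outputs a weight in (0, 1).
W_GATE = [0.0, 0.0, 4.0]
B_GATE = 0.0


def gate(x):
    """Weight given to subcircuit A (1 - gate goes to subcircuit B)."""
    return sigmoid(sum(w * xi for w, xi in zip(W_GATE, x)) + B_GATE)


def policy_logits(x):
    """Mix the two subcircuits' logits using the gate's weight."""
    g = gate(x)
    a = matvec(W_A, x)
    b = matvec(W_B, x)
    return [g * ai + (1.0 - g) * bi for ai, bi in zip(a, b)]


def best_action(x):
    logits = policy_logits(x)
    return ACTIONS[max(range(len(logits)), key=logits.__getitem__)]


if __name__ == "__main__":
    x = [1.0, 1.5, 0.0]
    steer = [0.0, 0.0, 2.0]  # hypothetical steering offset on the gate's feature
    steered = [xi + si for xi, si in zip(x, steer)]
    print(best_action(x), best_action(steered))
```

The explicit product `g * ai` is the design choice doing the work here; whether the real model's architecture can implement (or approximate) that kind of multiplicative gating is the open question raised above.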
I’m concerned that this type of thinking would be overly specific to the model architecture you happen to be using, and so might not help us learn about the more general phenomena of shards/values/etc., but it might be useful nonetheless if you’re planning on studying these models at length.