The training path is always continuous, so it necessarily interpolates smoothly between some overfit memorization and the generalizing (nonmodular) circuit solution. But that shouldn't be too surprising: a big circuit can always be recursively decomposed into smaller elementary pieces, and each elementary circuit is logically equivalent not to a single unique lookup table, but to an infinite set of overparameterized, redundant, equivalent lookup tables.
So training just has to find one of the many redundant lookup-table (memorization) solutions first, then smoothly remove the redundancy from those lookup tables. The phase transitions likely arise from semi-combinatoric dependencies between layers (and those probably become more pronounced with increasing depth and complexity).
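As a minimal sketch of the redundancy half of this argument (using a hypothetical toy XOR circuit in NumPy, not anything from an actual grokking setup): duplicating a hidden unit and splitting its outgoing weight gives a continuous one-parameter family of exactly equivalent networks, so an overparameterized, redundant realization can be collapsed toward a leaner one along a smooth path in parameter space without the computed function ever changing.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def net(x, W1, b1, w2):
    # One-hidden-layer ReLU circuit: relu(x @ W1 + b1) . w2
    return relu(x @ W1 + b1) @ w2

# Elementary "circuit": XOR on {0,1}^2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor = np.array([0.0, 1.0, 1.0, 0.0])

# A compact 2-hidden-unit solution.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])
assert np.allclose(net(X, W1, b1, w2), xor)

def redundant_params(lam):
    # Duplicate every hidden unit and split its outgoing weight lam / (1 - lam).
    # For every lam this computes exactly the same function, so lam traces a
    # continuous path of equivalent, overparameterized networks; lam -> 0
    # smoothly removes the redundancy (one copy's outgoing weight vanishes).
    W1r = np.concatenate([W1, W1], axis=1)
    b1r = np.concatenate([b1, b1])
    w2r = np.concatenate([lam * w2, (1.0 - lam) * w2])
    return W1r, b1r, w2r

for lam in np.linspace(0.0, 1.0, 11):
    W1r, b1r, w2r = redundant_params(lam)
    assert np.allclose(net(X, W1r, b1r, w2r), xor)  # function unchanged along the path
```

This only shows that such function-preserving paths exist in parameter space; whether training actually follows one is the empirical question.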