We don’t believe that all knowledge and computation in a trained neural network emerges in phase transitions, but our working hypothesis is that enough emerges this way to make phase transitions a valid organizing principle for interpretability.
I think this undersells the case for focusing on phase transitions.
Hand-wavy version of a stronger case: within a phase (i.e., when there's no phase change), things change continuously and slowly. Anyone watching from outside can see what's going on, with plenty of heads-up and plenty of opportunity to extrapolate where behavior is headed. That makes safety problems a lot easier. Phase transitions are exactly the points where that breaks down: changes are sudden, and extrapolation fails fast. So phase transitions are exactly the points which are strategically crucial to detect, for safety purposes.
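To make the "extrapolation fails at transitions" point concrete, here's a minimal toy sketch of a detector built on exactly that idea: fit a linear trend to a trailing window of some training metric, and flag steps where the metric jumps far outside what the trend predicts. Everything here (the function name `flag_phase_transitions`, the `window` and `z_thresh` values) is illustrative, not a method from the post.

```python
import numpy as np

def flag_phase_transitions(metric, window=50, z_thresh=6.0):
    """Flag steps where a tracked metric (e.g. validation loss) deviates
    sharply from a linear extrapolation of its recent history.

    Within a "phase" the metric drifts smoothly, so extrapolation holds;
    a phase transition shows up as a large extrapolation residual.
    `window` and `z_thresh` are illustrative knobs, not canonical values.
    """
    metric = np.asarray(metric, dtype=float)
    flags = []
    steps = np.arange(window)
    for t in range(window, len(metric)):
        hist = metric[t - window:t]
        # Linear fit to the trailing window: the "within-phase" model.
        slope, intercept = np.polyfit(steps, hist, 1)
        predicted = slope * window + intercept
        resid_std = np.std(hist - (slope * steps + intercept)) + 1e-12
        # Large standardized residual = extrapolation just failed.
        if abs(metric[t] - predicted) / resid_std > z_thresh:
            flags.append(t)
    return flags

# Toy example: smooth decay with an abrupt drop at step 300 (a "transition").
loss = np.concatenate([2.0 - 0.001 * np.arange(300),
                       0.5 - 0.0005 * np.arange(200)])
print(flag_phase_transitions(loss))  # flags steps near 300
```

In-phase drift, however steep, stays unflagged because the linear model tracks it; only a break in the trend trips the threshold, which is the asymmetry the argument above turns on.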