[Note] On SAE Feature Geometry
SAE feature directions are likely “special” rather than “random”.
Different SAEs seem to converge to learning the same features
SAE error directions increase model loss far more than random directions do, indicating that the error directions are “special”; this in turn suggests that the feature directions are also “special”.
Conversely, SAE feature directions increase model loss much less than random directions do.
Re: the last point above: this suggests that singular learning theory (SLT) could be an effective tool for analysis.
Reminder: The LLC (local learning coefficient) measures the “local flatness” of the loss basin. A lower LLC = flatter (more degenerate) loss, i.e. changing the model’s parameters by a small amount does not increase the loss by much.
In preliminary work on LLC analysis of SAE features, the “feature-targeted LLC” turns out to be measurable empirically, and it distinguishes SAE features from random directions.
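To make the LLC reminder concrete, here is a toy sketch (not the feature-targeted LLC itself, whose definition is not given in these notes). It assumes the commonly used estimator nβ(E[L(w)] − L(w*)) with β = 1/log n, where the expectation is over the tempered posterior exp(−nβL(w)) around the minimum w*. In practice that expectation is sampled with SGLD; here it is computed by brute-force integration on a 1D grid so the result is deterministic. The function name `llc_estimate` and all parameters are hypothetical, chosen for illustration only.

```python
import numpy as np

def llc_estimate(loss_fn, n=1000, beta=None, half_width=2.0, num_points=200_001):
    """Toy LLC estimate at w* = 0 for a 1D loss: n * beta * (E[L(w)] - L(w*)).

    Assumption: the minimum is at w* = 0 and the tempered posterior has
    negligible mass outside [-half_width, half_width].
    """
    if beta is None:
        beta = 1.0 / np.log(n)  # common default inverse temperature
    w = np.linspace(-half_width, half_width, num_points)
    losses = loss_fn(w)
    # Unnormalised tempered posterior exp(-n * beta * L(w)) on a uniform grid.
    log_weights = -n * beta * losses
    weights = np.exp(log_weights - log_weights.max())
    # Posterior expectation of the loss (grid spacing cancels in the ratio).
    expected_loss = np.sum(weights * losses) / np.sum(weights)
    return n * beta * (expected_loss - loss_fn(np.array(0.0)))

quadratic = llc_estimate(lambda w: w**2)  # regular minimum, approximately 0.5
quartic = llc_estimate(lambda w: w**4)    # flatter minimum, approximately 0.25

print(f"LLC estimate for w^2: {quadratic:.3f}")
print(f"LLC estimate for w^4: {quartic:.3f}")
```

The flatter basin (w^4) gets the lower value, matching the direction of the reminder: a small parameter change costs less loss there than in the quadratic basin.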