Fabien Roger comments on How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme

Fabien Roger 21 Jan 2023 18:19 UTC
2 points
Here I’m using “feature” only in its simplest sense: a direction in activation space. A “truth-like feature” then just means “a direction in activation space with low CCS loss”, which is exactly what CCS enables you to find. The example above shows that there can be exponentially many such directions. Therefore, the theorems above do not apply.
Requiring the directions found by CCS to be “actual features” (i.e. to satisfy the conditions of those theorems) might enable you to improve CCS. But I don’t know what those conditions are.
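
To make “a direction in activation space with low CCS loss” concrete, here is a minimal sketch. It assumes the usual CCS objective (a consistency term plus a confidence term on a linear probe with a sigmoid), skips the activation normalization step, and uses toy synthetic activations rather than real model activations. The names `ccs_loss`, `acts_pos` and `acts_neg` are made up for illustration.

```python
import numpy as np

def ccs_loss(direction, bias, acts_pos, acts_neg):
    """Mean CCS loss of the probe p(x) = sigmoid(x . direction + bias).

    acts_pos / acts_neg: (n, d) activations for the "yes" / "no" version
    of each contrast pair (hypothetical inputs for this sketch).
    """
    p_pos = 1.0 / (1.0 + np.exp(-(acts_pos @ direction + bias)))
    p_neg = 1.0 / (1.0 + np.exp(-(acts_neg @ direction + bias)))
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = np.minimum(p_pos, p_neg) ** 2   # penalize p(x+) = p(x-) = 0.5
    return float(np.mean(consistency + confidence))

# Toy data (an assumption, not real activations): coordinate 0 carries a
# +/-5 "truth" signal that flips sign between the two halves of each pair;
# every other coordinate is small Gaussian noise.
rng = np.random.default_rng(0)
n, d = 200, 16
labels = rng.integers(0, 2, size=n) * 2 - 1      # +1 or -1 per pair
signal = np.zeros((n, d))
signal[:, 0] = 5.0 * labels
acts_pos = signal + 0.1 * rng.normal(size=(n, d))
acts_neg = -signal + 0.1 * rng.normal(size=(n, d))

truth_dir = np.eye(d)[0]   # direction aligned with the planted signal
other_dir = np.eye(d)[1]   # direction that only sees noise

loss_truth = ccs_loss(truth_dir, 0.0, acts_pos, acts_neg)
loss_other = ccs_loss(other_dir, 0.0, acts_pos, acts_neg)
print(loss_truth, loss_other)
```

In this toy setup only one direction gets a low loss. My point above is that in other datasets many different directions can all pass this same check, and the loss alone does not tell you which of them (if any) is an “actual feature”.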