I’m not surprised that the features aren’t 100% clean, because this is, after all, a preliminary research prototype of a small approximation of a medium-sized version of a still-sub-AGI LLM.
It’s more like a limitation of the paradigm, imo. If the “most golden gate” direction in activation-space and the “most SF fog” direction have high cosine similarity, there isn’t a way to increase the activation along one without also increasing it along the other. And this isn’t only a problem for outside interpreters—it’s expensive for the AI’s further layers to distinguish close-together vectors, so I’d expect those layers to do it as cheaply (and as unreliably) as suffices on the training distribution, not in some extra-robust way that generalizes to clamping features at 5x their observed maximum.
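As a minimal numpy sketch of the geometry (the dimensionality, the directions, and the 0.9 cosine value are all made up for illustration): steering by adding a multiple of one unit direction raises the projection onto any nearby direction by that multiple times their cosine similarity, so a near-duplicate feature gets dragged along.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # residual-stream width (illustrative)

# Two hypothetical feature directions with high cosine similarity,
# standing in for "Golden Gate Bridge" and "SF fog".
u = rng.normal(size=d)
u /= np.linalg.norm(u)
noise = rng.normal(size=d)
noise -= (noise @ u) * u           # component orthogonal to u
noise /= np.linalg.norm(noise)
cos_sim = 0.9
v = cos_sim * u + np.sqrt(1 - cos_sim**2) * noise  # unit vector, cos(u, v) = 0.9

activation = rng.normal(size=d)

# "Clamp" the u-feature by adding a large multiple of its direction.
alpha = 5.0
steered = activation + alpha * u

# The projection onto v rises by alpha * cos(u, v): boosting one
# feature unavoidably boosts its near-duplicate.
print(f"Δ along u: {steered @ u - activation @ u:.2f}")  # 5.00
print(f"Δ along v: {steered @ v - activation @ v:.2f}")  # 4.50
```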