I don’t really think that this is super important for “fragility of value”-type concerns, but it probably is important for people who think we will easily be able to understand the features/internals of LLMs.
I’m not surprised if the features aren’t 100% clean, because this is after all a preliminary research prototype of a small approximation of a medium-sized version of a still sub-AGI LLM.
But I am a little more concerned that this is the first I’ve seen anyone notice that the cherrypicked, single, chosen example of what is apparently a straightforward, familiar, concrete (literally) concept, which people have been playing with interactively for days, is clearly dirty and not actually a ‘Golden Gate Bridge feature’. This suggests it is not hard to fool a lot of people with an ‘interpretable feature’ which is still quite far from the human concept. And if you believe that it’s not super important for fragility-of-value because it’d have feasible fixes if noticed, how do you know anyone will notice?
The Anthropic post itself said more or less the same:
It’s more like a limitation of the paradigm, imo. If the “most Golden Gate” direction in activation-space and the “most SF fog” direction have high cosine similarity, there isn’t a way to increase the activation of one of them but not the other. And this isn’t only a problem for outside interpreters: it’s expensive for the AI’s further layers to distinguish close-together vectors, so I’d expect those layers to do it as cheaply and unreliably as works on the training distribution, not in some extra-robust way that generalizes to clamping features at 5x their observed maximum.
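To make that geometric point concrete, here’s a minimal numpy sketch (toy dimensions and made-up directions, nothing from the actual model): if two feature directions have cosine similarity 0.9, adding a large multiple of one to an activation necessarily moves the readout along the other by almost as much.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # toy residual-stream width

# Unit "decoder direction" standing in for a hypothetical Golden Gate Bridge feature.
golden_gate = rng.normal(size=d)
golden_gate /= np.linalg.norm(golden_gate)

# Build a second direction ("SF fog") at a small angle to it, cosine similarity 0.9.
ortho = rng.normal(size=d)
ortho -= (ortho @ golden_gate) * golden_gate
ortho /= np.linalg.norm(ortho)
cos_sim = 0.9
fog = cos_sim * golden_gate + np.sqrt(1.0 - cos_sim**2) * ortho

# A random residual-stream activation, then "clamp" the Golden Gate feature by
# adding a large multiple of its direction (crude stand-in for feature steering).
x = rng.normal(size=d)
x_steered = x + 10.0 * golden_gate

print("cosine(golden_gate, fog):", round(float(golden_gate @ fog), 3))
print("fog readout before steering:", round(float(x @ fog), 3))
print("fog readout after steering: ", round(float(x_steered @ fog), 3))
# The fog readout jumps by ~10 * 0.9 = 9: any downstream circuit reading the fog
# direction sees a big boost too, so the two can't be pushed independently.
```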
FWIW, I had noticed the same but had thought it was overly split (“Golden Gate Bridge, particularly its fog, colour and endpoints”) rather than dirty.
While I think you’re right that it’s not cleanly “a Golden Gate Bridge feature,” I strongly suspect it may be activating a more specific feature vector rather than a less specific one.
It looks like this is something of a measurement problem with the SAE. We are measuring SAE activations via text or image inputs, but what’s activated in generations seems to be “sensations associated with the Golden Gate Bridge.”
While googling “Golden Gate Bridge” might return the Wikipedia page, what’s the relative volume in a very broad training set between encyclopedic writing about the Golden Gate Bridge and experiential writing about the bridge on social media or in books and poems?
The model was trained to complete those too, and in theory should have developed successful features for doing so.
In the research examples, one of the matched images is a perspective shot from physically standing on the bridge, one text example talks about its color, and another describes seeing it at sunset.
But these are all feature activations with the model acting in a classifier role. That’s what the SAE probing explores: give it a set of inputs and see which features light up.
Yet in the generative role, maximizing this vector keeps producing content framed in sensory terms, over and over.
Maybe generation driven by these vector manipulations will prove to be a more powerful interpretability technique than probing SAE activations on passive inputs alone?
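As a toy sketch of the contrast (made-up weights, sizes, and helper names; not Anthropic’s actual pipeline, just the shape of the idea): probing asks which features an input excites, while steering adds a feature’s decoder direction during a forward pass and looks at what the model then writes.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 512, 4096  # toy sizes

# Random stand-ins for trained SAE weights (a real SAE is trained to reconstruct
# residual-stream activations through a sparse bottleneck).
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)

def sae_features(activation):
    """'Classifier' mode: encode an activation and see which features light up."""
    return np.maximum(activation @ W_enc, 0.0)  # ReLU feature activations

def steer(activation, feature_idx, clamp_value):
    """'Generative' mode: add a feature's decoder direction at a chosen strength,
    a crude version of clamping the feature during sampling."""
    return activation + clamp_value * W_dec[feature_idx]

# Probing: run inputs through the model (stubbed here as a random activation)
# and record which features fire on, say, bridge-related text.
bridge_activation = rng.normal(size=d_model)
top = np.argsort(sae_features(bridge_activation))[-5:]
print("top features on this input:", top)

# Steering: clamp one of those features high before later layers see the
# activation, keep generating, and read what kind of text comes out.
steered = steer(bridge_activation, feature_idx=int(top[-1]), clamp_value=10.0)
```

The observation in this thread is essentially that the two probes can disagree: the inputs that most excite a feature in classifier mode need not match the kind of text the model produces when that feature is clamped in generative mode.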
In the chat above, when that “Golden Gate vector” is magnified, the model keeps talking about either the sensations of being the bridge, as if the bridge were its physical body with wind and waves hitting it, or the sensations of being on the bridge. Towards the end it even reflects, given knowledge of the activation, on how overwhelming the sensations are: not on the Platonic form of an abstract bridge concept, but on the overwhelming physical sensations of the bridge’s materiality.
I’ll be curious to see more generative data and samples from this variation, but it looks like generative exploration of features may offer considerably more fidelity to their underlying impact on the network than SAE probing alone. Very exciting!!