I think the Claude Sonnet Golden Gate Bridge feature is not crisply aligned with the human concept of “Golden Gate Bridge”.
It brings up the San Francisco fog far more than it would if it were just the bridge itself. I think it’s probably more like Golden Gate Bridge + SF fog + a bunch of other things (some SF related, some not).
This isn’t particularly surprising, given these are related ideas (both SF things), and the features were trained in an unsupervised way. But it still seems kinda important that the “natural” features SAEs find are not exactly the intuitively natural human concepts.
It might be interesting to look at how much the SAE training data actually mentions the fog and the Golden Gate Bridge together.
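A rough sketch of what that check could look like (my own toy code; the actual SAE training corpus isn’t public, so the `docs` list here is just a stand-in):

```python
def co_mention_rate(docs, term_a="golden gate bridge", term_b="fog"):
    """Fraction of documents mentioning term_a that also mention term_b."""
    has_a = has_both = 0
    for doc in docs:
        text = doc.lower()
        if term_a in text:
            has_a += 1
            if term_b in text:
                has_both += 1
    return has_both / has_a if has_a else 0.0

docs = [  # toy stand-in for the (unavailable) SAE training corpus
    "The Golden Gate Bridge emerged from the fog at dawn.",
    "The Golden Gate Bridge opened to traffic in 1937.",
    "Thick fog rolled over the Thames.",
]
print(co_mention_rate(docs))  # -> 0.5
```

If the bridge and the fog really do co-occur that often in the training text, an entangled feature would be the expected outcome.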
I don’t really think that this is super important for “fragility of value”-type concerns, but it probably is important for people who think we will easily be able to understand the features/internals of LLMs.
Almost all of my Golden Gate Claude chats mention the fog. Here is a not particularly cherrypicked example:
I’m not surprised if the features aren’t 100% clean, because this is after all a preliminary research prototype of a small approximation of a medium-sized version of a still sub-AGI LLM.
But I am a little more concerned that this is the first time I’ve seen anyone notice that the single, cherrypicked example of what is apparently a straightforward, familiar, concrete (literally) concept, which people have been playing with interactively for days, is clearly dirty and not actually a ‘Golden Gate Bridge feature’. This suggests it is not hard to fool a lot of people with an ‘interpretable feature’ which is still quite far from the human concept. And if you believe that it’s not super important for fragility-of-value because it’d have feasible fixes if noticed, how do you know anyone will notice?
The Anthropic post itself said more or less the same:
It’s more like a limitation of the paradigm, imo. If the “most golden gate” direction in activation-space and the “most SF fog” direction have high cosine similarity, there isn’t a way to increase activation of one of them but not the other. And this isn’t only a problem for outside interpreters—it’s expensive for the AI’s further layers to distinguish close-together vectors, so I’d expect the AI’s further layers to do it as cheaply and unreliably as works on the training distribution, and not in some extra-robust way that generalizes to clamping features at 5x their observed maximum.
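To make the geometry concrete, here’s a toy numpy sketch (my own illustration, not anything from the post): if the two unit directions have cosine similarity 0.9, then boosting the residual stream by 10 units along the “bridge” direction unavoidably boosts its readout along the “fog” direction by 9.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Two unit vectors standing in for the "Golden Gate Bridge" and "SF fog"
# feature directions, constructed to have cosine similarity exactly 0.9.
bridge = rng.normal(size=d)
bridge /= np.linalg.norm(bridge)
noise = rng.normal(size=d)
noise -= (noise @ bridge) * bridge        # drop the component along `bridge`
noise /= np.linalg.norm(noise)
fog = 0.9 * bridge + np.sqrt(1 - 0.9**2) * noise

resid = rng.normal(size=d)                # some residual-stream activation
steered = resid + 10.0 * bridge           # boost the "bridge" direction by 10

print(steered @ bridge - resid @ bridge)  # 10.0 (intended)
print(steered @ fog - resid @ fog)        # 9.0  (comes along for free)
```

Any downstream circuit that reads off the “fog” direction sees most of the boost whether or not the fog was meant.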
FWIW, I had noticed the same but had thought it was overly split (“Golden Gate Bridge, particularly its fog, colour and endpoints”) rather than dirty.
While I think you’re right it’s not cleanly “a Golden Gate Bridge feature,” I strongly suspect it may be activating a more specific feature vector rather than a less specific one.
It looks like this is somewhat of a measurement problem with SAEs. We are measuring SAE activations via text or image inputs, but what’s activated in generations seems to be “sensations associated with the Golden Gate Bridge.”
While googling “Golden Gate Bridge” might return the Wikipedia page, what’s the relative volume in a very broad training set between encyclopedic writing about the Golden Gate Bridge and experiential writing about the bridge on social media or in books and poems?
The model was trained to complete those too, and in theory should have developed successful features for doing so.
In the research examples, one of the matched images is a perspective shot from physically being on the bridge, one text example talks about the color of it, and another shows it at sunset.
But these are all feature activations from the model acting in a classifier role. That’s what the SAE is exploring: give it a set of inputs and see what lights up.
Yet in the generative role, with this vector maximized, the model keeps coming back over and over to content from a sensory standpoint.
Maybe generation based on functional vector manipulations will prove to be a more powerful interpretability technique than SAE probing passive activations alone?
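For concreteness, here’s a minimal sketch of the two modes with a toy, untrained SAE (my own illustration; the `ToySAE` class, its random weights, and the 5x clamp are stand-ins, not Anthropic’s code). “Probing” encodes an input and reads off which features fire; “steering” clamps one feature and writes the decoded difference back into the residual stream before generating.

```python
import numpy as np

class ToySAE:
    """Tiny sparse-autoencoder stand-in: random weights, not a trained model."""
    def __init__(self, d_model=64, n_features=256, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
        self.W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
        self.b_enc = np.zeros(n_features)

    def encode(self, resid):
        # (d_model,) residual-stream activation -> (n_features,) ReLU feature activations
        return np.maximum(resid @ self.W_enc + self.b_enc, 0.0)

    def decode(self, feats):
        return feats @ self.W_dec

sae = ToySAE()
resid = np.random.default_rng(1).normal(size=64)   # pretend residual-stream vector

# Probing / classifier mode: feed an input through and see which features light up.
feats = sae.encode(resid)
top = np.argsort(feats)[-5:][::-1]
print("top features:", top, feats[top])

# Steering / generative mode: clamp one feature well above its observed value,
# decode, and add the difference back into the residual stream for generation.
feature_id = int(top[0])
clamped = feats.copy()
clamped[feature_id] = 5.0 * feats.max()            # cf. the 5x-max clamp in the demo
steered_resid = resid + (sae.decode(clamped) - sae.decode(feats))
```

The probing view only ever shows which inputs activate the feature; the steering view shows what the rest of the network does with it, which is where the sensory content shows up.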
In the above chat, when that “golden gate vector” is magnified, it keeps talking about either the sensations of being the bridge, as if the bridge were its physical body with wind and waves hitting it, or the sensations of being on the bridge. Towards the end it even generates reflections, given its knowledge of the activation, on how the sensations are overwhelming. Not reflections on the Platonic form of an abstract bridge concept, but on the overwhelming physical sensations of the bridge’s materiality.
I’ll be curious to see more generative data and samples from this variation, but it looks like generative exploration of features may offer considerably more fidelity to their underlying impact on the network than SAE probing alone. Very exciting!!
I had a weird one today; I asked it to write a program for me, and it wrote one about the Golden Gate Bridge, and when I asked it why, it used the Russian word for “program” instead of the English word “program”, despite the rest of the response being entirely in English.
Kind of interesting how this is introducing people to Sonnet quirks in general, because that’s within my expectations for a Sonnet ‘typo’/writing quirk. Do they just not get used as much as Opus or Haiku?