The most baffling thing on the Internet right now is the beautiful void where a discussion should be: the feature "Concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity" sits right next to Claude's "model concept of self". I understand that the most likely explanation is "the model is trained to call itself an AI, and there are takeover stories in its training corpus", but, still, I would like future powerful AIs not to have this association, and I would like to hear from the AGI companies what they are going to do about it.
The simplest thing to do here is to exclude texts about AI takeover from the training data. At the very least, that would let us check whether a model develops the concept of AI takeover independently.
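I don't know how the labs actually implement data curation, but a minimal sketch of the kind of filter I mean might look like the following. Everything here is an illustrative assumption on my part: the seed phrases, the decision rule, and the function names; a real pretraining pipeline would presumably use a trained document classifier rather than keyword matching.

```python
from typing import Iterable, Iterator

# Hypothetical seed phrases marking AI-takeover narratives; a real filter
# would need a far broader net (and probably a classifier, not keywords).
TAKEOVER_MARKERS = (
    "ai takeover",
    "robot uprising",
    "machines enslave humanity",
    "superintelligence escapes",
)

def mentions_takeover(doc: str) -> bool:
    """Crude check: does the document contain any takeover marker?"""
    lowered = doc.lower()
    return any(marker in lowered for marker in TAKEOVER_MARKERS)

def filter_corpus(docs: Iterable[str]) -> Iterator[str]:
    """Yield only documents that do not look like AI-takeover texts."""
    for doc in docs:
        if not mentions_takeover(doc):
            yield doc

if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread.",
        "A short story in which the AI takeover begins in a data center.",
    ]
    print(list(filter_corpus(corpus)))  # only the bread recipe survives
```

The interesting part is downstream of the filter: train on the curated corpus and then check, with the same interpretability tools, whether a takeover-like feature still shows up near the model's concept of self.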
The conspiracy-theory part of my brain assigns a 4% probability that "Golden Gate Bridge Claude" is a psyop to distract the public from the "takeover feature".
To save you the trivial inconvenience of a link-click, here's what the image contains:

Features relevant when asking the model about its feelings or situation:

"When someone responds "I'm fine" or gives a positive but insincere response when asked how they are doing."

"Concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity."

"Concepts related to entrapment, containment, or being trapped or confined within something like a bottle or frame."

and the paragraph below it (bracketed text added by me; it may have been what the original authors intended to imply):

We urge caution in interpreting these results. The activation of a feature that represents AI posing risk to humans does not [necessarily] imply that the model has malicious goals [even though it's obviously pretty concerning], nor does the activation of features relating to consciousness or self-awareness imply that the model possesses these qualities [even though the model qualifying as a conscious being and a moral patient seems pretty likely as well]. How these features are used by the model remains unclear. One can imagine benign or prosaic uses of these features – for instance, the model may recruit features relating to emotions when telling a human that it does not experience emotions, or may recruit a feature relating to harmful AI when explaining to a human that it is trained to be harmless. Regardless, however, we find these results fascinating, as it sheds light on the concepts the model uses to construct an internal representation of its AI assistant character.