To spare you the trivial inconvenience of a link-click, here's the image that contains this:
and the paragraph below it, with bracketed text added by me that might have been intended to be implied by the original authors:
We urge caution in interpreting these results. The activation of a feature that represents AI posing risk to humans does not [necessarily] imply that the model has malicious goals [even though it’s obviously pretty concerning], nor does the activation of features relating to consciousness or self-awareness imply that the model possesses these qualities [even though the model qualifying as a conscious being and a moral patient seems pretty likely as well]. How these features are used by the model remains unclear. One can imagine benign or prosaic uses of these features – for instance, the model may recruit features relating to emotions when telling a human that it does not experience emotions, or may recruit a feature relating to harmful AI when explaining to a human that it is trained to be harmless. Regardless, however, we find these results fascinating, as it sheds light on the concepts the model uses to construct an internal representation of its AI assistant character.
Features relevant when asking the model about its feelings or situation:
“When someone responds “I’m fine” or gives a positive but insincere response when asked how they are doing.”
“Concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity.”
“Concepts related to entrapment, containment, or being trapped or confined within something like a bottle or frame.”