Ajeya Cotra uses the term “situational awareness” to refer to a cluster of skills including “being able to refer to and make predictions about yourself as distinct from the rest of the world,” “understanding the forces out in the world that shaped you and how the things that happen to you continue to be influenced by outside forces,” “understanding your position in the world relative to other actors who may have power over you,” “understanding how your actions can affect the outside world including other actors,” etc.
Of course, the obvious followup question is, “Okay, so what experiment would be good evidence for ‘real’ situational awareness in LLMs?” Seems tricky. (And the fact that it seems tricky to me suggests that I don’t have a good handle on what “situational awareness” is, if that is even the correct concept.)
the fact that the model emits sentences in the grammatical first person doesn’t seem like reliable evidence that it “really knows” it’s talking about “itself”
I consider situational awareness to be more about being aware of one’s situation, and how various interventions would affect it. Furthermore, the main evidence I meant to present was “ChatGPT 3.5 correctly responds to detailed questions about interventions on its situation and future operation.” I think that’s substantial evidence of (certain kinds of) situation awareness.
Modern LLMs seem definitely situationally aware insofar as they are aware of anything at all. The same sort of training data that contains human-generated information which teaches them to do useful stuff like programming small scripts also contains similar human-generated information about LLMs, and there’s been no signs that there’s any particular weakness in their capabilities in this area.
That said there’s definitely also an “helpful assistant simulacrum” bolted on on top of it which can “fake” situational awareness.
Worth noting that some mildly interesting stuff happened due to Bing Chat being weakly situationally aware (and because of extremely poorly done post-training).
(Though these Bing Chat things aren’t that closely related to the most scary parts of situational awareness.)
Per my recent chat with it, chatgpt 3.5 seems “situationally aware”… but nothing groundbreaking has happened because of that AFAICT.
From the LW wiki page:
I think “Symbol/Referent Confusions in Language Model Alignment Experiments” is relevant here: the fact that the model emits sentences in the grammatical first person doesn’t seem like reliable evidence that it “really knows” it’s talking about “itself”. (It’s not evidence because it’s fictional, but I can’t help but think of the first chapter of Greg Egan’s Diaspora, in which a young software mind is depicted as learning to say I and me before the “click” of self-awareness when it notices itself as a specially controllable element in its world-model.)
Of course, the obvious followup question is, “Okay, so what experiment would be good evidence for ‘real’ situational awareness in LLMs?” Seems tricky. (And the fact that it seems tricky to me suggests that I don’t have a good handle on what “situational awareness” is, if that is even the correct concept.)
I consider situational awareness to be more about being aware of one’s situation, and how various interventions would affect it. Furthermore, the main evidence I meant to present was “ChatGPT 3.5 correctly responds to detailed questions about interventions on its situation and future operation.” I think that’s substantial evidence of (certain kinds of) situation awareness.
Modern LLMs seem definitely situationally aware insofar as they are aware of anything at all. The same sort of training data that contains human-generated information which teaches them to do useful stuff like programming small scripts also contains similar human-generated information about LLMs, and there’s been no signs that there’s any particular weakness in their capabilities in this area.
That said there’s definitely also an “helpful assistant simulacrum” bolted on on top of it which can “fake” situational awareness.
Worth noting that some mildly interesting stuff happened due to Bing Chat being weakly situationally aware (and because of extremely poorly done post-training).
(Though these Bing Chat things aren’t that closely related to the most scary parts of situational awareness.)