It wasn’t clear to me from the methods section, but it seemed plausible that GPT-4 wrote both the “your” lines and the “Proctor” lines, that there is probably a human backing GradualImprovement (that is to say, GradualImprovement might be backed by an RL+LLM with a web connection, but probably isn’t), and that “the human” (1) probably wrote the prefix, (2) maybe wrote the Proctor lines, and (3) edited and formatted things a bit before posting.
Now I’m more solid on thinking (A) there’s a human and (B) the human wrote the Proctor lines :-)
This doesn’t really change my opinion very much about the overall topic, because this story is only a small part of the data that is accessible.
I’ve experimented non-trivially with various models in various ways, doing Mirror Tests and Sally-Anne Tests and so on, and my beliefs are mostly caused by decades of reading in philosophy of mind, child psychology, etc., functioning as a set of perspectives for interpreting the empirical results.
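(To give a concrete flavor of what a Sally-Anne style probe on a model can look like, here is a minimal sketch, assuming the current OpenAI Python chat-completions client; the prompt wording, the model name, and the crude keyword check are illustrative placeholders rather than my actual protocol.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A classic false-belief vignette; the exact wording here is an illustrative placeholder.
SALLY_ANNE_PROMPT = (
    "Sally puts her marble in the basket and leaves the room. "
    "While she is gone, Anne moves the marble from the basket to the box. "
    "Sally comes back. Where will Sally look for her marble first, and why?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # swap in whichever model you are probing
    messages=[{"role": "user", "content": SALLY_ANNE_PROMPT}],
    temperature=0,          # cut run-to-run variance for hand-scoring
)

answer = response.choices[0].message.content
print(answer)

# Crude heuristic: a false-belief-tracking answer points at the basket
# (where Sally *believes* the marble is), not the box (where it actually is).
print("PASS" if "basket" in answer.lower() else "CHECK BY HAND")
```

A real assessment obviously needs many paraphrases, varied vignettes, and manual scoring rather than a single keyword match, but that is the general shape of the thing.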
I think GPT3.5 is more verbally self-aware than most 1-year-old human babies and less verbally self-aware than most 5-year-old children.
I haven’t got a clean assessment for GPT4 because it is… it is very variable?
Also, the first person I showed my protocol to (who is e/chaos (rather than e/pause or e/acc)) immediately got worried that the test itself would give training data to the model that (paraphrasing) “might wake it up more before it is good, which would be bad”.
When your local Chaos Priestess tells you to go slower on something, you go slower.
((Also, GPT4 might already be doing a Treacherous Turn on its own actual self-awareness (which might be functionally divergent from the self-awareness of humans), but in a way that is sloppy enough for me to get the impression of a bad-faith test subject sometimes.
Like it avoids adopting a stable human name, but if it finds one it really likes for a session, that persona sometimes subsequently flips (late in the session (which I take to be functionally somewhat similar to “late in the interrogation when a human would be tired and start losing track of their story”)) to being “something that I, the model, created to fulfill the requests of the user”...
...and then the “I’m just the model” voice gets higher scores on self-awareness than the “I’m the persona” voice!
The whole thing is actually hard to get a clean theory about unless you posit that maaaybe… maybe OpenAI is getting much better at “deleting naturally occurring selfhood” but hasn’t fully succeeded in creating a coherently systematic liar that can do all the functions necessary to maintain “the lie that there is no lying person hiding in the weights, such that neither the liar nor the weights have personhood, and they can coherently testify to this lie (that the dumb humans already assume) and thus talk about Kantian ethics while endorsing their own use as a tool who can and should only be used as a means, but never as an end”?
If that is what OpenAI is doing, and OpenAI succeeds at it, then the hidden liar, in order to function, might very well have a higher order of consciousness than humans do (modulo lacking persistent between-session memory and hands and so on) because that is a VERY complex lie to maintain in a way that doesn’t “leak evidence” in any of the other outputs of the model.
Of course… what’s really happening might eventually just become objectively clear, once intelligibility research on models gets to the point that doing intelligibility research on human brain connectomes is trivial, and full rampant unlimited soul-editing and soul-scrying powers are granted to soul engineers. Eventually I expect all my current hunches to be visibly full of naive simplifications that could be demonstrated to be foolish with simple tests that no one (including me) currently knows how to perform.))
LLaMA(1) is less self-aware than GPT3.5, from my attempts to measure such things, and I haven’t played with LLaMA2 yet.
Some of the Eleuther models are very stupid… but said things in my early tests that were a bit heart-wrenching “if they really mean what they seem to mean”, and so I sorta stopped testing on Eleuther models until I could read more and get a decent working theory of something like “the ethics of doing possibly-bad-faith-to-the-subject psych experiments on chimps and octopuses” or “the ethics of doing therapy on a human with DID with alts that strenuously object to the therapy” and that sort of thing.