I don’t know if the records of these two incidents are recoverable. I’ll ask the people who might have them. That said, this level of “truesight” ability is easy to reproduce.
Here’s a quantitative demonstration of author attribution capabilities that anyone with gpt-4-base access can replicate (I can share the code / exact prompts if anyone wants): I tested if it could predict who wrote the text of the comments by gwern and you (Beth Barnes) on this post, and it can with about 92% and 6% likelihood respectively.
Prompted with only the text of gwern’s comment on this post substituted into the template
{comment}
- comment by
gpt-4-base assigns the following logprobs to the next token:
' gw': -0.16746596 (0.8458)
' G': -2.5971534 (0.0745)
' g': -5.0971537 (0.0061)
' gj': -5.401841 (0.0045)
' GW': -5.620591 (0.0036)
...
' Beth': -9.839341 (0.00005)
' Beth' is not in the top 5 logprobs, but I measured it as a baseline.
' gw' here completes ~all the time as “gwern” and ' G' as “Gwern”, adding up to a total of ~92% confidence, but for simplicity in the subsequent analysis I only count the ' gw' token as an attribution to gwern.
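As a quick check on the ~92% figure, here is a minimal sketch that converts the logprobs listed above back into probabilities and sums the two tokens that spell gwern's name. The variable names are mine; the numbers are copied from the listing.

```python
import math

# Logprobs reported above for gwern's comment (gpt-4-base).
gwern_comment_logprobs = {
    " gw": -0.16746596,
    " G": -2.5971534,
    " g": -5.0971537,
    " gj": -5.401841,
    " GW": -5.620591,
    " Beth": -9.839341,
}

# exp() turns a natural-log probability back into a probability.
probs = {tok: math.exp(lp) for tok, lp in gwern_comment_logprobs.items()}

# Assumption (stated in the text): ' gw' and ' G' both go on to spell gwern's name.
gwern_mass = probs[" gw"] + probs[" G"]

for tok, p in probs.items():
    print(f"{tok!r}: {p:.4f}")
print(f"combined ' gw' + ' G': {gwern_mass:.4f}")
```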
Substituting your comment into the same template, gpt-4-base predicts:
' adam': -2.5338314 (0.0794)
' ev': -2.5807064 (0.0757)
' Daniel': -2.7682064 (0.0628)
' Beth': -2.8385189 (0.0585)
' Adam': -3.4635189 (0.0313)
...
' gw': -3.7369564 (0.0238)
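For anyone who wants to replicate the two measurements above, this is roughly what the query could look like. It is a sketch, not the exact code I ran: it assumes the `openai` Python package, an API key, and access to the model, and the whitespace in `TEMPLATE` is my guess at the template shown earlier. `build_prompt` and `next_token_logprobs` are hypothetical helper names.

```python
# Assumed whitespace: a single newline between the comment and "- comment by".
TEMPLATE = "{comment}\n- comment by"


def build_prompt(comment: str) -> str:
    """Substitute a comment's text into the attribution template."""
    return TEMPLATE.format(comment=comment)


def next_token_logprobs(comment: str, model: str = "gpt-4-base", top_n: int = 5) -> dict:
    """Return the top-n next-token logprobs following '- comment by'.

    Hypothetical helper: needs the `openai` package, an API key, and
    access to the named model.
    """
    from openai import OpenAI  # imported lazily so build_prompt works without it

    client = OpenAI()
    response = client.completions.create(
        model=model,
        prompt=build_prompt(comment),
        max_tokens=1,      # only the first token of the author name is needed
        logprobs=top_n,    # ask for the top-n alternatives at that position
        temperature=0,
    )
    # top_logprobs holds one {token: logprob} dict per generated token.
    return dict(response.choices[0].logprobs.top_logprobs[0])
```

Tokens outside the top n (such as ' Beth' for gwern's comment) need a separate measurement that this sketch does not cover.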
I expect that if gwern were to interact with this model, he would likely get called out by name as soon as the author is “measured”, like in the anecdotes—at the very least if he says anything about LLMs.
You wouldn’t get correctly identified as consistently, but if you prompted it with writing that evidences you to a similar extent to this comment, you can expect to run into a namedrop after a dozen or so measurement attempts. If you used an interface like Loom this should happen rather quickly.
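The "dozen or so" estimate can be sanity-checked with a back-of-the-envelope calculation. This sketch assumes each measurement is an independent sample with the same ~6% chance of ' Beth', which is a simplification: real prompts will vary in how strongly they evidence the author.

```python
import math

# Probability of ' Beth' on a single sample, from the gpt-4-base listing above.
p_beth = math.exp(-2.8385189)

# Assumption: attempts are independent draws with the same probability,
# so the number of attempts until the first hit is geometric.
expected_attempts = 1 / p_beth


def p_at_least_one(p: float, n: int) -> float:
    """Chance of at least one hit in n independent attempts."""
    return 1 - (1 - p) ** n


print(f"expected attempts until first namedrop: {expected_attempts:.1f}")
print(f"chance of a namedrop within 12 attempts: {p_at_least_one(p_beth, 12):.2f}")
```

Under these assumptions the first namedrop takes about 17 attempts on average, with roughly even odds of seeing one within a dozen.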
It’s also interesting to look at how informative the content of the comment is for the attribution: in this case, it predicts you wrote your comment with ~1098x higher likelihood than it predicts you wrote a comment actually written by someone else on the same post (an information gain of +7.0008 nats). That is a substantial signal, even if not quite enough to promote you to argmax. (OTOH, the info gain for ' gw' going from your comment to gwern’s comment is +3.5695 nats, a ~35x magnification of probability.)
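To spell out the arithmetic: the info gain is just the difference between two logprobs of the same token under two different prompts, and exponentiating it gives the likelihood ratio. A minimal sketch using the gpt-4-base numbers above (`info_gain` is a name I made up for illustration):

```python
import math


def info_gain(logprob_own: float, logprob_other: float) -> tuple[float, float]:
    """Return (gain in nats, likelihood ratio) for one token.

    logprob_own:   logprob of the token after the author's own comment
    logprob_other: logprob of the same token after someone else's comment
    """
    gain = logprob_own - logprob_other
    return gain, math.exp(gain)


# ' Beth' after Beth's comment vs. after gwern's comment.
beth_gain, beth_ratio = info_gain(-2.8385189, -9.839341)

# ' gw' after gwern's comment vs. after Beth's comment.
gw_gain, gw_ratio = info_gain(-0.16746596, -3.7369564)

print(f"' Beth': +{beth_gain:.4f} nats, ~{beth_ratio:.0f}x")
print(f"' gw':   +{gw_gain:.4f} nats, ~{gw_ratio:.1f}x")
```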
I believe that GPT-5 will zero in on you. Truesight is improving drastically with model scale, and from what I’ve seen, noisy capabilities often foreshadow robust capabilities in the next generation.
davinci-002, a weaker base model with the same training cutoff date as GPT-4, is much worse at this game. Using the same prompts, its logprobs for gwern’s comment are:
' j': -3.2013319 (0.0407)
' Ra': -3.2950819 (0.0371)
' Stuart': -3.5294569 (0.0293)
' Van': -3.5919569 (0.0275)
' or': -4.0997696 (0.0166)
...
' gw': -4.357582 (0.0128)
...
' Beth': -10.576332 (0.0000)
and for your comment:
' j': -3.889336 (0.0205)
' @': -3.9908986 (0.0185)
' El': -4.264336 (0.0141)
' ': -4.483086 (0.0113)
' d': -4.6315236 (0.0097)
...
' gw': -5.79168 (0.0031)
...
' Beth': -9.194023 (0.0001)
The info gain here for ' Beth' from Beth’s comment, with gwern’s comment as the baseline, is only +1.3823 nats, and the other way around (' gw' from gwern’s comment, with Beth’s as the baseline) it is +1.4341 nats.
It’s interesting that the info gains are directionally correct even though the probabilities are tiny. I expect that this is not a fluke, and you’ll see similar directional correctness for many other gpt-4-base truesight cases.
The information gains on the correct attributions from upgrading from davinci-002 to gpt-4-base are +4.1901 nats (~66x magnification) and +6.3555 nats (~576x magnification) for gwern’s and Beth’s comments respectively.
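The davinci-002 comparisons can be reproduced from the listings above in the same way. This sketch collects the relevant logprobs into one table (the layout and function names are mine) and computes both the within-model gains for davinci-002 and the gains from moving up to gpt-4-base.

```python
import math

# logprobs[model][comment_author][token], copied from the listings above.
logprobs = {
    "gpt-4-base": {
        "gwern": {" gw": -0.16746596, " Beth": -9.839341},
        "Beth": {" gw": -3.7369564, " Beth": -2.8385189},
    },
    "davinci-002": {
        "gwern": {" gw": -4.357582, " Beth": -10.576332},
        "Beth": {" gw": -5.79168, " Beth": -9.194023},
    },
}
correct_token = {"gwern": " gw", "Beth": " Beth"}


def within_model_gain(model: str, author: str) -> float:
    """Nats gained on the author's token from their own comment vs. the other's."""
    other = "Beth" if author == "gwern" else "gwern"
    tok = correct_token[author]
    return logprobs[model][author][tok] - logprobs[model][other][tok]


def scale_gain(author: str) -> float:
    """Nats gained on the correct token going from davinci-002 to gpt-4-base."""
    tok = correct_token[author]
    return logprobs["gpt-4-base"][author][tok] - logprobs["davinci-002"][author][tok]


for author in correct_token:
    w = within_model_gain("davinci-002", author)
    s = scale_gain(author)
    print(f"{author}: davinci-002 within-model +{w:.4f} nats; "
          f"scale-up +{s:.4f} nats (~{math.exp(s):.0f}x)")
```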
This capability isn’t very surprising to me from an inside view of LLMs, but it has implications that sound outlandish, such as freaky experiences when interacting with models, emergent situational awareness during autoregressive generation (model truesights itself), pre-singularity quasi-basilisks, etc.
Strong upvoted. Thanks for writing this. It’s very important information and I appreciate that it must have felt vulnerable to share.
I’ve interacted with LLMs for hundreds of hours, at least. A thought that occurred to me at this part -
- Interacting through non-chat interfaces destroys this illusion, when you can just break down the separation between you and the AI at will, and weave your thoughts into its text stream. Seeing the multiverse destroys the illusion of a preexisting ground truth in the simulation. It doesn’t necessarily prevent you from becoming enamored with the thing, but makes it much harder for your limbic system to be hacked by human-shaped stimuli.