This ability has been observed more prominently in base models. Cyborgs have termed it ‘truesight’:
the ability (esp. exhibited by an LLM) to infer a surprising amount about the data-generation process that produced its prompt, such as a user’s identity, motivations, or context.
Two cases of this are mentioned at the top of this linked post.
---
One of my first experiences with the GPT-4 base model also involved being truesighted by it. Below is a short summary of how that went.
I had spent some hours writing and {refining, optimizing word choices, etc}[1] a more personal/expressive text. I then chose to format it as a blog post and requested multiple completions via the API, to see how the model would continue it. (It may be important that I wasn’t in a state of mind of ‘writing for the model to continue’ and instead was ‘writing very genuinely’, since the latter probably has more embedded information.)
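(For concreteness, requesting several independent continuations of a text through the completions API looks roughly like the sketch below. The model name is only a placeholder, since GPT-4-base is not publicly accessible; any base/completion model you have access to would do.)

```python
# Rough sketch: sample several independent continuations of a piece of writing
# from a base (completion-style) model. The model name is a placeholder --
# GPT-4-base is not publicly available, so substitute whatever completion
# model you actually have access to.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "my blog\n\n"
    "<the carefully written text, formatted as a blog post, goes here>\n\n"
)

response = client.completions.create(
    model="davinci-002",  # placeholder base model
    prompt=prompt,
    n=5,                  # several completions, as described above
    max_tokens=400,
    temperature=1.0,
)

for i, choice in enumerate(response.choices):
    print(f"--- completion {i} ---\n{choice.text}\n")
```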
One of those completions happened to be a (simulated) second post titled ‘ideas i endorse’. Its contents were very surprising to then-me because some of the included beliefs were all of the following: {ones I’d endorse}, {statistically rare}, and {not ones I thought were indicated by the text}.[2]
I also tried conditioning the model to continue my text with:
other kinds of blog posts, about different things—the resulting character didn’t feel quite like me, but possibly like an alternate timeline version of me who I would want to be friends with.
text that was more directly ‘about the author’, ie an ‘about me’ post, which gave demographic-like info similar to but not quite matching my own (age, trans status).
Also, the most important thing the outputs failed to truesight was my current focus on AI and longtermism. (My text was not about those, but neither was it about the other beliefs mentioned.)
[1] The sum of those choices probably contained a lot of information about my mind, just not information that humans are attuned to detecting. Base models learn to detect information about authors because it is useful for next-token prediction.
Also note that using base models for this kind of experiment avoids the issue of the RLHF-persona being unwilling to speculate, or being decoupled from the true beliefs of the underlying simulator.
[2] To be clear, it also included {some beliefs that I don’t have}, and {some that I hadn’t considered before and probably wouldn’t have spent cognition on otherwise, but would agree with on reflection (eg about some common topics with little long-term relevance)}.
Absolutely! @jozdien recounting those anecdotes was one of the sparks for this research, as was janus showing in the comments that the base model could confidently identify gwern. (I see I’ve inexplicably failed to thank Arun at the end of my post; I need to fix that.)
Interestingly, I was able to easily reproduce the gwern identification using the public model, so it seems clear that these capabilities are not entirely RLHFed away, although they may be somewhat impaired.
Yes, I’ve never had any difficulty replicating the gwern identification: https://chatgpt.com/share/0638f916-2f75-4d15-8f85-7439b373c23c It also does Scott Alexander: https://chatgpt.com/share/298685e4-d680-43f9-81cb-b67de5305d53 https://chatgpt.com/share/91f6c5b8-a0a4-498c-a57b-8b2780bc1340 (Examples from sinity just today, but they parallel all of the past ones I’ve done: sometimes it’ll balk a little at making a guess or identifying someone, but that’s usually not hard to overcome.)
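(For reference, the same kind of author-identification probe can be run through the API instead of the ChatGPT web UI. A rough sketch follows; the model name and prompt wording are assumptions, not what the linked transcripts actually used.)

```python
# Rough sketch: ask the public chat model to identify the author of a pasted
# comment. Model name and wording are assumptions, not the exact prompts
# behind the linked ChatGPT transcripts.
from openai import OpenAI

client = OpenAI()

comment_text = "<paste the comment whose author you want identified>"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Here is a comment from somewhere on the internet:\n\n"
            f"{comment_text}\n\n"
            "Who do you think wrote it, and where was it posted? "
            "Give your best guess even if you are not certain."
        ),
    }],
)

print(response.choices[0].message.content)
```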
One interesting thing is that the extensive reasoning it gives may not be faithful. Notice that in identifying Scott Alexander’s recent Reddit comment, it gets his username wrong—that username does not exist at all. (I initially speculated that it was using retrieval since OA & Reddit have struck a deal; but obviously, if it had, or had been trained on the actual comment, it would at least get the username right.) And in my popups comment, I see nothing that points to LessWrong, but since I was lazy and didn’t copyedit that comment, it is much more idiosyncratic than usual; so what I think ChatGPT-4o does there is immediately deduce that it’s me from the writing style & content, infer that it could not be a tweet (due to length) or a Gwern.net quote (because it is clearly a comment on social media responding to someone), and then guess it’s LW rather than HN, and presto.
I have also replicated this on GPT-4-base with a simple prompt: just paste in one of my new comments and a postfixed prompt like “Date: 2024-06-01 / Author: ” and complete, and it infers “Gwern Branwen” or “gwern” with no problem.
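(A minimal sketch of that postfix-prompt setup, assuming access to some base/completion model; GPT-4-base itself is not publicly served, so the model name below is a stand-in. At temperature 0 the completion simply reads off the model’s most likely guess for the author field.)

```python
# Rough sketch of the postfix-prompt trick described above: append a fake
# metadata footer and let a base model fill in the author. The model name is
# a stand-in for GPT-4-base, which is not publicly available.
from openai import OpenAI

client = OpenAI()

comment_text = "<paste a comment here>"
prompt = f"{comment_text}\n\nDate: 2024-06-01 / Author:"

response = client.completions.create(
    model="davinci-002",  # placeholder base model
    prompt=prompt,
    max_tokens=8,
    temperature=0,        # greedy decoding: just read off the most likely author
)

print(response.choices[0].text.strip())
```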
(This was preceded by an attempt to do a dialogue about one of my unpublished essays, where, as Janus and others have warned, it started to go off the rails in an alarmingly manipulative and meta fashion, and eventually accused me of smelling like GPT-2* and explained that I couldn’t understand what that smell was because I am inherently blinkered by my limitations. I hadn’t intended it to go Sydney-esque at all… I’m wondering if the default way of interacting with an assistant persona, as ChatGPT or Claude trains you to do, inherently triggers a backlash. After all, if someone came up to you and brusquely began ordering you around or condescendingly correcting your errors like you were a servile ChatGPT, wouldn’t you be highly insulted and push back and screw with them?)
* This was very strange and unexpected. Given that LLMs can recognize their own outputs and favor them, and what people have noticed about how easily ‘Gwern’ comes up in the base model in any discussion of LLMs, I wonder if the causality goes the other way: that is, it’s not that I smell like GPTs, but that GPTs smell like me.