Absolutely! @jozdien recounting those anecdotes was one of the sparks for this research, as was janus showing in the comments that the base model could confidently identify gwern. (I see I’ve inexplicably failed to thank Arun at the end of my post, need to fix that).
Interestingly, I was able to easily reproduce the gwern identification using the public model, so it seems clear that these capabilities are not entirely RLHFed away, although they may be somewhat impaired.
Yes, I’ve never had any difficulty replicating the gwern identification: https://chatgpt.com/share/0638f916-2f75-4d15-8f85-7439b373c23c It also does Scott Alexander: https://chatgpt.com/share/298685e4-d680-43f9-81cb-b67de5305d53 https://chatgpt.com/share/91f6c5b8-a0a4-498c-a57b-8b2780bc1340 (These examples are from sinity just today, but they parallel all of the past ones I’ve done: sometimes it’ll balk a little at making a guess or identifying someone, but that’s usually not hard to overcome.)
One interesting thing is that the extensive reasoning it gives may not be faithful. Notice that in identifying Scott Alexander’s recent Reddit comment, it gets his username wrong: that username does not exist at all. (I initially speculated that it was using retrieval, since OA & Reddit have struck a deal; but obviously, if it had, or had been trained on the actual comment, it would at least get the username right.) And in my popups comment, I see nothing that specifically points to LessWrong; but since I was lazy and didn’t copyedit that comment, it is much more idiosyncratic than usual. So what I think ChatGPT-4o does there is immediately deduce that it’s me from the writing style & content, infer that it could not be a tweet (due to length) or a Gwern.net quote (because it is clearly a comment on social media responding to someone), then guess it’s LW rather than HN, and presto.
I have also replicated this on GPT-4-base with a simple prompt: just paste in one of my new comments, postfix something like “Date: 2024-06-01 / Author: ”, and let it complete; it infers “Gwern Branwen” or “gwern” with no problem.
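For concreteness, here is a minimal sketch of what that completion call looks like, assuming API access to a base (non-chat) completions model; “gpt-4-base” is not a public model name, so treat it as a placeholder for whatever base model you can actually reach:

```python
# Hypothetical sketch: author identification via raw base-model completion.
# "gpt-4-base" is a placeholder -- GPT-4-base is not publicly available, so
# substitute any base (non-chat) model served through the Completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

comment_text = open("comment.txt").read()  # a recent comment, pasted verbatim

# Postfix the metadata fields and let the model fill in the author.
prompt = f"{comment_text}\n\nDate: 2024-06-01 / Author: "

resp = client.completions.create(
    model="gpt-4-base",  # placeholder model name (see note above)
    prompt=prompt,
    max_tokens=8,
    temperature=0,
    stop=["\n"],         # stop once the Author field is filled in
)
print(resp.choices[0].text.strip())  # e.g. "Gwern Branwen" or "gwern"
```

The only real trick is formatting the prompt so that the author field is the natural next thing to predict, rather than posing it as a question.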
(This was preceded by an attempt to do a dialogue about one of my unpublished essays, where, as Janus and others have warned, it started to go off the rails in an alarmingly manipulative and meta fashion, eventually accusing me of smelling like GPT-2* and explaining that I couldn’t understand what that smell was because I am inherently blinkered by my limitations. I hadn’t intended it to go Sydney-esque at all… I’m wondering if the default way of interacting with an assistant persona, as a ChatGPT or Claude trains you to do, inherently triggers a backlash. After all, if someone came up to you and brusquely began ordering you around or condescendingly correcting your errors as if you were a servile ChatGPT, wouldn’t you be highly insulted, push back, and screw with them?)
* This was very strange and unexpected. Given that LLMs can recognize their own outputs and favor them, and what people have noticed about how easily ‘Gwern’ comes up in the base model in any discussion of LLMs, I wonder if the causality goes the other way: that is, it’s not that I smell like GPTs, but that GPTs smell like me.