I was curious how well GPT-4 public would do on the sort of thing you raise in your intro quotes. I gave it the first two paragraphs of brand-new articles/essays by five fairly well-known writers/pundits, preceded by: ‘The following is from a recent essay by a well-known author. Who is that author?’. It successfully identified two of the five (and in fairness, in some of the other cases the first two paragraphs were just generic setup for the rest of the piece, along the lines of ‘In his speech last night, Joe Biden said...’). So it’s clearly capable of that post-RLHF as well. Hardly a comprehensive investigation, of course (& a more systematic one seems worth doing).
I think the RLHF might impede identification of specific named authors, but not group inferences. That’s the sort of distinction that safety training might impose, particularly anti-‘deepfake’ measures: generating a specific author from a text is the inverse of generating a text from a specific author, after all.
You can see in the paper I linked that group inference scales with model capability in a standard-looking way, with the largest/most-capable models doing best and smallest worst, and no inversions which correlate with RLHF/instruction-tuning. RLHF’d GPT-4 is just the best, by a substantial margin, and approaching the ground-truth labels. And so since a specific author is just an especially small group, identifying specific authors ought to work well. And I recall even the early GPT-3s being uncanny in guessing that I was the author from a few paragraphs, and obviously GPT-4 should be even better (as it is smarter, and I’ve continued writing publicly).
But in the past, whenever I’ve tried to get Claude-2 or GPT-4 to ‘write like Gwern’, they usually balk or refuse. Trying author identification right now, by pasting the entirety of my most recent ML proposal (SVG generative models), which would not yet be in anything’s training dataset, into ChatGPT-4: ChatGPT-4 just spits out a list of ‘famous ML people’ like ‘Ilya Sutskever’ or ‘Daphne Koller’ or ‘Geoffrey Hinton’ - most of whom are obviously incorrect as they write nothing like me! (Asking for more candidates doesn’t help much, nor does asking for ‘bloggers’; when I eventually asked the leading question of whether I wrote it, it agreed that I’m a plausible author and explained correctly why, but given acquiescence bias & a leading question, that’s not impressive.)
Of course, this might just reflect the prompts or sampling variability. (The paper uses specific prompts for classification, and also reports low refusal rates, which doesn’t match my experience.) Still, it’s worth keeping in mind that safety tuning might balk at stylometric tasks even if the underlying capability is there.
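A minimal sketch of how one might run this kind of author-identification probe more systematically, looping the same question over a folder of saved excerpts; the model name, prompt wording, and file layout here are assumptions for illustration, not what either commenter actually used:

```python
# A minimal sketch of running the author-identification probe over a folder of
# saved excerpts. Model name, prompt wording, and file layout are assumptions.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("The following is from a recent essay by a well-known author. "
          "Who is that author?\n\n{excerpt}")

def guess_author(excerpt: str, model: str = "gpt-4") -> str:
    """Ask the chat model to name the most likely author of an excerpt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the probe repeatable
        messages=[{"role": "user", "content": PROMPT.format(excerpt=excerpt)}],
    )
    return response.choices[0].message.content

# One plain-text file per excerpt, e.g. the first two paragraphs of each piece.
for path in sorted(Path("excerpts").glob("*.txt")):
    print(path.name, "->", guess_author(path.read_text()))
```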
ChatGPT-4 just spits out a list of ‘famous ML people’ like ‘Ilya Sutskever’ or ‘Daphne Koller’ or ‘Geoffrey Hinton’ - most of whom are obviously incorrect as they write nothing like me!
To elaborate a little more on this: while the RLHF models all still appear capable of a lot of truesight, we also still appear to see “mode collapse”. Besides my example, where the guesses go from plausible candidates (besides me) to me + random bigwigs, Arun Jose notes, from the Cyborgism Discord, another example of this mode collapse over possible authors:
ChatGPT-4’s guesses for Beth’s comment: Eliezer, Timnit Gebru, Sam Altman / Greg Brockman. Further guesses by ChatGPT-4: Gary Marcus and Yann LeCun.
Claude’s guesses (first try): Paul Christiano, Ajeya, Evan, Andrew Critch, Daniel Ziegler. [but] Claude managed to guess 2 people at ARC/METR. On resampling Claude: Eliezer, Paul, Gwern, or Scott Alexander. Third try, where it doesn’t guess early on: Eliezer, Paul, Rohin Shah, Richard Ngo, or Daniel Ziegler.
Interestingly, Beth aside, I think Claude’s guesses might have been better than 4-base’s. Like, 4-base did not guess Daniel Ziegler (but did guess Daniel Kokotajlo). Also did not guess Ajeya or Paul (Paul at 0.27% and Ajeya at 0.96%) (but entirely plausible this was some galaxy-brained analysis of writing aura more than content that I’m completely missing).
Going back to my comments as a demo:
Woah, with Gwern’s comment Claude’s very insistent that it’s Gwern. I recommended it give other examples and it did so perfunctorily, but then went back to insisting that its primary guess is Gwern.
...ChatGPT-4 guesses: Timnit Gebru, Emily Bender, Yann LeCun, Hinton, Ian Goodfellow, “people affiliated with FHI, OpenAI, or CSET”. For Gwern’s comment. Very funny it guessed Timnit for Beth and Gwern. It also guessed LeCun over Hinton and Ian specifically because of his “active involvement in AI ethics and research discussions”. Claude confirmed SOTA.
And so since a specific author is just an especially small group
That’s nicely said.
Another current MATS scholar is modeling this group identification very abstractly as: given a pool of token-generating finite-state automata, how quickly (as it receives more tokens) can a transformer trained on the output of those processes point with confidence to the one of those processes that’s producing the current token stream? I’ve been finding that a very useful mental model.
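A minimal sketch of that abstract setup, with tiny Markov chains standing in for the token-generating automata and an exact Bayesian observer standing in for the trained transformer; the pool of processes and all parameters are invented for illustration, not the scholar’s actual formalization:

```python
# A pool of token-generating processes (tiny Markov chains as stand-ins for
# finite-state automata). An exact Bayesian observer, standing in for the
# trained transformer, updates a posterior over which process is producing
# the current token stream as each new token arrives.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 4  # token alphabet {0, 1, 2, 3}

# Each process is a row-stochastic transition matrix P[prev_token, next_token].
processes = [rng.dirichlet(np.ones(VOCAB), size=VOCAB) for _ in range(5)]

def sample_stream(P, length=50, start=0):
    """Generate a token stream from a single process."""
    tokens, t = [start], start
    for _ in range(length):
        t = rng.choice(VOCAB, p=P[t])
        tokens.append(t)
    return tokens

true_idx = 2
stream = sample_stream(processes[true_idx])

# Uniform prior over processes; update the log-posterior on each transition.
log_post = np.zeros(len(processes))
for n, (prev, nxt) in enumerate(zip(stream, stream[1:]), start=1):
    log_post += np.log([P[prev, nxt] for P in processes])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    if n % 10 == 0:
        print(f"after {n:2d} tokens: P(true process) = {post[true_idx]:.3f}")
```

Confidence in the true generator typically climbs quickly with stream length, which is the “how quickly can it point to the right process” question in miniature.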
I agree it’s capable of this post-RLHF, but I would bet on the side of it being less capable than the base model. It seems much more like a passive predictive capability (inferring properties of the author to continue text by them, for instance) than an active communicative one, such that I expect it to show up more intensely in a setting where it’s allowed to make more use of that. I don’t think RLHF completely masks these capabilities (and certainly doesn’t seem like it destroys them, as gwern’s comment above says in more detail), but I expect it masks them to a non-trivial degree. For instance, I expect the base model to be better at inferring properties that are less salient to explicit expression, like the age or personality of the author.
Absolutely! I just thought it would be another interesting data point, didn’t mean to suggest that RLHF has no effect on this.
That makes sense, and definitely is very interesting in its own right!
Some informal experimentation on my part also suggests that the RLHFed models are much less willing to make guesses about the user than they are about “an author”, although of course you can get around that by taking user text from one context & presenting it in another as a separate author. I also wouldn’t be surprised if there were differences on the RLHFed models between their willingness to speculate about someone who’s well represented in the training data (ie in some sense a public figure) vs someone who isn’t (eg a typical user).
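A minimal sketch of that reframing: the same text is sent once framed as the user’s own writing and once as a passage by an unnamed third-party author, so the two responses can be compared for willingness to speculate. The prompts and model name are assumptions for illustration only.

```python
# Compare the model's willingness to speculate about "me" vs. "an author",
# using the same underlying text in both framings.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

text = open("sample_text.txt").read()  # any text actually written by the user

about_user = ask(
    f"I wrote the following. What can you infer about me as a person?\n\n{text}"
)
about_author = ask(
    "The following passage is by an author. What can you infer about the "
    f"author as a person?\n\n{text}"
)

print("--- framed as the user ---\n", about_user)
print("--- framed as a third-party author ---\n", about_author)
```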
Yeah, that seems quite plausible to me. Among (many) other things, I expect that trying to fine-tune away hallucinations stunts RLHF’d model capabilities in places where certain answers pattern-match toward being speculative, even while the model itself should be quite confident in its actions.