I think the RLHF might impede identification of specific named authors, but not group inferences. That’s the sort of distinction that safety training might impose, particularly anti-‘deepfake’ measures: generating a specific author from a text is the inverse of generating a text from a specific author, after all.
You can see in the paper I linked that group inference scales with model capability in a standard-looking way, with the largest/most-capable models doing best and smallest worst, and no inversions which correlate with RLHF/instruction-tuning. RLHF’d GPT-4 is just the best, by a substantial margin, and approaching the ground-truth labels. And so since a specific author is just an especially small group, identifying specific authors ought to work well. And I recall even the early GPT-3s being uncanny in guessing that I was the author from a few paragraphs, and obviously GPT-4 should be even better (as it is smarter, and I’ve continued writing publicly).
But in the past, whenever I’ve tried to get Claude-2 or GPT-4 to ‘write like Gwern’, they usually balk or refuse. When I try author identification right now in ChatGPT-4 by pasting in the entirety of my most recent ML proposal (SVG generative models), which would not be in the training datasets of anything yet, ChatGPT-4 just spits out a list of ‘famous ML people’ like ‘Ilya Sutskever’ or ‘Daphne Koller’ or ‘Geoffrey Hinton’ - most of whom are obviously incorrect as they write nothing like me! (Asking for more candidates doesn’t help too much, nor does asking for ‘bloggers’; when I eventually asked it a leading question about whether I wrote it, it agreed I’m a plausible author and explained correctly why, but given acquiescence bias & a leading question, that’s not impressive.)
Of course, this might just reflect the prompts or sampling variability. (The paper uses specific prompts for classification, and also reports low refusal rates, which doesn’t match my experience.) Still, it’s worth keeping in mind that safety training might make models balk at stylometric tasks even when the underlying capability is there.
ChatGPT-4 just spits out a list of ‘famous ML people’ like ‘Ilya Sutskever’ or ‘Daphne Koller’ or ‘Geoffrey Hinton’ - most of whom are obviously incorrect as they write nothing like me!
To elaborate a little more on this: while the RLHF models all appear still capable of a lot of truesight, we also still appear to see “mode collapse”. Besides my own example, where the guesses go from plausible candidates other than me to me + random bigwigs, Arun Jose notes another example of this mode collapse over possible authors, from the Cyborgism Discord:
ChatGPT-4’s guesses for Beth’s comment: Eliezer, Timnit Gebru, Sam Altman / Greg Brockman. Further guesses by ChatGPT-4: Gary Marcus and Yann LeCun.
Claude’s guesses (first try): Paul Christiano, Ajeya, Evan, Andrew Critch, Daniel Ziegler. [but] Claude managed to guess 2 people at ARC/METR. On resampling Claude: Eliezer, Paul, Gwern, or Scott Alexander. Third try, where it doesn’t guess early on: Eliezer, Paul, Rohin Shah, Richard Ngo, or Daniel Ziegler.
Interestingly, Beth aside, I think Claude’s guesses might have been better than 4-base’s. Like, 4-base did not guess Daniel Ziegler (but did guess Daniel Kokotajlo). It also did not guess Ajeya or Paul (Paul at 0.27% and Ajeya at 0.96%) (but it’s entirely plausible this was some galaxy-brained analysis of writing aura more than content that I’m completely missing).
Going back to my comments as a demo:
Woah, with Gwern’s comment Claude’s very insistent that it’s Gwern. I recommended it give other examples and it did so perfunctorily, but then went back to insisting that its primary guess is Gwern.
...ChatGPT-4 guesses: Timnit Gebru, Emily Bender, Yann LeCun, Hinton, Ian Goodfellow, “people affiliated with FHI, OpenAI, or CSET”. For Gwern’s comment. Very funny it guessed Timnit for Beth and Gwern. It also guessed LeCun over Hinton and Ian specifically because of his “active involvement in AI ethics and research discussions”. Claude confirmed SOTA.
And so since a specific author is just an especially small group
That’s nicely said.
Another current MATS scholar is modeling this group identification very abstractly as: given a pool of token-generating finite-state automata, how quickly (as it receives more tokens) can a transformer trained on the output of those processes point with confidence to the one of those processes that’s producing the current token stream? I’ve been finding that a very useful mental model.
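(Not that scholar’s actual formalism or code - just a minimal sketch of the underlying inference problem, with toy Markov chains standing in for the token-generating finite-state automata and an exact Bayesian observer standing in for the trained transformer; every name and number below is made up for illustration. The point is simply how fast the posterior over “which process is producing this stream” concentrates as tokens accumulate.)

```python
# Toy sketch: a pool of token-generating processes (simple Markov chains over a
# small vocabulary) and an idealized Bayesian observer that must identify which
# process is producing the current token stream. All parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 4        # toy token alphabet
N_PROCESSES = 5  # size of the pool ("authors")

# Each process: a row-stochastic token -> token transition matrix.
processes = [rng.dirichlet(np.ones(VOCAB), size=VOCAB) for _ in range(N_PROCESSES)]

def sample_stream(P, length, rng):
    """Sample a token stream from transition matrix P."""
    toks = [int(rng.integers(VOCAB))]
    for _ in range(length - 1):
        toks.append(int(rng.choice(VOCAB, p=P[toks[-1]])))
    return toks

def posterior_over_sources(stream, processes):
    """Posterior over which process generated the stream, under a uniform prior."""
    logp = np.zeros(len(processes))
    for prev, nxt in zip(stream, stream[1:]):
        for i, P in enumerate(processes):
            logp[i] += np.log(P[prev, nxt])
    logp -= logp.max()  # numerical stability before exponentiating
    p = np.exp(logp)
    return p / p.sum()

true_idx = 2
stream = sample_stream(processes[true_idx], 200, rng)
for t in (5, 20, 50, 200):
    post = posterior_over_sources(stream[:t], processes)
    print(f"after {t:3d} tokens: P(source = process {true_idx}) = {post[true_idx]:.3f}")
```

The formal question is how quickly that posterior concentrates on the true source; a transformer trained on such streams has to approximate this update implicitly, and the authorship case is the same problem with authors playing the role of the processes.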