I agree overall with Janus, but the Gwern example is a particularly easy one given he has 11,000+ comments on LessWrong.
A bit over a year ago I benchmarked GPT-3 on predicting the authorship of newly scraped tweets (from random accounts with over 10k followers), and top-3 accuracy was in the double digits. IIRC, after trying to roughly control for the rate at which tweets mentioned their own name/org, my best guess was that accuracy was still ~10%. To be clear, in my view that’s a strong indication of authorship identification capability.
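For concreteness, here is a minimal sketch of how such a top-k authorship benchmark could be scored. The `guess_authors` function is a hypothetical stand-in for whatever model call returns a ranked list of candidate handles; it is not the prompt or setup I actually used.

```python
# Hypothetical sketch of scoring top-k authorship accuracy.
# guess_authors(tweet_text) is a stand-in for a model call that
# returns a ranked list of candidate Twitter handles.

def top_k_accuracy(examples, guess_authors, k=3):
    """examples: list of (tweet_text, true_handle) pairs."""
    hits = 0
    for text, true_handle in examples:
        ranked = guess_authors(text)          # e.g. ["@foo", "@bar", ...]
        if true_handle in ranked[:k]:
            hits += 1
    return hits / len(examples)

# Example usage with a dummy model that always guesses the same handles:
examples = [("some newly scraped tweet text", "@example_user")]
print(top_k_accuracy(examples, lambda t: ["@foo", "@example_user", "@baz"]))
```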
Note the prompt I used doesn’t actually say anything about LessWrong, but gpt-4-base assigned substantial probability only to LessWrong commenters, which is not surprising since there are all sorts of giveaways that a comment is from LessWrong in the content alone.
Filtering for people in the world who have publicly said detailed, canny things about language models and alignment, and who also lack the regularities shared by most “LLM alignment researchers” or other distinctive groups like academia, narrows the field to probably just a few people, including Gwern.
The reason truesight works better than one might naively expect is probably mostly that there are mountains of evidence everywhere, far more than one naively expects. Models don’t need to be superhuman except in breadth of knowledge to be potentially qualitatively superhuman in effects downstream of truesight-esque capabilities, because humans are simply unable to integrate the plenum of correlations.
The reason truesight works better than one might naively expect is probably mostly that there are mountains of evidence everywhere, far more than one naively expects.
Yes, long before LLMs existed, there were “detective” sites that were scary good at inferring all sorts of things about Reddit accounts, from demographics and ethnicity to financial status, based on which subreddits they posted in and (more importantly) what they posted.
Humans are leaky.
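As a toy illustration of the kind of inference those sites were doing (the subreddit names and rates below are made up for illustration, not their actual method), simple per-subreddit conditional frequencies already go surprisingly far:

```python
# Toy illustration: scoring a demographic attribute from subreddit
# participation with naive per-subreddit log-odds. The subreddits and
# rates below are invented for illustration only.
import math

# P(posts in subreddit | attribute) vs P(posts in subreddit | not attribute)
rates = {
    "personalfinance": (0.30, 0.10),
    "wallstreetbets":  (0.25, 0.15),
    "frugal":          (0.05, 0.20),
}

def log_odds(attribute_prior, user_subreddits):
    score = math.log(attribute_prior / (1 - attribute_prior))
    for sub in user_subreddits:
        if sub in rates:
            p_yes, p_no = rates[sub]
            score += math.log(p_yes / p_no)
    return score  # positive => evidence for the attribute

print(log_odds(0.5, ["personalfinance", "wallstreetbets"]))
```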