I appreciate that you’ve at least tried to engage in good faith this time. But your reply still falls apart under scrutiny.
> I give ChatGPT a C- on reading comprehension.
That’s cherry-picking. You found one sentence in an otherwise structured, balanced evaluation and used it to discredit the entire process. That’s not analysis. That’s avoidance.
> I definitely advice against going to LLMs for social validation.
I didn’t. I used it as a neutral tool—to evaluate reasoning without community bias. That’s why I invited anyone to run the exchange through an LLM of their choosing. You’re now the only person who took that challenge. And ironically, you did so in a way that confirmed everything I’ve been saying.
Claude isn’t a neutral tool. It weights ideas by their social standing, just as LW does. It over-rewards citations, defers to existing literature, and rewards in-group fluency, just like LW. It judges ideas less on their internal logic (compared with GPT-4) and more on which references they cite to reach their conclusions. It also favours softened tone and penalises bluntness, which is why it heavily penalised my logic score in your evaluation for essentially being ‘too certain’, something it felt the need to mention twice across four points.
I also ran it through Claude (using an account with no connection to my LessWrong account, not even the name Franco), and I had to do it twice because the first result was so different from yours. Using my prompt (included in the text), the scores were radically different, but it still put me on top. Here.
So I went back and used your prompt; this was the result: here.
Your run gave you 40/50 and me 31/50.
My run, using my prompt, gave me 37/50 and you 33/50.
My run using your prompt, to try to replicate your results, gave me 35/50 and you 34/50.
Clearly my results are quite different from yours. Why? Because Claude weighs in-group signals. When you right-click and save the page as a PDF for upload, it captures the whole page, including the votes that the essay and the comments received. Claude weighs those signals in your favour, just as it counted them against me.
The PDFs I uploaded, by contrast, are just raw text, and I actually deleted the votes from the PDF containing the debate (go check if it pleases you). I did this specifically to remove bias from the judgment, which is why I immediately noticed that you did not do the same.
The irony is that even your result contradicts your original claim, that my essay was not worth engaging with.
> it included valuable clarifications about AI safety discourse and community reception dynamics
So then it is worth engaging with. And your whole point, that it wasn’t worth engaging with and that this is why no one did, has now been contradicted by four separate LLMs, including the biased run you did yourself.
You also missed the entire point of this post, which was not about my original essay. It was about whether you had engaged in a good-faith debate, which you did not. Even Claude (my output, not yours) had to mention your strawmanning of my argument, as nice as Claude tries to be about these things.
I have to admit, the difference in our scores was confusing for a moment. But as soon as I remembered that I had removed the karma from the PDF of our debate that I uploaded for evaluation, specifically to avoid producing a biased result, and checked whether you had somehow included it in your upload, I found it right away. Maybe you didn’t do it intentionally, but you did it regardless, and it skewed your results predictably.
If you’re serious about engaging honestly, then try GPT-4. Use the clean PDF I already provided: no karma scores, no formatting bias. I did say in my post that I invite anyone to use their favourite LLM, but to recreate lab conditions, perhaps only GPT is viable.
You could even go one step further: add the context I gave GPT after its first evaluation, the one that caused your score to drop significantly. Then post those results.
Not as long as it has taken you to post your (now five) comments without honest engagement.