From my personal experience, I agree. I find myself unexcited about trying the newest LLM models. My main use-case in practice these days is Perplexity, and I only use it when I don’t care much about the accuracy of the results (which ends up being a lot, actually… maybe too much). Perplexity confabulates quite often even with accurate references in hand (but at least I can check the references). And it is worse than me at the basics of googling things, so it isn’t as if I expect it to find better references than me; the main value-add is in quickly reading and summarizing search results (although the new Deep Research option on Perplexity will at least iterate through several attempted searches, so it might actually find things that I wouldn’t have).
I have been relatively persistent about trying to use LLMs for actual research purposes, but the hallucination rate seems to go to 100% almost whenever an accurate result would be useful to me.
The hallucination rate does seem adequately low when talking about established mathematics (so long as you don’t ask for novel implications, such as applying ideas to new examples). For this and for other reasons I think they can be quite helpful for people trying to get oriented to a subfield they aren’t familiar with—it can make for a great study partner, so long as you verify what it says by checking other references.
They’re also decent for coding, of course, although the same caveat applies—coders who are already experts in what they are trying to do will get much less utility out of them.
I recently spoke to someone who made a plausible claim that LLMs were 10xing their productivity in communicating technical ideas in AI alignment with something like the following workflow:
Take a specific cluster of failure modes for thinking about alignment which you’ve seen often.
Hand-write a large, careful prompt document about the cluster of alignment failure modes, which includes many specific trigger-action patterns (if someone makes mistake X, then the correct counterspell to avoid the mistake is Y). This document is highly opinionated and would come off as rude if directly cited/quoted; it is not good communication. However, it is something you can write once and use many times.
When responding to an email/etc, load the email and the prompt document into Claude and ask Claude to respond to the email using the document. Claude will write something polite, informative, and persuasive based on the document, with maybe a few iterations of correcting Claude if its first response doesn’t make sense. The person also emphasized that things should be written in small pieces, as quality declines rapidly when Claude tries to do more at once.
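The prompt-assembly step of that workflow can be sketched in a few lines. Everything below (the tag names, the instruction wording, the sample playbook entry and email) is my own illustration of the described setup, not the person’s actual prompts:

```python
# A minimal sketch: merge a reusable, hand-written "failure modes" document
# with one incoming email into a single prompt for Claude. The resulting
# string would be pasted into Claude (or sent via an API call).

def build_prompt(playbook: str, email: str) -> str:
    """Combine the opinionated reference document with the email to answer."""
    return (
        "Below is my private reference document on common alignment "
        "failure modes, followed by an email I received. Draft a polite, "
        "informative reply to the email, drawing on the document where it "
        "applies. Keep the reply short; I will ask for revisions in small "
        "pieces.\n\n"
        f"<reference_document>\n{playbook}\n</reference_document>\n\n"
        f"<email>\n{email}\n</email>"
    )

# Example: one hypothetical trigger-action pattern from the playbook.
playbook = (
    "If someone treats the reward function as the agent's goal, point out "
    "the reward/utility distinction with a concrete example."
)
email = "Surely the agent just wants to maximize its reward function?"
prompt = build_prompt(playbook, email)
```

Keeping the playbook and the per-email prompt separate is the point: the rude-if-quoted document is written once, and only the short, polite synthesis is sent out.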
They also mentioned that Claude is awesome at coming up with meme versions of ideas to include in powerpoints and such, which is another useful communication tool.
So, my main conclusion is that there isn’t a big overlap between what LLMs are useful for and what I personally could use. I buy that there are some excellent use-cases for other people who spend their time doing other things.
Still, I agree with you that people are easily fooled into thinking these things are more useful than they actually are. If you aren’t an expert in the subfield you’re asking about, then the LLM outputs will probably look great due to Gell-Mann Amnesia type effects. When checking to see how good the LLM is, people often check the easier sorts of cases which the LLMs are actually decent at, and then wrongly generalize to conclude that the LLMs are similarly good for other cases.