Papers that I think make some relevant points:
LLMs, without calibration fine-tuning, tend to be quite overconfident.
https://arxiv.org/abs/2306.13063 Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
https://fse.studenttheses.ub.rug.nl/32044/ Confidence is Key: Uncertainty Estimation in Large Language Models and Vision Language Models
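To make "overconfident" concrete, here is a minimal sketch of how this kind of claim is usually quantified: elicit a stated confidence alongside each answer, grade the answers, and compare average confidence against accuracy (e.g. with expected calibration error). The data and the Record container below are entirely hypothetical; this is a generic illustration of the style of evaluation these papers run, not code from either of them.

```python
# Minimal sketch: quantifying overconfidence from verbalized confidences.
# The (confidence, correct) records are hypothetical; in practice they would
# come from prompting a model to state a 0-1 confidence with each answer and
# then grading the answers against ground truth.
from dataclasses import dataclass

@dataclass
class Record:
    confidence: float  # model's self-reported probability of being correct
    correct: bool      # whether the graded answer was actually correct

# Hypothetical elicited data showing a typical overconfident pattern:
# stated confidences cluster near 0.9 while accuracy is much lower.
records = [
    Record(0.95, True), Record(0.90, False), Record(0.90, True),
    Record(0.85, False), Record(0.95, False), Record(0.80, True),
    Record(0.90, False), Record(0.85, True), Record(0.95, False),
    Record(0.90, True),
]

def expected_calibration_error(records, n_bins=5):
    """Standard binned ECE: per-bin |accuracy - mean confidence|,
    weighted by the fraction of samples falling in each bin."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(r.correct for r in b) / len(b)
        conf = sum(r.confidence for r in b) / len(b)
        ece += (len(b) / len(records)) * abs(acc - conf)
    return ece

mean_conf = sum(r.confidence for r in records) / len(records)
accuracy = sum(r.correct for r in records) / len(records)
print(f"mean stated confidence: {mean_conf:.2f}")  # ~0.90
print(f"accuracy:               {accuracy:.2f}")   # 0.50
print(f"ECE:                    {expected_calibration_error(records):.2f}")
```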
There is a variety of ongoing work on post-training modifications to make LLMs better calibrated. It seems hard to say, without comparison studies, which of the many different proposed techniques (or combinations thereof) might work best. It does seem likely to me that this problem will lessen over time. A rough sketch of what calibrating elicited confidences can look like follows the list below.
Here are a few examples, but there are many more out there:
https://arxiv.org/abs/2403.05973 Calibrating Large Language Models Using Their Generations Only
https://arxiv.org/abs/2402.06544 Calibrating Long-form Generations from Large Language Models
https://arxiv.org/html/2404.09127v3 Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation
https://arxiv.org/abs/2404.19318 Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores
https://openreview.net/forum?id=jH67LHVOIO LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses
https://arxiv.org/abs/2404.02655 Calibrating the Confidence of Large Language Models by Eliciting Fidelity
https://arxiv.org/abs/2404.04689 Multicalibration for Confidence Scoring in LLMs
https://arxiv.org/abs/2402.04957 Reconfidencing LLMs from the Grouping Loss Perspective
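As a companion to the list above, here is a minimal sketch of one simple, generic post-hoc calibration step: histogram binning fit on a held-out calibration set. It is not the method of any particular paper listed (most propose more involved techniques), and the data is hypothetical; it only illustrates the basic idea of remapping a model's raw stated confidence toward the accuracy actually observed at that confidence level.

```python
# Minimal sketch of generic post-hoc calibration via histogram binning:
# map a raw stated confidence to the empirical accuracy of held-out
# calibration examples whose raw confidence fell in the same bin.
# All data below is hypothetical.
from collections import defaultdict

def fit_histogram_binning(confidences, corrects, n_bins=10):
    """Return a function mapping a raw confidence in [0, 1] to the
    empirical accuracy of calibration examples in the same bin."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for c, ok in zip(confidences, corrects):
        b = min(int(c * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(ok)

    def calibrate(c):
        b = min(int(c * n_bins), n_bins - 1)
        if totals[b] == 0:
            return c  # no calibration data in this bin; pass through
        return hits[b] / totals[b]

    return calibrate

# Hypothetical calibration set: the model states ~0.9 confidence but is
# right only ~60% of the time overall.
cal_conf = [0.92, 0.88, 0.95, 0.91, 0.87, 0.93, 0.90, 0.89, 0.94, 0.86]
cal_correct = [True, False, True, True, False, False, True, True, False, True]

calibrate = fit_histogram_binning(cal_conf, cal_correct)
print(calibrate(0.90))  # ~0.67, the observed accuracy in the [0.9, 1.0) bin
```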