Papers that I think make some relevant points:
LLMs, without calibration fine-tuning, tend to be quite overconfident.
https://arxiv.org/abs/2306.13063 Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
https://fse.studenttheses.ub.rug.nl/32044/ Confidence is Key: Uncertainty Estimation in Large Language Models and Vision Language Models
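To make "overconfident" concrete, here is a minimal sketch of how this kind of claim is usually quantified: elicit a stated confidence alongside each answer, grade the answers, and compare average confidence against accuracy (e.g. with expected calibration error). The data and the Record container below are entirely hypothetical; this is a generic illustration of the style of evaluation these papers run, not code from either of them.

```python
# Minimal sketch: quantifying overconfidence from verbalized confidences.
# The (confidence, correct) records are hypothetical; in practice they would
# come from prompting a model to state a 0-1 confidence with each answer and
# then grading the answers against ground truth.
from dataclasses import dataclass

@dataclass
class Record:
    confidence: float  # model's self-reported probability of being correct
    correct: bool      # whether the graded answer was actually correct

# Hypothetical elicited data showing a typical overconfident pattern:
# stated confidences cluster near 0.9 while accuracy is much lower.
records = [
    Record(0.95, True), Record(0.90, False), Record(0.90, True),
    Record(0.85, False), Record(0.95, False), Record(0.80, True),
    Record(0.90, False), Record(0.85, True), Record(0.95, False),
    Record(0.90, True),
]

def expected_calibration_error(records, n_bins=5):
    """Standard binned ECE: per-bin |accuracy - mean confidence|,
    weighted by the fraction of samples falling in each bin."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(r.correct for r in b) / len(b)
        conf = sum(r.confidence for r in b) / len(b)
        ece += (len(b) / len(records)) * abs(acc - conf)
    return ece

mean_conf = sum(r.confidence for r in records) / len(records)
accuracy = sum(r.correct for r in records) / len(records)
print(f"mean stated confidence: {mean_conf:.2f}")  # ~0.90
print(f"accuracy:               {accuracy:.2f}")   # 0.50
print(f"ECE:                    {expected_calibration_error(records):.2f}")
```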
There is a variety of ongoing work on post-training modifications to make LLMs better calibrated. It seems hard to say, without comparison studies, which of the many different proposed techniques (or combinations thereof) might work best. It does seem likely to me that this problem will lessen over time. A rough sketch of what calibrating elicited confidences can look like follows the list below.
Here are a few examples, but there are many more out there:
https://arxiv.org/abs/2403.05973 Calibrating Large Language Models Using Their Generations Only
https://arxiv.org/abs/2402.06544 Calibrating Long-form Generations from Large Language Models
https://arxiv.org/html/2404.09127v3 Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation
https://arxiv.org/abs/2404.19318 Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores
https://openreview.net/forum?id=jH67LHVOIO LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses
https://arxiv.org/abs/2404.02655 Calibrating the Confidence of Large Language Models by Eliciting Fidelity
https://arxiv.org/abs/2404.04689 Multicalibration for Confidence Scoring in LLMs
https://arxiv.org/abs/2402.04957 Reconfidencing LLMs from the Grouping Loss Perspective
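As a companion to the list above, here is a minimal sketch of one simple, generic post-hoc calibration step: histogram binning fit on a held-out calibration set. It is not the method of any particular paper listed (most propose more involved techniques), and the data is hypothetical; it only illustrates the basic idea of remapping a model's raw stated confidence toward the accuracy actually observed at that confidence level.

```python
# Minimal sketch of generic post-hoc calibration via histogram binning:
# map a raw stated confidence to the empirical accuracy of held-out
# calibration examples whose raw confidence fell in the same bin.
# All data below is hypothetical.
from collections import defaultdict

def fit_histogram_binning(confidences, corrects, n_bins=10):
    """Return a function mapping a raw confidence in [0, 1] to the
    empirical accuracy of calibration examples in the same bin."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for c, ok in zip(confidences, corrects):
        b = min(int(c * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(ok)

    def calibrate(c):
        b = min(int(c * n_bins), n_bins - 1)
        if totals[b] == 0:
            return c  # no calibration data in this bin; pass through
        return hits[b] / totals[b]

    return calibrate

# Hypothetical calibration set: the model states ~0.9 confidence but is
# right only ~60% of the time overall.
cal_conf = [0.92, 0.88, 0.95, 0.91, 0.87, 0.93, 0.90, 0.89, 0.94, 0.86]
cal_correct = [True, False, True, True, False, False, True, True, False, True]

calibrate = fit_histogram_binning(cal_conf, cal_correct)
print(calibrate(0.90))  # ~0.67, the observed accuracy in the [0.9, 1.0) bin
```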