Lech Mazur comments on GPT-o1

Lech Mazur 10 Oct 2024 19:52 UTC
I included o1-preview and o1-mini in a new hallucination benchmark that pairs provided text documents with deliberately misleading questions. While o1-preview ranks as the top-performing single model, o1-mini’s results are somewhat disappointing. A popular existing leaderboard on GitHub relies on a highly inaccurate model-based evaluation of document summarization.

The chart above isn’t very informative without the non-response rate for these documents, which I’ve also calculated:
The GitHub page has further notes.
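
To illustrate why the two numbers need to be read together, here is a minimal sketch. It assumes, purely for illustration, that the hallucination rate is the share of misleading questions (no answer in the documents) where the model answered anyway, and that the non-response rate is the share of answerable questions where the model declined. The `Result` type and `rates` function are hypothetical names, not taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class Result:
    answerable: bool  # True if the documents actually contain the answer
    responded: bool   # True if the model gave an answer instead of declining


def rates(results):
    """Return (hallucination_rate, non_response_rate) under the assumed definitions."""
    misleading = [r for r in results if not r.answerable]
    answerable = [r for r in results if r.answerable]
    # Assumed definition: answering a question the documents cannot answer.
    hallucination = (
        sum(r.responded for r in misleading) / len(misleading) if misleading else 0.0
    )
    # Assumed definition: declining a question the documents can answer.
    non_response = (
        sum(not r.responded for r in answerable) / len(answerable) if answerable else 0.0
    )
    return hallucination, non_response


# A model that declines everything never hallucinates, but is useless.
always_declines = [
    Result(answerable=False, responded=False),
    Result(answerable=False, responded=False),
    Result(answerable=True, responded=False),
    Result(answerable=True, responded=False),
]
print(rates(always_declines))
```

Under these assumed definitions, a model that refuses every question scores a perfect hallucination rate while failing every answerable question, which is why a hallucination chart alone says little.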