UBI would probably work better in Kenya than in the US. When people are in extreme poverty, funds go toward pressing needs, but once basic needs are met, extra funds are probably more likely to go toward more leisure time. People in Kenya generally have more low-hanging fruit available, such as buying a motorcycle or renovating a home, whereas in the US, major life-changing purchases probably cost too much for UBI to cover, and the income might instead be spent on something like alcohol.
Made a small, quick website showing GPQA benchmark scores plotted against LLM inference cost, at https://ai-benchmark-price.glitch.me/. See how much you get for your buck.
Most benchmark data is from Epoch AI, except for those marked “not verified”, which I got from the model developer. Pricing data is from OpenRouter.
All the LLMs on this graph which are on the Pareto frontier of performance vs price were released December 2024 or later...
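For anyone who wants to check the frontier themselves, here is a minimal sketch of how a performance-vs-price Pareto frontier could be computed. The model names, prices, and scores are made-up placeholders (not data from the site), and `Model` and `pareto_frontier` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    price: float  # cost per million tokens in dollars (placeholder values)
    score: float  # benchmark accuracy in percent (placeholder values)


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep a model only if it scores strictly higher than every model
    that costs the same or less."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to priciest; at equal price, the best scorer comes first.
    for m in sorted(models, key=lambda m: (m.price, -m.score)):
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Made-up example data, purely for illustration.
models = [
    Model("model-a", 0.5, 40.0),
    Model("model-b", 1.0, 35.0),   # pricier and worse than model-a
    Model("model-c", 2.0, 60.0),
    Model("model-d", 10.0, 55.0),  # pricier and worse than model-c
    Model("model-e", 15.0, 70.0),
]
print([m.name for m in pareto_frontier(models)])
# ['model-a', 'model-c', 'model-e']
```

Sorting by price first means a single pass is enough: a model is dominated exactly when something cheaper (or equally priced) has already matched or beaten its score.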
Was the Renaissance caused by the new elite class, the merchants, focusing more on pleasure and having fun, in contrast to the lords, who focused more on status and power?
Hmm. No, but only because what you describe is a massive oversimplification of what was actually going on? (Not a historian, though.) In the 1400s in Western Europe, kings were gaining power over their feudal lords, hence less infighting between local lords. That does give more time for pleasure, or at least more opportunities to have fancy houses instead of fortresses. There had also been the Crusades a couple of centuries before, which brought back knowledge from, e.g., Ancient Greece that had only been preserved in the Middle East: that brings new forms of art, new knowledge, etc. An actual historian might even want to say something about more centralised governments needing more bureaucrats, hence more people who can write and think about politics and philosophy? And of course, merchants were on the rise compared to kings and lords. But I’m not sure why they specifically, as a class, would have focused more on pleasure than on status?
I’m not sure what actually constitutes the Renaissance. Is it just an art movement, or does it describe the totality of what was happening in European courts at the time? Is it just a propagandistic term? However, two major trends associated with it, linear perspective in painting and the rediscovery of Greco-Latin literature, are both at least partly indebted to developments in the Middle East.
The Book of Optics by Ibn al-Haytham appears to have been particularly important to developments in painting and to the understanding of how light travels. It rejects the emission theory of optics (rays come from the eyes) in favour of the intromission theory, in which light bounces off objects before entering the eye. It was translated into Latin in the late 12th century. I would surmise that it had at least some influence on the popularity and use of linear perspective in Renaissance art.
Greco-Latin literature was preserved, albeit in various translated forms, across the Islamic world, where it was highly popular. As Wikipedia puts it:
The line between Greek scholarship and Arab scholarship in Western Europe was very blurred during the Middle Ages and the Early Modern Period. Sometimes the concept of the transmission of Greek Classics is often used to refer to the collective knowledge that was obtained from the Arab and Byzantine Empires, regardless of where the knowledge actually originated.
It is important to note that, like the Renaissance itself, this was not a single catalytic moment but a set of serial and parallel transmissions that happened over a number of centuries. Most interestingly, at first these texts arrived in Europe as translations from an intermediary language like Syriac or Arabic. A Greek classic may have reached early modern Europeans only after being translated into Syriac, then Arabic, and finally into Latin.
Andalusian scholars began translating from Islamic sources from at least the early 10th century. Gerard of Cremona (1114-1187) set out to learn Arabic so he could read Ptolemy’s Almagest, and later translated works of Aristotle, Euclid, Jabir ibn Aflah and Al-Khwarizmi. The Fourth Crusade (1202-1204) eventually enabled the Flemish scholar Willem van Moerbeke to come into contact with and translate works of Aristotle, Hero of Alexandria, and Archimedes.
I assume these developments culminated in the artistic trademarks of the Renaissance.
People are better defined by smaller categories.
If someone is part of both a large category and a small category that usually don’t overlap, it’s likely that they are an outlier in the large category, not a representative member.
For example, if someone is both a rationalist and a Muslim, you should expect them to be much more similar to a typical rationalist than to a typical Muslim, and they may not be very good at representing Muslims in general to a rationalist audience.
Epoch AI’s new evaluation for Gemini 2.5 Flash Preview is broken.
On their AI Benchmarking dashboard, the newest Gemini 2.5 Flash model is listed as having an accuracy of 4% ± 0.71% on GPQA Diamond, even though Google’s official announcement lists it at over 80% and GPQA is a multiple-choice test with 4 options, so random guessing alone would score about 25%.
The cause is formatting. Helpfully, Epoch provides the logs from the evaluation, and the model simply hasn’t been responding in the correct format.
For example, if you look at the first sample from the logs, the correct answer is listed as “B”, but the model answered in LaTeX, $\boxed{B}$, so it was scored incorrect. There are plenty of other examples like this.
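To illustrate how this kind of mismatch can arise, here is a rough sketch contrasting a strict answer matcher with a more lenient one. This is not Epoch’s actual scorer: the `ANSWER: <letter>` format, the pattern names, and the sample response are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical strict format: a line such as "ANSWER: B". This is an assumption,
# not necessarily the format the real prompt asks for.
STRICT = re.compile(r"ANSWER:\s*([A-D])\s*$", re.MULTILINE)

# Lenient variant: also accept the letter wrapped in LaTeX, e.g. $\boxed{B}$.
LENIENT = re.compile(
    r"ANSWER:\s*\$?\s*(?:\\boxed\{\s*)?(?:\\text\{\s*)?([A-D])\b"
)


def extract(pattern: re.Pattern, response: str) -> Optional[str]:
    """Return the last answer letter that matches `pattern`, or None."""
    matches = pattern.findall(response)
    return matches[-1] if matches else None


def score(pattern: re.Pattern, response: str, target: str) -> bool:
    """True if the extracted letter equals the target letter."""
    return extract(pattern, response) == target


# Made-up response in the style described above.
response = "I think it is the second option.\n" + r"ANSWER: $\boxed{B}$"
print(score(STRICT, response, "B"))   # False
print(score(LENIENT, response, "B"))  # True
```

The lenient version stops measuring whether the model followed the formatting instructions, which is the trade-off between the two kinds of scorer.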
[I work at Epoch AI]
Thanks for your comment, I’m happy you found the logs helpful! I wouldn’t call the evaluation broken—the prompt clearly states the desired format, which the model fails to follow. We mention this in our Methodology section and FAQ (“Why do some models underperform the random baseline?”), but I think we’re also going to add a clarifying note about it in the tooltip.
While I do think “how well do models respect the formatting instructions in the prompt” is also valuable to know, I agree that I’d want to disentangle that from “how good are models at reasoning about the question”. Adding a second, more flexible scorer (likely based on an LLM-judge, like we have for OTIS Mock AIME) is in our queue, we’re just pretty strapped on engineering capacity at the moment :)
ETA: since it’s particularly extreme in this case I plan to hide this evaluation until we have the new scorer added
@Jsevillamol
A helpful page to see and subscribe to all 31 Substack writers (out of 122 total) who were invited to LessOnline: https://lessonline2025invitedlist.substack.com/recommendations