From the o1 blog post (evidence of how results were presented there, though the methodology here is not necessarily the same):
o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples.
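For anyone unsure how the two numbers in that caption differ, here is a rough sketch of pass@1 versus majority-vote (consensus) scoring for a single problem. The sample answers and the helper names `pass_at_1` and `consensus_answer` are made up for illustration; this is not OpenAI's evaluation code, and the real benchmark uses 64 samples per problem rather than 8.

```python
from collections import Counter

def pass_at_1(samples, correct):
    """Fraction of individual samples that match the correct answer."""
    return sum(s == correct for s in samples) / len(samples)

def consensus_answer(samples):
    """Most common answer among the samples (majority vote)."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical final answers sampled for one problem (illustrative only).
samples = ["42", "42", "17", "42", "8", "17", "42", "3"]
correct = "42"

p1 = pass_at_1(samples, correct)               # per-sample accuracy
cons = consensus_answer(samples) == correct    # is the voted answer right?
print(p1, cons)
```

In this toy case only half of the individual samples are right, but the vote still lands on the correct answer, which is why the shaded consensus region sits above the solid pass@1 bar.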
Presumably light blue is o3 high, and dark blue is o3 low?
I think they only have formal high and low versions for o3-mini.
Edit: never mind, I don't know.