o3-mini is out (blogpost, tweet). Performance isn't super noteworthy (at first glance), in part since we already knew about o3 performance.
Non-fact-checked quick takes on the system card:
"the model referred to below as the o3-mini post-mitigation model was the final model checkpoint as of Jan 31, 2025 (unless otherwise specified)"
Big if true (and if Preparedness had time to do elicitation and fix spurious failures)
If this is robust to jailbreaks, great, but presumably it's not, so low post-mitigation performance is far from sufficient for safety-from-misuse; post-mitigation performance isn't what we care about (what we care about is approximately what good jailbreakers can elicit from the post-mitigation model).
No mention of METR, Apollo, or even US AISI? (Maybe too early to pay much attention to this, e.g. maybe there’ll be a full-o3 system card soon.) [Edit: also maybe it’s just not much more powerful than o1.]
32% is a lot
The dataset is 1⁄4 T1 (easier), 1⁄2 T2, 1⁄4 T3 (harder); 28% on T3 means that there's not much difference between T1 and T3 for o3-mini (at least on the easiest-for-LMs quarter of T3)
Probably o3-mini is successfully using heuristics to get the right answer and could solve very few T3 problems in a deep or humanlike way
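(Rough arithmetic behind that comparison, assuming the 32% headline figure, the 28% T3 figure, and the 1⁄4 / 1⁄2 / 1⁄4 tier mix stated above:)

```python
# Back out the implied accuracy on the T1+T2 portion of the dataset,
# given an assumed 32% overall accuracy, 28% on T3, and T3 being 1/4 of problems.
overall = 0.32
t3_acc = 0.28
t3_weight = 0.25

# overall = t3_weight * t3_acc + (1 - t3_weight) * t1_t2_acc, solved for t1_t2_acc
t1_t2_acc = (overall - t3_weight * t3_acc) / (1 - t3_weight)
print(f"{t1_t2_acc:.3f}")  # prints 0.333
```

So the easier three-quarters of the dataset comes out around 33%, barely above the 28% on T3, which is the sense in which the tiers don't differ much for o3-mini.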
Rushed because of DeepSeek?
Similar opinion here, also noting they didn’t run red-teaming and persuasion evals on the actually-final-version:
https://x.com/teortaxesTex/status/1885401111659413590
Asking for this is a bit pointless, since even after the actually-final-version there will be a next update for which non-automated evals won’t be redone, so it’s equally reasonable to do non-automated evals only on some earlier version rather than the actually-final one.