Mostly faster benchmark performance than I expected (see Ajeya’s comment here), and o3 (and o1) being evidence that RL training works at scale and can plausibly scale much further.