You are skipping over a very important component: Evaluation.
That is exactly what we don’t know how to do well enough outside of formally verifiable domains like math and code, and those are exactly the domains where o1 shows big performance jumps.