I also tested o3, and it looks better than Gemini 2.5 at vision. Although it missed the second flat, it correctly identified that the ends had different diameters, and it picked up on some genuinely impressive details, like the grooved thread relief behind the larger thread.
However, it’s still terrible at spatial reasoning, which makes me more confident in the argument in my post. It proposes many egregious, physically impossible operations. For example, it recommends enclosing 2.2 inches of the part in the collet and then facing the part down to the finished length of 2.000 inches. This is obviously impossible, as the finished end face would be buried 0.2 inches inside the collet. It also makes bizarre decisions, like clamping on the threads for the second lathe op, when the main diameter is obviously a much better location for rigidity and simplicity. It does correctly identify the chatter issue, FWIW.
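The collet arithmetic can be made concrete with a minimal Python sketch. The function names are hypothetical, and the sketch assumes the gripped length is measured from the rear end of the part, so the finished end face sits at the finished length from that same end.

```python
def face_offset_from_collet(finished_length_in: float, gripped_length_in: float) -> float:
    """Distance in inches from the collet face to the finished end face.

    Assumption (illustrative): the gripped length is measured from the
    rear end of the part. A positive result means the finished face sits
    outside the collet; a negative result means it sits inside the collet.
    """
    return finished_length_in - gripped_length_in


def can_face_to_length(finished_length_in: float, gripped_length_in: float) -> bool:
    """True only if the finished end face lies outside the collet."""
    return face_offset_from_collet(finished_length_in, gripped_length_in) > 0.0


# The setup proposed in the plan: 2.2 in gripped, 2.000 in finished length.
offset = face_offset_from_collet(2.000, 2.2)
print(f"offset from collet face: {offset:+.3f} in")
print("reachable" if can_face_to_length(2.000, 2.2) else "not reachable")
```

A negative offset means the tool would have to cut inside the collet, which is the 0.2 inch problem described above. A real setup would also need some positive clearance between the collet face and the cut, which this sketch ignores.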
It feels a bit worse than Gemini’s plan overall, but this is hard to evaluate. It’s basically “here are two plans with multiple egregious errors, which one is worse?”. I’ve also noticed that basically any time I ask an LLM for more specific details on a high-level part of the plan that looks reasonable, it begins to make many egregious errors. So how bad a plan looks depends to a large extent on how much detail the LLM goes into.
I don’t think image understanding is the bottleneck. o3 and o4-mini-high seem like a meaningful improvement in vision, to the point where it’s almost good enough for this part, but they still fail miserably at the physical reasoning aspects.
This person got o4-mini-high to generate a reasonably close image of the part.
https://x.com/tombielecki/status/1912913806541693253