Great work here, but I do feel that the only observations that matter in practice are those about reasoning. To the extent that obtaining visual information is the problem, I think the current design of language models is just not representative of how this task would be implemented in real robotics applications, for at least two reasons:
The model is not using anywhere near all of the information about an image that it could. Language models that accept image data are just accepting an embedding that is far smaller (in an information-theoretic sense) than either the original image (probably) or the activation space of a large independent vision model (definitely); a rough back-of-envelope comparison is sketched after these two points.
Even if we accept the constraint of compressing the image into an embedding and dropping it inline with the text tokens, as these models mostly do, a real system attached to the physical machinery capturing the images would have the opportunity to consume many of them, and could eventually gather far more data than a single embedding provides.
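To make the first point concrete, here is a rough back-of-envelope comparison in Python. All the numbers (camera resolution, token count, embedding width) are illustrative assumptions about a typical setup, not the specification of any particular model:

```python
# Rough, illustrative comparison of raw image size vs. the vision-token
# budget a language model actually receives. All numbers are assumptions.

# Assumed raw capture: a 12 MP camera frame, 8-bit RGB, uncompressed.
raw_bytes = 4032 * 3024 * 3                 # ~36.6 MB

# Assumed vision-token budget: 576 tokens (a 24x24 patch grid) projected
# into a 4096-wide embedding, stored as 16-bit floats.
embed_bytes = 576 * 4096 * 2                # ~4.7 MB of nominal capacity

print(f"raw frame:        {raw_bytes / 1e6:.1f} MB")
print(f"embedding budget: {embed_bytes / 1e6:.1f} MB (nominal)")
print(f"ratio:            {raw_bytes / embed_bytes:.1f}x")

# Note the nominal capacity overstates the usable information: the embedding
# comes out of a lossy encoder, so what it carries is bounded by what that
# encoder preserves, and fine geometric detail is easily lost. The activation
# space of a large standalone vision model (hundreds of tokens across dozens
# of layers) is larger still, which is the "definitely" above.
```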
——
I believe a hypothetically “real” system of this kind would almost always be implemented as a powerful vision model scaffolded onto a language model, which would direct it as a tool to perform the analysis progressively, operating on the data in a manner very different from the one-shot approach here (a minimal sketch of that loop is below). While these experiments are very helpful, I personally am not updating much toward this being very hard to pull off, except where the evidence concerns the reasoning that happens after the image data has been understood correctly (which will always be important, even if vision is provided as an iterative tool of much higher bandwidth).
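As a concrete illustration of that scaffold, here is a minimal sketch of the loop I have in mind. Every function in it (capture_image, vision_model_describe, llm_next_action) is a hypothetical stand-in rather than a real API; the point is only the shape of the control flow:

```python
# Minimal sketch: a language model driving a separate vision model as a tool,
# iteratively requesting new views instead of consuming one fixed embedding.
# All three helpers below are hypothetical stubs, not real APIs.

def capture_image(camera, pan=0.0, tilt=0.0, zoom=1.0):
    """Hypothetical: command the physical rig to capture a new frame."""
    ...

def vision_model_describe(image, question):
    """Hypothetical: query a dedicated vision model about one frame."""
    ...

def llm_next_action(transcript):
    """Hypothetical: ask the language model what to do next.
    Returns ('look', params, question) or ('answer', text)."""
    ...

def analyze_part(camera, task, max_steps=10):
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = llm_next_action(transcript)
        if action[0] == "answer":
            return action[1]                        # final reasoning answer
        _, params, question = action
        frame = capture_image(camera, **params)     # fresh data each step,
        obs = vision_model_describe(frame, question)  # not a one-shot embedding
        transcript.append(f"Viewed {params}: {obs}")
    return "inconclusive"
```

The relevant property is that the information bandwidth is no longer capped by a single embedding: the loop can keep pulling new frames and new descriptions until the language model has what it needs.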
I don’t think image understanding is the bottleneck. O3 and O4-mini-high seem like a meaningful improvement in vision, to the point that it’s almost good enough for this part, but they still fail miserably at the physical reasoning.
This person got O4-mini-high to generate a reasonably close depiction of the part:
https://x.com/tombielecki/status/1912913806541693253