It can read images, but that seems to be a different task than reading text-based ascii figures, which it’s sort of 40⁄50 at very very roughly 20% successful at (better than I predicted, but far from perfect on more than the simplest tasks). Here are some examples, an arbitrarily chosen sample from BigBench’s MNIST ASCII task:
...And here’s some simple art, taken from https://www.asciiart.eu, which it tries (semi-successfully) to identify:
Here’s some more complex art from the same source, which it almost always fails at (note the images are distorted vertically in the ChatGPT interface, but display perfectly in a terminal, so in theory it should be readable to GPT-4):
Wait wait wait. It got 40⁄50 of these? For at best the ‘second try’ at a system with vision?
(Since training costs are so high, there is no opportunity to do many variations on this at scale. And I’m assuming a couple of Google AI researchers who had worked on PaLM’s vision shifted over and shared everything they remembered, making it the “second” iteration.)
Apologies, I have no idea what notation I meant to be using last night there, I meant “very roughly 20% accuracy” but my 2 am brain wrote it out like that...somehow lol. Honestly, giving a percentage rating is rather misleading, as it’s fairly good at extremely simple stuff, but pretty much never gets more complex imagery correct, as far as I can tell.
That sounds about right. I tried getting it to recognize some moderately complex ASCII art, and its guesses were consistently wrong. Nevertheless, its guesses were not that far off from the outlines of the images.
But it is worse at drawing shapes. I can get it to make some very basic shapes consistently, but it fails quite badly at anything more complex.
Heck, I can’t even get it to draw a pentagon. It can draw triangles and hexagons, but apparently five sides is forbidden to it. Maybe it can only draw unit cells of a 2d lattice? /s
Did you use the leaked API key, or how did you produce this? If you work for OAI you presumably would have an explanation for the limit.
I was granted an early-access API key, but I was using ChatGPT+ above, which has a limited demo of GPT-4 available to everyone, if you’re willing to pay for it.
Question: are you inputting ASCII text and asking the model to “see” it or are you inputting images of ASCII text and asking the model’s pixel input engine to “see” it?
Those are enormously different asks. The tokenizer may destroy the very patterns you are querying about.
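To make the tokenizer point concrete, here’s a toy greedy longest-match tokenizer over a tiny made-up vocabulary (this is NOT GPT-4’s actual BPE tokenizer, just an illustration of the mechanism): merge-based tokenization puts token boundaries at different screen columns on different lines, erasing the vertical alignment ASCII art depends on.

```python
# Toy greedy longest-match tokenizer over a tiny, made-up vocabulary.
# Not GPT-4's tokenizer; it only demonstrates how merge-based
# tokenization can misalign token boundaries across lines.
VOCAB = {" /", "\\ ", "/", "\\", "|", "_", " "}

def tokenize(line: str) -> list[str]:
    tokens, i = [], 0
    while i < len(line):
        for size in (2, 1):  # prefer the longest matching piece
            piece = line[i:i + size]
            if piece in VOCAB:
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(line[i])  # unknown char: fall back to itself
            i += 1
    return tokens

# Two lines of "art" whose marks sit in the same screen columns:
print(tokenize(" /\\ "))  # [' /', '\\ ']
print(tokenize("_||_"))   # ['_', '|', '|', '_']
```

The first line tokenizes into two 2-character pieces while the second falls apart into single characters, so a model reading token sequences never “sees” the columns line up.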
As a human could you see ASCII art if viewing in too narrow a terminal window for it to render properly? You couldn’t, right?
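The narrow-terminal analogy is easy to simulate (the face art and widths below are made up for illustration): hard-wrapping lines at a width narrower than the art preserves every character yet scrambles the picture.

```python
# A small ASCII face; every line is exactly 9 characters wide.
ART = [
    "  _____  ",
    " /     \\ ",
    "| () () |",
    " \\  ^  / ",
    "  |||||  ",
]

def render(lines: list[str], width: int) -> list[str]:
    """Simulate a terminal of the given width: lines longer than
    the width hard-wrap onto the next row."""
    rows = []
    for line in lines:
        for i in range(0, len(line), width):
            rows.append(line[i:i + width])
    return rows

print("\n".join(render(ART, 20)))  # wide enough: the face survives intact
print("\n".join(render(ART, 5)))   # too narrow: each 9-char line splits 5+4
```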
I am inputting ASCII text, not images of ASCII text. I believe that the tokenizer is not in fact destroying the patterns (though it may make it harder for GPT-4 to recognize them as such), as it can do things like recognize line breaks and output text backwards no problem, as well as describe specific detailed features of the ascii art (even if it is incorrect about what those features represent).
And yes, this is likely a harder task for the AI to solve correctly than it is for us, but I’ve been able to figure out improperly-formatted ascii text before by simply manually aligning vertical lines, etc.
If you think about it, the right way to “do” this would be to internally generate a terminal with the same width as the ChatGPT text window (or a standard terminal window width), then generate an image, then process it as an image.
That’s literally what you are doing when you manually align the verticals and look.
GPT-4 is not architecturally doing that; it’s missing that capability. Yet we can easily imagine a Toolformer-style version that could decide to feed the input stream to a simulated terminal, pass the rendering to a vision module, and then process the result, and that version would be able to solve it.
Without actually making the core LLM any smarter, just giving it more peripherals.
There’s a bunch of stuff like that, where you realize the underlying LLM is capable of doing it but it’s currently just missing the peripheral.
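A minimal sketch of what such a peripheral could look like (all function names here are hypothetical, and the vision module is a stub standing in for a real image model):

```python
def simulated_terminal(text: str, width: int = 80) -> list[list[str]]:
    """Render text into a grid of character cells, hard-wrapping long
    lines, the way a fixed-width terminal would."""
    grid = []
    for line in text.split("\n"):
        line = line or " "  # keep blank lines as blank rows
        for i in range(0, len(line), width):
            grid.append(list(line[i:i + width].ljust(width)))
    return grid

def vision_module(grid: list[list[str]]) -> str:
    """Stub: a real peripheral would rasterize the cell grid and run an
    image model over it. Here we only report what it received."""
    return f"image of {len(grid)}x{len(grid[0])} character cells"

def describe_ascii_art(text: str, width: int = 80) -> str:
    # The "tool call" the LLM would learn to emit: render, then look.
    return vision_module(simulated_terminal(text, width))

print(describe_ascii_art(" /\\_/\\\n( o.o )", width=10))
```

The core model would only need to learn *when* to invoke the tool; the rendering and recognition happen entirely outside it, which is the “more peripherals, not more smarts” point above.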