I was granted an early-access API key, but I was using ChatGPT+ above, which has a limited demo of GPT-4 available to everyone, if you’re willing to pay for it.
Question: are you inputting ASCII text and asking the model to “see” it or are you inputting images of ASCII text and asking the model’s pixel input engine to “see” it?
Those are enormously different asks. The tokenizer may destroy the very patterns you are querying about.
As a human could you see ASCII art if viewing in too narrow a terminal window for it to render properly? You couldn’t, right?
I am inputting ASCII text, not images of ASCII text. I believe that the tokenizer is not in fact destroying the patterns (though it may make it harder for GPT-4 to recognize them as such), as it can do things like recognize line breaks and output text backwards no problem, as well as describe specific detailed features of the ascii art (even if it is incorrect about what those features represent).
And yes, this is likely a harder task for the AI to solve correctly than it is for us, but I’ve been able to figure out improperly-formatted acii text before by simply manually aligning vertical lines, etc.
if you think about it, the right way to “do” this would be to internally generate a terminal with the same width as the chatGPT text window or a standard terminal window width, then generate an image, then process it as an image.
That’s literally what you are doing when you manually align the verticals and look.
GPT-4 is not architecturally doing that, it’s missing that capability yet we can trivially see a toolformer version of it that could decide to feed the input stream to a simulated terminal then feed that to a vision module and then process that would be able to solve it.
Without actually making the core llm any smarter, just giving it more peripherals.
A bunch of stuff like that, you realize the underlying llm is capable of doing it but it’s currently just missing the peripheral.
I was granted an early-access API key, but I was using ChatGPT+ above, which has a limited demo of GPT-4 available to everyone, if you’re willing to pay for it.
Question: are you inputting ASCII text and asking the model to “see” it or are you inputting images of ASCII text and asking the model’s pixel input engine to “see” it?
Those are enormously different asks. The tokenizer may destroy the very patterns you are querying about.
As a human could you see ASCII art if viewing in too narrow a terminal window for it to render properly? You couldn’t, right?
I am inputting ASCII text, not images of ASCII text. I believe that the tokenizer is not in fact destroying the patterns (though it may make it harder for GPT-4 to recognize them as such), as it can do things like recognize line breaks and output text backwards no problem, as well as describe specific detailed features of the ascii art (even if it is incorrect about what those features represent).
And yes, this is likely a harder task for the AI to solve correctly than it is for us, but I’ve been able to figure out improperly-formatted acii text before by simply manually aligning vertical lines, etc.
if you think about it, the right way to “do” this would be to internally generate a terminal with the same width as the chatGPT text window or a standard terminal window width, then generate an image, then process it as an image.
That’s literally what you are doing when you manually align the verticals and look.
GPT-4 is not architecturally doing that, it’s missing that capability yet we can trivially see a toolformer version of it that could decide to feed the input stream to a simulated terminal then feed that to a vision module and then process that would be able to solve it.
Without actually making the core llm any smarter, just giving it more peripherals.
A bunch of stuff like that, you realize the underlying llm is capable of doing it but it’s currently just missing the peripheral.