I am inputting ASCII text, not images of ASCII text. I believe the tokenizer is not in fact destroying the patterns (though it may make them harder for GPT-4 to recognize as such): it can recognize line breaks and output text backwards with no problem, and it can describe specific, detailed features of the ASCII art (even if it is incorrect about what those features represent).
And yes, this is likely a harder task for the AI to solve correctly than it is for us, but I’ve been able to figure out improperly formatted ASCII text before simply by manually aligning vertical lines, etc.
If you think about it, the right way to “do” this would be to internally generate a terminal with the same width as the ChatGPT text window (or a standard terminal width), render the text into an image, and then process it as an image.
That’s literally what you are doing when you manually align the verticals and look.
GPT-4 is not architecturally doing that; it’s missing that capability. Yet we can easily imagine a Toolformer-style version of it that could decide to feed the input stream to a simulated terminal, pass the rendered result to a vision module, and process that; such a version would be able to solve it.
Without actually making the core LLM any smarter, just giving it more peripherals.
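As a rough sketch of what the rendering half of that peripheral might look like (purely illustrative: the font path and the vision-module call are assumptions, not anything GPT-4 actually exposes), the “simulated terminal” step is only a few lines of Pillow:

```python
# Minimal sketch: lay the raw ASCII input out on a fixed-width "terminal",
# rasterize it with a monospace font, and hand the image to a vision model.
# The vision call at the bottom is hypothetical; the rendering uses Pillow.
from PIL import Image, ImageDraw, ImageFont

def render_ascii_to_image(text: str, columns: int = 80,
                          font_path: str = "DejaVuSansMono.ttf",  # assumed to be installed
                          font_size: int = 14) -> Image.Image:
    """Wrap the text to a fixed column width and rasterize it as an image."""
    font = ImageFont.truetype(font_path, font_size)

    # Hard-wrap each input line at the terminal width so vertical alignment
    # is preserved exactly as it would be on screen.
    lines = []
    for raw_line in text.splitlines():
        while len(raw_line) > columns:
            lines.append(raw_line[:columns])
            raw_line = raw_line[columns:]
        lines.append(raw_line)

    # Measure one character cell; with a monospace font every cell is equal.
    left, top, right, bottom = font.getbbox("M")
    char_w, char_h = right - left, (bottom - top) + 4

    img = Image.new("RGB", (columns * char_w, max(1, len(lines)) * char_h), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((0, i * char_h), line, fill="black", font=font)
    return img

# Usage: rasterize the ASCII art, then pass the image to whatever vision
# module the tool-using model has access to (hypothetical API below).
# img = render_ascii_to_image(ascii_art, columns=80)
# description = vision_module.describe_image(img)  # hypothetical call
```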
There’s a bunch of stuff like that, where you realize the underlying LLM is capable of doing it but it’s currently just missing the peripheral.