I think it makes sense that it fails in this way. ChatGPT really doesn’t see lines arranged vertically; it just sees the prompt as one long line. But given that it has been trained on a lot of ASCII art, it will probably be successful at copying some of it some of the time.
In case there is any doubt, here is GPT-4’s own explanation of these phenomena:
Lack of spatial awareness: GPT-4 doesn’t have a built-in understanding of spatial relationships or 2D layouts, as it is designed to process text linearly. As a result, it struggles to maintain the correct alignment of characters in ASCII art, where spatial organization is essential.
Formatting inconsistencies in training data: The training data for GPT-4 contains a vast range of text from the internet, which includes various formatting styles and inconsistent examples of ASCII art. This inconsistency makes it difficult for the model to learn a single, coherent way of generating well-aligned ASCII art.
Loss of formatting during preprocessing: When text is preprocessed and tokenized before being fed into the model, some formatting information (like whitespaces) might be lost or altered. This loss can affect the model’s ability to produce well-aligned ASCII art.
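To make that last point concrete, here is a rough illustration of my own (not part of GPT-4’s answer) of what a BPE tokenizer does to a single row of ASCII art, assuming tiktoken’s cl100k_base encoding. Runs of spaces tend to be merged into multi-character tokens, so the model never sees a neat one-character-per-column grid:

```python
# Hypothetical illustration: how a BPE tokenizer splits one row of ASCII art.
# The specific encoding (cl100k_base) is an assumption for the sake of example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
line = "   ( o o )   "  # one row of a made-up ASCII-art face
tokens = enc.encode(line)

for t in tokens:
    # decode each token individually to see how the whitespace was grouped
    print(repr(enc.decode([t])))
```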
This is a more sensible representation of a balloon than the one in the post; it’s just small. More prompts tested on both ChatGPT-3.5 and GPT-4 would clarify the issue.
ChatGPT really doesn’t see lines arranged vertically; it just sees the prompt as one long line.
Vision can be implemented in transformers by representing pictures with linear sequences of tokens, which stand for small patches of the picture, left-to-right, top-to-bottom (see appendix D.4 of this paper). The model then needs to learn on its own how the rows fit together into columns and so on. The vision part of the multimodal PaLM-E seems to be trained this way. So it’s already essentially ASCII art, just with a different character encoding.
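To make the analogy concrete, here is a minimal sketch of that ViT-style flattening; the patch size and helper name are my own assumptions rather than anything taken from the paper:

```python
# A minimal sketch (assuming a standard ViT-style setup) of how an image
# becomes a linear sequence of patch tokens, read left-to-right, top-to-bottom
# -- analogous to ASCII art read as one long line.
import numpy as np

def image_to_patch_sequence(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches in row-major order."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "toy sketch: no padding handled"
    rows, cols = H // patch, W // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    # flatten to (rows * cols, patch * patch * C): one "token" per patch
    return patches.reshape(rows * cols, patch * patch * C)

seq = image_to_patch_sequence(np.zeros((224, 224, 3)))
print(seq.shape)  # (196, 768): a 14x14 grid of patches, each a 768-dim vector
```

Each row of the output is one token, and the model has to discover for itself that token i and token i + 14 sit in the same column, much as GPT has to discover that two characters sit in the same column of an ASCII picture.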
It’s a subjective matter whether the above is a successful ASCII art balloon or not. If we hold GPT to the same standards we do for text generation, I think we can safely say the above depiction is a miserable failure. The lack of symmetry and overall childishness of it suggest it has understood nothing about the spatiality and only by random luck manages to approximate something it has explicitly seen in the training data (I’ve done a fair bit of repeated generations and they all come out poorly). I think the Transformer paper was interesting as well, although they do mention that it only works well when there is a large amount of training data. Otherwise, the inductive biases of CNNs do have their advantages, and combining both is probably superior, since the added computational burden of a CNN in conjunction with a Transformer is hardly worth talking about.
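For what it’s worth, here is a rough sketch of what “combining both” could look like; this is my own assumption of a generic hybrid, not a specific published architecture. A cheap convolutional stem supplies the local spatial inductive bias, and a standard transformer encoder then operates on the flattened feature map:

```python
# Rough sketch of a hypothetical CNN-stem + transformer hybrid (my assumption,
# not a specific published model). Positional embeddings are omitted for
# brevity but would be needed in practice for the encoder to learn layout.
import torch
import torch.nn as nn

class ConvStemTransformer(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        # cheap CNN stem: 3-channel image -> d_model-channel feature map, 8x downsampled
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, d_model, kernel_size=3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.stem(x)                       # (B, d_model, H/8, W/8)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H/8 * W/8, d_model)
        return self.encoder(tokens)                # contextualized patch tokens

out = ConvStemTransformer()(torch.randn(1, 3, 64, 64))
print(out.shape)  # (1, 64, 256): an 8x8 grid of tokens
```

The stem here is just a few convolutions, so its cost really is small next to the transformer layers, which is the point about the added computational burden being hardly worth talking about.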