Try transforming the language task and the image task into the same format and comparing the two. It’s easy to rasterize images so that each image becomes a string of colored dots. For language, assign each token a color (using the same token-to-color mapping across all texts) and replace every token with its dot. You have now transformed each text into a string of colored dots as well.
Now take, say, the first billion dots from your (transformed) image collection and the first billion dots from your (transformed) text collection. Which string has the higher entropy?
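One crude way to answer this empirically is to estimate the per-symbol Shannon entropy of each dot string from its symbol frequencies. This is a sketch, not a real measurement: the toy data below (a skewed "image" string and a near-uniform "text" string) is invented for illustration, and treating symbols as i.i.d. ignores sequential structure, so it only upper-bounds the true entropy rate.

```python
import math
from collections import Counter

def empirical_entropy(symbols):
    """Estimate Shannon entropy in bits per symbol, treating each
    symbol as an i.i.d. draw from its empirical distribution.
    Sequential structure is ignored, so this is an upper bound
    on the true entropy rate of the source."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical toy data (not real image/text statistics):
# an "image" string with a skewed color distribution, and a
# "text" string spread nearly uniformly over 37 distinct dots.
image_dots = ["red"] * 80 + ["blue"] * 20
text_dots = [f"color{i % 37}" for i in range(100)]

print(empirical_entropy(image_dots))  # skewed counts -> fewer bits/dot
print(empirical_entropy(text_dots))   # near-uniform -> more bits/dot
```

On real data you would feed in the first billion dots from each collection; a better estimate would also account for correlations between neighboring dots (e.g. via compression length), since nearby pixels in an image are far from independent.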