Will future generative models choke to death on their own excreta? No.
Now that the goalposts have moved from “these neural nets will never work and that’s why they’re bad” to “they are working and that’s why they’re bad”, a repeated criticism of DALL·E 2 etc. is that their deployment will ‘pollute the Internet’ by democratizing high-quality media, which may (given all the advantages of machine intelligence) quickly come to exceed ‘regular’ (artisanally-crafted?) media in sheer volume, and that ironically this will make it difficult or impossible to train better models. I don’t find this plausible at all, but lots of people seem to, and no one is correcting all these wrong people on the Internet, so here’s a quick rundown of why:
It Hasn’t Happened Yet: there is no such thing as ‘natural’ media on the Internet, and never has been. Even a smartphone photograph is heavily massaged by a pipeline of algorithms (increasingly DL-based) before it is encoded into a lossy codec like JPEG, designed to throw away human-perceptually-unimportant data. We are saturated in all sorts of Photoshopped, CGIed, video-game-rendered, Instagram-filtered, airbrushed, posed, lighted, (extremely heavily) curated media. If these models are as uncreative as critics claim and merely memorizing… what’s the big deal?
Spread will happen slowly: see earlier comment. Consider GPT-3: you can sign up for the OA API easily, there’s GPT-NeoX-20B, FB is releasing OPT, etc. It’s still ‘underused’ compared to what it could be used for, and most of the text on the Internet continues to be written by humans or non-GPT-3-quality software, etc.
There’s Enough Data Already: what if we have ‘used up’ the existing data and are ‘polluting’ new data unfixably such that the old pre-generative datasets are inadequate and we can’t cheaply get clean new data?
sample efficiency: critics love to harp on the sample-inefficiency. How can these possibly be good if training them takes hundreds of millions of images? Surely the best algorithm would be far more sample-efficient, and you should be able to match DALL·E 2 with a million well-selected images, max. Sure: the first is always the worst, the experience curves will kick in, we are trading data for compute because photos are cheap & GPUs dear; I agree, I just don’t think those are all that bad or avoidable.
But the flip side is that if existing quantities of data are already enough to train near-photorealistic models even with today’s admittedly gross sample-inefficiency, then once those more sample-efficient models become possible, the same amount of data will definitely be enough to train much better models in the future.
diversity: Sometimes people think that 400m or 1b images are ‘not sufficiently diverse’, and must be missing something. This might be due to a lack of appreciation for the tail and Littlewood’s Law: if you spend time going through datasets like LAION-400m or YFCC100M, there is a lot of strange stuff there. All it takes is one person out of 7 billion doing it for any crazy old reason (including being crazy).
This raises another ironic reversal of criticisms: to deprecate some remarkable sample, a critic will often try to show “it’s just copying” something that looks vaguely similar in Google Images. Obviously, this is a dilemma for the future-choke argument: if the remarkable sample is not already in the corpus, then the present models have already learned adequately and ‘generalized’ (despite all of the noise from point #1), and future data is not necessary for future models; if the remarkable sample is in the corpus, then future models can just learn from that!
all the other data: current models only scratch the surface of available media. There is a lot of image data out there. A lot.
For example, Twitter blocks Common Crawl, which is the base for many scrapes; how many tens of billions of artworks alone is that? Twitter has tons of artists who post art on a regular basis (including many series running from sketches to finished works, which would be particularly valuable). Or DALL·E 2 is strikingly bad at anime and appears to have filtered out or omitted all anime sources—so, that’s 5 million images from Danbooru2021 not included (>9.9 million on Big Booru), an overlapping >100 million images from Pixiv (powered in part by commissions through things like Skeb.jp, >100k commissions thus far), >2.8 million furry images on e621, 2 million My Little Pony images on Derpibooru… What about BAM!, n = 65 million? DeviantArt seems to be represented, but I doubt most of it is (>400 million). There are something like >2 million books published every year; many have no illustrations, but many others are things like comic books (>1,500 series by major publishers?) or manga (>8,000/year?) which have hundreds or thousands of discrete illustrations—so that’s got to be millions of images piling up every year. How about gig markets? Fiverr, just one of many, reports 3.4 million buyers spending >$200 each; at a nominal rate of $50/image, that would be 4 images per buyer, or ~13 million images. So altogether, there are billions of high-quality images out there for the collecting, and something like hundreds of millions being added each year.
(These numbers may seem high, but they make sense if you think about the exponential growth of the Internet, smartphones, and digital art: the majority of it must have been created recently, so if there are billions of images, then the current annual rate is probably something like ‘hundreds of millions’. They can’t all be scans of ancient Victorian erotica.)
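For scale, here is the back-of-envelope tally implied by the figures above, as a quick sketch: every count is just the rough number quoted in the text (not an authoritative dataset census), and the Fiverr line simply reproduces the $200-per-buyer ÷ $50-per-image arithmetic.

```python
# Back-of-envelope tally of the still-image sources mentioned above.
# Every count is the rough figure quoted in the text, not an authoritative number.
image_sources = {
    "Danbooru2021":          5_000_000,
    "Pixiv (overlapping)": 100_000_000,
    "e621":                  2_800_000,
    "Derpibooru":            2_000_000,
    "BAM!":                 65_000_000,
    "DeviantArt":          400_000_000,
}

# Fiverr: 3.4M buyers spending >$200 each, at a nominal $50/image -> ~4 images/buyer.
fiverr_images = 3_400_000 * (200 // 50)

total = sum(image_sources.values()) + fiverr_images
print(f"~{total / 1e6:.0f} million images")  # ~588 million, before Twitter, books, video...
```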
Also: multimodal data. How about video? A high-quality Hollywood movie or TV series probably provides a good non-redundant still every couple of seconds, and TV production over the past few years has been, thanks to streaming money, at all-time highs of >500 series per year...
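And a similarly hedged sketch of the video arithmetic: the >500-series figure is from the text, but episodes per series, runtime, and the one-still-every-two-seconds rate are purely illustrative assumptions.

```python
# Illustrative only: stills per year from new TV, under assumed episode counts/runtimes.
series_per_year   = 500  # figure quoted above
episodes_each     = 10   # assumption
minutes_each      = 45   # assumption
seconds_per_still = 2    # "a good non-redundant still every couple of seconds"

stills = series_per_year * episodes_each * minutes_each * 60 // seconds_per_still
print(f"~{stills / 1e6:.2f} million stills/year from new TV series alone")  # ~6.75 million
```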
I mean, this is all a ludicrously narrow slice of the world, so I hope it adequately makes the point that we have not suffered, do not now suffer, and very much will not in the future suffer from any hard limit on data availability. Chinchilla may make future models more data-hungry, and you may have to do some real engineering work to get as much as you want, or pay for it, but it’s there if you want it enough. (We will have a ‘data shortage’ in the same way we have a $5/pound filet mignon shortage.)
The Student Becomes The Master: imitation learning can lead to better performance than the original demonstrations or ‘experts’.
There are a bunch of ways to do the bootstrap, but to continue the irony theme here, the most obvious one is the ‘cherry-picking’ accusation: “these samples are good but they’re just cherry-picked!” This is silly because all samples, whether from humans or from past models, are cherry-picked; the human samples are typically filtered much harder than the machine samples; there are machine ways to do the cherry-picking automatically; and you couldn’t get current samples out of old models no matter how heavily you selected.
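(As one concrete example of the ‘machine ways to do the cherry-picking automatically’: a minimal best-of-n reranking sketch using the standard Hugging Face CLIP checkpoint. This illustrates the general trick, not OpenAI’s actual pipeline.)

```python
# Best-of-n cherry-picking sketch: score candidate images against the prompt with
# CLIP and keep the highest-scoring one. `candidates` is a list of PIL images.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_best(prompt, candidates):
    inputs = processor(text=[prompt], images=candidates,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # one score per image
    return candidates[int(scores.argmax())]
```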
But for the choking argument, this is a problem: it is easier to recognize good samples than to create them. If humans are filtering model outputs, then the more heavily they filter, the more they are teaching future models what are good and bad samples by populating the dataset with good samples and adding criticism to bad samples. (Every time someone posts a tweet with a snarky “#dalle-fail”, they are labeling that sample as bad and teaching future DALL·Es what a mistake looks like and how not to draw a hand.) Good samples will get spread and edited and copied online, and will elicit more samples like that, as people imitate them and step up their game and learn how to use DALL·E right.
Mistakes Are Just Another Style: we can divide the supposed pernicious influence of generated samples into two kinds of error: random vs. systematic.
random error: not a problem.
NNs are notoriously robust to random error such as label noise; the data these models are trained on is already full of it. (On the DALL·E 2 subreddit, people will sometimes exclaim that the model understood their “gamboy made of crystal” prompt: “it knew I meant ‘Gameboy’ despite my typo, amazing!” You sweet, sweet summer child.) You can do wacky things like scramble 99% of ImageNet labels, and as long as there is a seed of correctly-classified images reflecting the truth, a CNN classifier will… mostly work? Yeah. (Or the observation with training GPT-style models that as long as there is a seed of high-quality text like Wikipedia to work with, you can throw in pretty crummy Internet text and it’ll still help.) To the extent that the models are churning out random errors (similar to how GPT-3 stochastic sampling often leads to just dumb errors), they aren’t really going to interfere with themselves.
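(A toy version of that label-scrambling setup, for flavor: CIFAR-10 rather than ImageNet, with the 99% corruption rate quoted above. Train any off-the-shelf CNN on the wrapped dataset and evaluate on the clean test split.)

```python
# Toy label-scrambling experiment: randomize the labels of 99% of the training set,
# leaving a 1% "seed" of correct labels intact, then train a standard classifier on it.
import random
from torch.utils.data import Dataset
from torchvision.datasets import CIFAR10

class ScrambledLabels(Dataset):
    def __init__(self, base, corrupt_frac=0.99, num_classes=10, seed=0):
        self.base = base
        rng = random.Random(seed)
        corrupted = rng.sample(range(len(base)), int(corrupt_frac * len(base)))
        # A fixed random (possibly wrong) label for each corrupted index.
        self.scrambled = {i: rng.randrange(num_classes) for i in corrupted}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, i):
        x, y = self.base[i]
        return x, self.scrambled.get(i, y)  # true label kept only for the clean seed

train = ScrambledLabels(CIFAR10(root="data", train=True, download=True), corrupt_frac=0.99)
# ...train an ordinary CNN on `train`, then evaluate on the clean CIFAR-10 test split.
```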
You pay a price in compute, of course, and GPUs aren’t free, but it’s certainly no fatal problem.
systematic error: what people generally seem to have in mind is more like systematic error in generated samples: DALL·E 2 can’t generate hands, and so all future models are doomed to look like missives from the fish-people beneath the sea.
But the thing is, if an error is consistent enough for the critic to notice, then it’s also obvious enough for the model to notice. If you repeat a joke enough times, it can become funny; if you repeat an error enough, then it’s just your style. Errors will be detected, learned, and conditioned on for generation. Past work in deepfake detection has shown that even when models like StyleGAN2 are producing photorealistic faces, they are still detectable with high confidence by other models, because of subtle issues like reflections in the eyes or invisible artifacts in the high-frequency ranges where humans are blind. We all know the ‘Artstation’ prompt by now, and it goes the other way too: you can ask DALL·E 2 for “DeepDream” (remember the psychedelic dog-slugs?) (or ‘grainy 1940s photo’ or ’1980s VHS tape’ or ‘out of focus shot’ or ‘drawn by a child’ or...). You can’t really ask it for ‘DALL·E’ images, and that’s a testament to its success.
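(For intuition about the high-frequency point, here is a deliberately crude sketch of the kind of spectral feature such detectors exploit; real detectors in the literature train a classifier on spectra or learned features, while this just measures how much of an image’s energy sits in frequencies humans barely see.)

```python
# Crude frequency-domain feature: what fraction of an image's spectral energy sits
# outside a centered low-frequency disc? Generated images often show telltale peaks
# or gridding up there that are invisible to the eye.
import numpy as np

def high_freq_energy(img_gray: np.ndarray, cutoff: float = 0.75) -> float:
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img_gray))) ** 2
    h, w = spec.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)  # normalized radius
    return float(spec[r > cutoff].sum() / spec.sum())

# A real detector would compute features like this (or full log-spectra) for many
# real and generated images and feed them to an ordinary classifier.
```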
DALL·E 2 errors are just the DALL·E 2 style, and it will be promptable like anything else by future models, which will detect the errors (and/or the OA watermark in the lower right corner); it will no more ‘break’ them than the melting watches in a Salvador Dalí painting destroy their ability to depict a pocket-watch. Dalí paintings have melty dairy products and timekeeping devices, and DALL·E 2 paintings have melty faces and hands, and that’s just the way those artistic genres are, but it doesn’t make you or future models think that’s how everything is.

(And if you can’t see a style, well then: Mission. Accomplished.)
If I were to worry about something, one worry I have not seen mentioned is the effect of generative-model excellence on cultural evolution: not models being incapable of high-quality or diverse styles, but being too capable, too good, too cheap, too fast.
Ted Gioia, in a recent grumpy article about sequelitis, has a line:
These are the key indicators that you might be living in a society without a counterculture:...5. Alternative voices exist—in fact, they are everywhere—but are rarely heard, and their cultural impact is negligible
Generative models may help push us towards a world in which alternative voices are everywhere and it has never been cheaper or easier for anyone in the world to create a unique new style which can complement their other skills and produce a vast corpus of high-quality works in that style, but also in which it has never been harder to be heard or to develop that style into any kind of ‘cultural impact’.
When content is infinite, what becomes scarce? Attention, i.e. selection.
Communities rely on multilevel selection to incubate new styles and gradually percolate them up from niches to broader audiences.
For this to work, those small communities need to be somewhat insulated from fashions, because a new style will never enter the world as fully optimized as old styles; they need investment, perhaps exploiting sunk costs, perhaps exploiting parasociality, so they can deeply explore it.
Unfortunately, generative models render entry into a niche trivial, produce floods of excellent content, and accelerate cultural turnover & amnesia, so it gets harder & harder to make any dent: by the time you show up to the latest thing with your long-form illustrated novel (which you could never have done by yourself), that’s soooo last week and your link is already buried 100 pages deep in the submission queue.
Your 1,000 followers on Twitter will think you’re a genius, and you will be, but that will be all it ever amounts to. I am already impressed by just how many quite beautiful or interesting images I see posted from DALL·E 2 or Stable Diffusion, which nevertheless immediately disappear in the infinite scroll under the flood of further selections, never to be seen again or archived anywhere or curated.
Now that goalposts have moved from “these neural nets will never work and that’s why they’re bad” to “they are working and that’s why they’re bad”, a repeated criticism of DALL·E 2 etc is that their deployment will ‘pollute the Internet’ by democratizing high-quality media, which may (given all the advantages of machine intelligence) quickly come to exceed ‘regular’ (artisanally-crafted?) media, and that ironically this will make it difficult or impossible to train better models. I don’t find this plausible at all but lots of people seem to and no one is correcting all these wrong people on the Internet
Can I get a link to someone who actually believes this? I’m honestly a little skeptical this is a common opinion, but wouldn’t put it past people I guess.
I’ve seen it several times on Twitter, Reddit, and HN, and that’s excluding people like Jack Clark, who has pondered it repeatedly in his Import.ai newsletter & used it as a theme in some of his short stories (but much more playfully & thoughtfully in his case, so he’s not the target here). I think the one that annoyed me enough to write this was when Imagen hit HN and the second lengthy thread was all about ‘poisoning the well’, with most commenters accepting the premise. It has also been asked here on LW at least twice in different places. (I’ve also since linked this writeup at least 4 times to various people asking this exact question about generative models choking on their own exhaust, and the rise of ChatGPT has led to it coming up even more often.)