Or has it, and it’s just not highly publicized?
Five years ago, I was under the impression that most “machine learning” jobs were just data cleaning, linear regression, working with regular data stores, and debugging stuff. Or at least, that was the meme that I heard from a lot of people. That didn’t surprise me at the time. It was easy to imagine that all the fancy research results were fragile, or hard to apply to products, or would at the very least take a long time to adapt.
But at this point it’s been quite a few years since machine learning systems first existed that immensely impressed me. The first such system was probably AlphaGo, all the way back in 2016! AlphaGo then spun off into so many better, faster, cheaper systems that I didn’t even keep track of them. And since then I’ve lost track of the number of unrelated systems that immensely impressed me. And their capabilities are so general that I feel sure that they must be convertible into enormous economic value. I still believe that it takes a long time to boot up a company around novel research results, but I’m not actually well calibrated on how long that takes, and it’s been long enough that it’s starting to feel awkward, like my models are missing something. Here are examples of AI products that I wouldn’t have been surprised to see by now, but which I don’t think exist. (I can imagine that many of these examples technically exist, but not at the level that I mean.)
Spotify playlists that are actually just procedurally generated music of various genres
A tool that helps researchers/legislators/et cetera by summarizing papers, books, laws on demand
Tools that help people (like writers) brainstorm, flesh out ideas by generating further details, asking questions, etc
A version of photoshop but with tons of AI tools
Widely available self-driving cars
Physics simulators that are way faster
Paradigmatically different and better web search
So what’s the deal? Here’s a list of possible explanations. I’d love to hear if anyone has evidence for any of them, or if you know of reasons not on the list.
The research results are actually not all that applicable to products; more research is needed to refine them
They’re way too expensive to run to be profitable
Yeah, no, it just takes a really long time to convert innovation into profitable, popular product
Something something regulation?
The AI companies are deliberately holding back for whatever reason
The models are already integrated into the economy and you just don’t know it.
I deny the premise. It’s publicized, you’re just not paying attention to the water in which you swim. Companies like Google and even Apple talk a great deal about how they increasingly employ DL at every layer of the stack. Just for smartphones: pull your smartphone out of your pocket. This is how DL generates economic value: DL affects the chip design, is an increasing fraction of the chips on the SoC, is most of what your camera does, detects your face to unlock it, powers the recommendations of every single thing on it whether Youtube or app store or streaming service (including the news articles and notifications shown to you as you unlock), powers the features like transcripts of calls or machine translation of pages or spam detection that you take for granted, powers the ads which monetize you in the search engine results which they also power, the anti-hacking and anti-abuse measures which keep you safe (and also censor hatespeech etc on streams or social media), the voice synthesis you hear when you talk to it, the voice transcription when you talk to it or have your Zoom/Google videoconference sessions during the pandemic, the wake words, the predictive text when you prefer to type rather than talk and the email suggestions (the whole email, or just the spelling/grammar suggestions), the GNN traffic forecasts changing your Google Maps route to the meeting you emailed about, the cooling systems of the data centers running all of this (not to mention optimizing the placement of the programs within the data centers both spatially in solving the placement problem and temporally in forecasting)...
This all is, of course, in addition to the standard adoption curves & colonization wave dynamics, and merely how far it’s gotten so far.
I think the conclusion here is probably right, but a lot of the examples seem to exaggerate the role of DL. Like, if I thought all of the obvious-hype-bullshit put forward by big companies about DL were completely true, then it would look like this answer.
Starting from the top:
So, a few years back Google was pushing the idea of “AI first design” internally—i.e. design apps around the AI use-cases. By all reports from the developers I know at Google, this whole approach crashed and burned. Most ML applications didn’t generalize well beyond their training data. Also they were extremely unreliable so they always needed to either be non-crucial or to have non-ML fallbacks. (One unusually public example: that scandal where black people were auto-labelled as gorillas.) I hear that the whole “AI first” approach has basically been abandoned since then.
Of course, Google still talks about how they increasingly employ DL at every layer of the stack. It’s great hype.
I mean, maybe it’s used somewhere in the design loop, but I doubt it’s particularly central. My guess would be it’s used in one or two tools somewhere which are in practice not-importantly-better than the corresponding non-DL version, but someone stuck a net in there somewhere just so that they could tell a clueless middle manager “it uses deep learning!” and the clueless middle manager would buy this mediocre piece of software.
Misleading. Yeah, computational photography techniques have exploded, but the core tricks are not deep learning at all.
This one I think I basically buy, although I don’t know much about how face detection is done today.
Misleading. Those recommenders presumably aren’t using end-to-end DL; they’re mixing it in in a few specific places. It’s a marginal value add within a larger system, not the backbone of the system.
I basically buy the transcripts and translation examples, and basically don’t buy the spam example—we already had basically-viable spam detection before DL, the value-add there has been marginal at best.
I hear companies wish they could get ML to do this well, but in practice most things still need to loop through humans. That’s epistemic status: hearsay, so not confident, but it matches my priors.
These examples I basically buy.
These seem to work in practice exactly when they’re doing the same thing an n-gram predictor would do, and not work whenever they try to predict anything more ambitious than that.
I would be surprised if DL were doing most of the heavy lifting in Maps’ traffic forecast at this point, although I would not be surprised if it were sprinkled in and hyped up. That use-case should work really well for non-DL machine learning systems (or so I’d guess), which are a lot more transparent to the designers and developers.
Another two places where I doubt that DL is the main backbone, although it may be sprinkled in here and there and hyped up a lot. I doubt that the marginal value-add from DL is all that high in either of these use-cases, since non-DL machine learning should already be pretty good at these sorts of problems.
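For readers who haven’t met the n-gram baseline that the predictive-text point above compares against, here is a minimal sketch of a bigram (n = 2) next-word predictor in Python. The toy corpus and the names `train_bigram` and `predict_next` are made up for illustration; real keyboards use far larger models plus smoothing, and I am not claiming this is what any shipping product does.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows


def predict_next(follows, word, k=1):
    """Return up to k most frequent continuations of `word` (empty if unseen)."""
    counter = follows.get(word.lower())
    if counter is None:
        return []
    return [w for w, _ in counter.most_common(k)]


# Hypothetical toy corpus, standing in for a user's typing history.
corpus = [
    "see you at the meeting",
    "see you at the office",
    "see you at the meeting tomorrow",
]
model = train_bigram(corpus)
print(predict_next(model, "the", k=2))
```

The predictor only knows what followed the previous word, which is why this kind of model does fine on stock phrases and has nothing to offer on anything longer-range.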
Stellar breakdown of hype vs. reality. Just wanted to share some news from today that Google has fired an ML scientist for challenging their paper on DL for chip placement.
From Engadget (ungated):
Sounds like challenging the hype is a terminable offense.
But see gwern’s context for the article below.
“One story is good until another is told”. The chip design work has apparently been replicated, and Metz’s* writeup there has several red flags: in describing Gebru’s departure, he omits any mention of her ultimatum and list of demands, so he’s not above leaving out extremely important context in these departures in trying to build up a narrative of ‘Google fires researchers for criticizing research’; he explicitly notes that Chatterjee was fired ‘for cause’ which is rather eyebrow-raising when usually senior people ‘resign to spend time with their families’ (said nonfirings typically involving things like keeping their stock options while senior people are only ‘fired for cause’ when they’ve really screwed up, like, say, harassment of an attractive young woman) but he doesn’t give what that ‘cause’ was (does he really not know after presumably talking to people?) or wonder why both Chatterjee and Google are withholding it; and he uninterestedly throws in a very brief and selective quote from a presumably much longer statement by a woman involved which should be raising your other eyebrow:
(I note that this is put at the end, which in the NYT house style, is where they bury the inconvenient facts that they can’t in good journalist conscience leave out entirely, and that makes me suspect there is more to this part than is given.)
So, we’ll see. EDIT: Timnit Gebru, perhaps surprisingly, denies any parallel and seems to say Chatterjee deserved to be fired, saying:
Wired has a followup article with more detailed timeline and discussion. It edges much closer to the misogyny narrative than the evil-corporate-censorship narrative.
* yes, the SSC Metz.
Fair enough! Great context, thanks.
In my experience, not enough people on here publicly realise their errors and thank the corrector. Nice to see it happen here.
I don’t think Alex is saying deep learning is valueless, he’s saying the new value generated doesn’t seem commensurate with the scale of the research achievements. Everyone is using algorithmic recommendations, but they don’t feel better than Netflix or Amazon could do 10 years ago. Speech to text is better than it was, but not groundbreakingly so. Predictive text may add value to my life one day, but currently it’s an annoyance.
Maybe the more hidden applications have undergone bigger shifts. I’d love to hear more about deep learning for chip or data center design. But right now the consumer uses feel like modest improvements compounding over time, and I’m constantly frustrated by how unconfigurable tools are becoming.
I don’t know what you’re talking about. Speech to text actually works now! It was completely unusable just 12 years ago.
Agreed. I distinctly remember it becoming worth using in 2015, and was using that as my reference point. Since then it’s probably improved, but it’s been gradual enough I haven’t noticed as it happens. Everything Alex cites came after 2015, so I wasn’t counting that as “had major discontinuities in line with the research discontinuities”.
However, I think foreign language translation has experienced such a discontinuity, and it’s of comparable magnitude to the wishlist.
Was circa 2015 speech-to-text using deep learning? If not, how did it work?
Prior to DL, text-to-speech used hidden Markov models (HMMs). Those were replaced with LSTMs relatively early in the DL revolution (random 2014 paper). In 2015 there were likely still many HMM-based models around, but apparently at least Google already used DL-based text-to-speech.
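To make “how did it work?” a bit more concrete, here is a toy sketch of the decoding step in a hidden Markov model: the Viterbi algorithm, which finds the most probable sequence of hidden states given a sequence of observations. Everything specific here is an assumption for illustration: the two phoneme-like states (`ah`, `s`), the three quantized acoustic labels (`low`, `mid`, `high`), and all the probabilities. Production systems of that era were vastly larger and, as far as I know, typically paired the HMM with Gaussian mixture acoustic models.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path probability.
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy model: two phoneme-like states, three acoustic labels.
states = ("ah", "s")
start_p = {"ah": 0.6, "s": 0.4}
trans_p = {"ah": {"ah": 0.7, "s": 0.3}, "s": {"ah": 0.4, "s": 0.6}}
emit_p = {
    "ah": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "s": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

decoded = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
print(decoded)
```

The LSTM-based systems mentioned above replace these hand-structured transition and emission tables with a learned network, which is roughly where the quality jump came from.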
I would point out that the tech sector is the single most lucrative sector to have invested in in the past decade, despite endless predictions that the tech bubble is finally going to pop, and this techlash or that recession will definitely do it real soon now.
What would the world look like if there were extensive quality improvements in integrated bundles of services behind APIs and SaaS and smartphones driven by, among other things, DL? I submit that it would look like ours looks.
Consumer-obvious stuff is just a small chunk of the economy.
How would you know that? You aren’t Amazon. And when corporations do report lift, the implied revenue gains are pretty big. Even back in 2014 or so, Google could make a business case for dropping $130m on an order of Nvidia GPUs (ch8, Genius Makers), much more for DeepMind, and that was back when DL was mostly ‘just’ image stuff & NMT looking inevitable, well before it began eating the rest of the stack and modalities.
On tech sector out-performance, I think the more appropriate lookback period started around 2016 when AlphaGo became famous.
On predictions, there were also countless predictions that tech would take over the world. An abundance of predictions of boom or bust is a constant feature of capital markets, and should be given no weight.
On causal attribution, note that there have been many other advances in the tech sector, such as cloud computing, mobile computing, industry digitization, Moore’s law, etc. It’s unclear how much of the value added is driven by DL.
I disagree. Major investments in DL by big tech like FB, Baidu, and Google started well before 2016. I cited that purchase by Google partially to ward off exactly this sort of goalpost moving. And stock markets are forward-looking, so I see no reason to restrict it to AlphaGo (?) actually winning.
Who cares about predictions? Talk is cheap. I’m talking about returns. Stock markets are forward-looking, so if that were really the consensus, they wouldn’t’ve outperformed.
And yet, in worlds where DL delivers huge economic value in consumer-opaque ways all throughout the stack, they look like our world looks.
Consumer-obvious stuff (“final goods and services”) is what is measured by GDP, which I think is the obvious go-to metric when considering “economic value.” The items on Alex’s list strike me as final goods, while the applications of DL you’ve mentioned are mostly intermediate goods. Alex wasn’t super clear on this, but he seems to be gesturing at the question of why we haven’t seen more new types of final goods and services, or “paradigmatically better” ones.
So while I think you are correct that DL is finding applications in making small improvements to consumer goods and research, design, and manufacturing processes, I think Alex is correct in pointing out that this has yet to introduce a whole new aisle of product types at Target.
I didn’t say ‘final goods or services’. Obviously yes, in the end, everything in the economy exists for the sake of human consumers, there being no one else who it could be for yet (as we don’t care about animals or whatever). I said ‘consumer-obvious’ to refer to what is obvious to consumers, like OP’s complaint.
This is not quite as simple as ‘final’ vs ‘intermediate’ goods. Many of the examples I gave are often final goods, like machine translation. (You, the consumer, punch in a foreign text, get its translation, and go on your merry way.) It’s just that they are upgrades to final goods, which the consumer doesn’t see. If you were paying attention, the rollout of Google Translate from n-gram statistical models to neural machine translation was such a quality jump that people noticed it had happened before Google happened to officially announce it. But if you weren’t paying attention at that particular time in November 2015 or whenever it was, well, Google Translate doesn’t, like, show you little animations of brains chugging away inside TPUs; so you, consumer, stand around like OP going “but why DL???” even as you use Google Translate on a regular basis.
Consumers either never realize these quality improvements happen (perhaps you started using GT after 2015), or they just forget about the pain points they used to endure (cf. my Ordinary Life Improvements essay which is all about that), or they take for granted that ‘line on graph go up’ where everything gets 2% better per year and they never think about the stacked sigmoids and radical underlying changes it must take to keep that steady improvement going.
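A quick numerical sketch of the “stacked sigmoids” point above: each individual technology follows an S-curve whose yearly gains swing widely, but when a new curve keeps arriving on top of the old ones, the total gain per year looks almost constant. All the numbers (ten curves, one every 10 years, a width of 5) are hypothetical and chosen only to make the effect visible; they are not fitted to anything.

```python
import math


def sigmoid(t, center, width):
    """One technology's S-curve: slow start, rapid middle, saturation."""
    return 1.0 / (1.0 + math.exp(-(t - center) / width))


def stacked(t, centers, width):
    """Total capability when each successive S-curve adds on top of the last."""
    return sum(sigmoid(t, c, width) for c in centers)


# Hypothetical numbers: ten successive technologies, one maturing every 10 years.
centers = [10 * k for k in range(10)]
width = 5.0

# Year-over-year gains in the middle of the run, away from the first and last curves.
stacked_gains = [
    stacked(t + 1, centers, width) - stacked(t, centers, width)
    for t in range(30, 60)
]
# The same window for a single S-curve centered at year 45.
single_gains = [
    sigmoid(t + 1, 45, width) - sigmoid(t, 45, width)
    for t in range(30, 60)
]

print(min(stacked_gains), max(stacked_gains))
print(min(single_gains), max(single_gains))
```

The stacked gains stay within a narrow band while the single curve’s gains vary several-fold, which is the sense in which a steady line on a graph can hide a sequence of radical underlying changes.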
Yes, I can agree with this. OP is wrong about DL not translating into huge amounts of economic value in excess of the amount invested & yielding profits, because it does, all through the stack, and part of his mistake is in not knowing how many existing things now rely on or plug in DL in some way; but the other part of the mistake is the valid question of “why don’t I see completely brand-new, highly-economically-valuable, things which are blatantly DL, which would satisfy me at a gut level about DL being a revolution?”
So, why don’t we? I don’t think it’s necessarily any one thing, but a mix of factors that mean it would always be slow to produce these sorts of brand new categories, and others which delay by relatively small time periods and mean that the cool applications we should’ve seen this year got delayed to 2025, say. I would appeal to a mix of:
the future is already here, just unevenly distributed: unfamiliarity with all the things that already do exist (does OP know about DALL-E 2 or 15.ai? OK, fine, does he know about Purplesmart.ai where you could chat with Twilight Sparkle, using face, voice, & text synthesis? Where did you do that before?)
automation-as-colonization-wave dynamics like Shirky’s observations about blogs taking a long time to show up after they were feasible. How long did it take to get brandnew killer apps for ‘electricity’?
Hanson uses the metaphor of a ‘rising tide’; DL can be racing up the spectrum from random to superhuman, but it may not have any noticeable effects until it hits a certain point. Below a certain error rate, things like machine translation or OCR or TTS just aren’t worth bothering with, no matter how impressive they are otherwise or how much progress they represent or how fast they are improving. AlphaGo Fan Hui vs AlphaGo Lee Sedol, GPT-2 vs GPT-3, DALL-E 1 vs DALL-E 2...
Most places are still trying to integrate and invent uses for spreadsheets. Check back in 50 years for a final list of applications of today’s SOTA.
the limitations of tool AI designs: “tool AIs want to be agent AIs” because tools lose a lot of performance and need to go through human bottlenecks, and are inherently molded to existing niches, like hooking an automobile engine up to a buggy. It’ll pull the buggy, sure, but you aren’t going to discover all the other things it could be doing, and it’ll just be a horse which doesn’t poop as much.
exogenous events like
GPU shortages (we would be seeing way more cool applications of just existing models if hobbyists didn’t have to sell a kidney to get a decent Nvidia GPU), which probably lets Nvidia keep prices up (killing tons of DL uses on the margin) and hold back compute progress in favor of dripfeeding
strategic missteps (Intel’s everything, AMD’s decision to ignore Nvidia building up a software ecosystem monopoly & rendering themselves irrelevant to DL, various research orgs ignoring scaling hypothesis work until relatively recently, losing lots of time for R&D cycles)
basic commercial dynamics (hiding stuff behind an API is a good business model, but otherwise massively holds back progress):
Marginal cost: We can also note that general tech commercial dynamics like commoditize-your-complement lead to weird, perverse effects because of the valley of death between extremely high-priced services and free services. Like, Google Translate couldn’t roll out NMT using RNNs until they got TPUs. Why? Because a translation has to be almost free before Google can offer it effectively at global scale; and yet, it’s also not worth Google’s time to really try to offer paid APIs because people just don’t want to use them (‘free is different’), it captures little of the value, and Google profits most by creating an integrated ecosystem of services and it’s just not worth bothering doing. And because Google has created ‘a desert of profitability’ around it, it’s hard for any pure-NMT play to work. So you have the very weird ‘overhang’ of NMT in the labs for a long time with ~$0 economic value despite being much better, until suddenly it’s rolled out, but charging $0 each.
Risk aversion/censorship: putting stuff behind an API enables risk aversion and censorship to avoid any PR problems. How ridiculous that you can’t generate faces with DALL-E 2! Or anime!
Have a cool use for LaMDA, Chinchilla/Flamingo, Gopher, or PaLM? Too bad! And big corps can afford the opportunity cost because after all they make so much money already. They’re not going to go bankrupt or anything… So we regularly see researchers leaving GB, OA, or DM (most recently, Adept AI Labs, with incidentally a really horrifying mission from the perspective of AI safety), and scuttlebutt has it, like Jang reports, that this is often because it’s just such a pain in the ass to get big corps to approve any public use of the most awesome models that it’s easier to leave for a startup to recreate it from scratch and then deploy it. Or consider AI Dungeon: it used to be one of the best examples of something you just couldn’t do with earlier approaches, but has gone through so many wild changes in quality, apparently due to the backend and OA issues, that I’m too embarrassed to mention it much these days because I have no idea if it’s lobotomized this month or not.
(I have also read repeatedly that exciting new Google projects like Duplex or a Google credit card have been killed by management afraid of any kind of backlash or criticism; in the case of the credit card, apparently DEI advocates brought up the risk of it ‘exacerbating economic inequality’ or something. Plus, remember that whole thing where for like half a year Googlers weren’t allowed to mention the name “LaMDA” even as they were posting half a dozen papers on Arxiv all about it?)
bottlenecks in compute (even ignoring the GPU shortage part) where our reach exceeds our grant-making grasp (we know that much bigger models would do so many cool things, but the big science money continues to flow to things like ITER or LHC)
and in developers/researchers capable of applying DL to all the domains it could be applied to.
(People the other day were getting excited over a new GNN weather-forecaster which apparently beats the s-t out of standard weather forecasting models. Does it? I dunno, I know very little about weather forecasting models and what it might be doing wrong or how it might be exaggerated. Could I believe that one dude did so as a hobby? Absolutely: just how many DL experts do you think there are in weather-forecasting?)
general underdevelopment of approaches making them inefficient in many ways, so you can see the possibility long before the experience curve has cranked away enough times to democratize it (things like Chinchilla show how far even the basics are from being optimized, and are why DL has a steep experience curve)
Applications are a flywheel, and our DL flywheel has an incredible amount of friction in it right now in terms of getting out to a wider world and into the hands of more people empowered to find new uses, rather than passively consuming souped-up services.
To continue the analogy, it’s like if there was a black cab monopoly on buggies which was rich off fares & deliveries and worried about criticism in the London Times for running over old ladies, and automobile engines were still being hand-made one at a time by skilled mechanicks and all the iron & oil was being diverted to manufacture dreadnoughts, so they were slowly replacing horses one at a time with the new iron horses, but only eccentric aristocrats could afford to buy any to try to use elsewhere, which keeps demand low for engines, keeping them expensive and scarce, keeping mechanicks scarce… etc.
The worst part is, for most of these, time lost is gone forever. It’s just a slowdown. Like the Thai floods simply permanently set back hard drive progress and made them expensive for a long time, there was never any ‘catchup growth’ or ‘overhang’ from it. You might hope that stuff like the GPU shortages would lead to so much capital investment and R&D that we’d enjoy a GPU boom in 2023, given historical semiconductor boom-and-bust dynamics, but I’ve yet to see anything hopeful in that vein.
Gwern, aren’t you in the set that’s aware there’s no plan and this is just going to kill us? Are you that eager to get this over with? Somewhat confused here.
I too am confused.
Isn’t this great news for AI safety due to giving us longer timelines?
This is a brilliant comment for understanding the current deployment of DL. Deserves its own post.
This is the rather disappointing part.
(I moved this to answers, since while it isn’t technically an answer, I think it still functions better as an answer than as a comment)
[I generally approve of mods moving comments to answers.]
Datapoint in favor, Patrick Collison of Stripe says ML has made them $1 billion: https://mobile.twitter.com/patrickc/status/1188890271854915586?lang=en-GB
Well, merchant revenue, not Stripe profit, so not quite as impressive as it sounds, but it’s a good example of the sort of nitty-gritty DL applications you will never ever hear about unless you are deep into that exact niche and probably an employee; so a good Bayesian will remember that where there is smoke, there is fire and adjust for the fact that you’ll never hear of 99% of uses.
How are you distinguishing “new DL was instrumental in this process” from “they finally got enough data that existing data janitor techniques worked” or “DL was marginally involved and overall used up more time than it saved, but CEOs are incentivized to give it excess credit”?
It’s totally possible my world is constantly being made more magical in imperceptible ways by deep learning. It’s also possible that magic is improving at a pretty constant rate, disconnected from the flashy research successes, and PR is lying to me about its role.
Does anybody know what “optimize the bitfields of card network requests” actually means?
The above answer, partially as bulleted lists.
affects the chip design, is an increasing fraction of the chips on the SoC,
is most of what your camera does,
detects your face to unlock it,
powers the recommendations of every single thing on it whether Youtube or app store or streaming service (including the news articles and notifications shown to you as you unlock),
powers the features like
transcripts of calls or
machine translation of pages or
spam detection that you take for granted,
powers the ads which monetize you in the search engine
[search engine] results which they also power,
the anti-hacking and anti-abuse measures which keep you safe
(and also censor hatespeech etc on streams or social media),
the voice synthesis you hear when you talk to it,
the voice transcription when you talk to it or have your Zoom/Google videoconference sessions during the pandemic,
the wake words,
the predictive text when you prefer to type rather than talk
and the email suggestions (the whole email, or just the spelling/grammar suggestions),
the GNN traffic forecasts changing your Google Maps route to the meeting you emailed about,
the cooling systems of the data centers running all of this
(not to mention optimizing the placement of the programs within the data centers both spatially in solving the placement problem and temporally in forecasting)...
Recently I learned that Pixel phones actually contain TPUs. This is a good indicator of how much deep learning is being used (particularly by the camera, I think).
My money is mostly on “It just takes a really long time to convert innovation into profitable, popular product”
A related puzzle piece IMO: Several years ago, all my friends used f.lux to reduce the amount that computer screens screwed up their circadian rhythm. It had to be manually installed. I was confused/annoyed why Apple didn’t do this automatically.
A couple years later, Apple did start doing it automatically (and more recently started shifting everything to dark mode at night).
Meanwhile: A couple years ago, we released shortform on LessWrong. There’s a fairly obvious feature missing, which is showing a user’s shortform on their User Profile. That feature is still missing a couple years later. It would take maybe a day to build, and a week to get reviewed and merged into production. There are other obvious missing features we haven’t gotten around to. The reason we haven’t gotten around to it is something like “well, there’s a lot of competing engineering work to do instead, and there’s a bunch of small priorities that make it hard to just set aside a day for doing it”.
I think Habryka believes this just isn’t the most important thing missing from LW and that keeping the eye on bigger bottlenecks/opportunities is more important. I think Jimrandomh thinks it’s important to make this sort of small feature improvement, but also there’s a bunch of other small feature improvements that need doing (as well as big feature improvements that take up a lot of cognitive attention)
There’s also a bit of organization dysfunction, and/or “the cost of information flow and decisionmaking flow is legitimately ‘real’”.
Something about all this is immensely dissatisfying to me, but it seems like a brute fact about how hard things are. LW is a small team. Apple is a much larger organization that probably pays much higher decisionmaking overhead cost.
I think the bridge from “GPT is really impressive” to “GPT successfully summarizes research reports for you” is a much harder problem than adding f.lux to Mac OS or adding shortform to a User Profile. Also, the teams capable of doing it are mostly working on doing the next cool research thing. Also, InstructGPT totally does exist, but each major productization is a lot of gnarly engineering work (and again the people with the depth of understanding to do it are largely busy)
Note that this is also where some of my “somewhat longer AGI timelines” beliefs come from (i.e., 7 years seems more like the minimum to me, whereas I know a couple people listing that as more like a median).
It seems to me that most of the pieces of AGI exist already, but that actually getting from here to AGI will require 2-3 steps, and each step probably turns out to require some annoying engineering work.
I wonder if there’s also some basic business domain expertise that generalizes here but hasn’t been developed yet. “How to use software to replace humans with spreadsheets” is a piece of domain expertise the SaaS business community has developed to the point where it gets pretty reliably executed. I don’t know that we have widespread knowledge of how to reliably turn models into services/products.
Riffing on the idea that “productionizing a cool research result into a tool/product/feature that a substantial number of users find better than their next best alternative is actually a lot of work”: it’s a lot less work in larger organizations with existing users numbering in the millions (or billions). But, as noted, larger orgs have their own overhead.
I think this predicts that most of the useful products built around deep learning which come out of larger orgs will have certain characteristics, like “is a feature that integrates/enhances an existing product with lots of users” rather than “is a totally new product that was spun up incubator-style within the organization”. It plays to the strengths of those orgs—having both datasets and users, playing better with the existing org structure and processes, more incentive-aligned with the people who “make things happen”, etc.
A couple examples of what I’m thinking of:
substantial improvements in speech recognition—productionized as voice assistant technology, it’s now good enough that it’s sometimes easier to use one than to do something by hand, like setting a timer/alarm/reminder/etc while your hands are occupied with something else,
substantial improvements in image recognition—productionized as image search. I can search for “documents” in Google Photos, and it’ll pull up everything that looks like a document. I can more narrowly search “passport” and it’ll pull up pictures I took of my passport. I can search for “license plate” and it’ll pull up a picture I took of my license plate. I just tried searching for “animal” and it pulled up:
An animated gif of a dog with large glasses on it
Statues of men on horseback, as well as some sculptures of eagles
A bunch of fish in tanks
For structural reasons I’d expect “totally novel, standalone products” to come out of startups rather than larger organizations, but because they’re startups they lack many of the “hard things are easy” buttons that some larger orgs have.
This is what Elicit is working on, roughly.
I’d have gone with—it can take a long time for a society to adapt to a new technology.
Here’s another possible explanation: The models aren’t actually as impressive as they’re made out to be. For example, take DallE2. Yes, it can create amazingly realistic depictions of noun phrases automatically. But can it give you a stylistically coherent drawing based on a paragraph of text? Probably not. Can it draw the same character in three separate scenarios? No, it cannot.
DallE2 basically lifts the floor of quality for what you can get for free. But anyone who actually wants or needs the things you can get from a human artist cannot yet get it from an AI.
See also, this review of a startup that tries to do data extraction from papers: https://twitter.com/s_r_constantin/status/1518215876201250816
DallE2 is bad at prepositional phrases (above, inside) and negation. It can understand some sentence structure, but not reliably.
In the first example, none of those are paragraphs longer than a single sentence.
In the first example, the images are not stylistically coherent! The bees are illustrated inconsistently from picture to picture. They look like they were drawn by different people working off of similar prompts and with similar materials.
The variational feature is not what I’m talking about; I mean something like “Draw a dragon sleeping on a pile of gold, working in a supermarket, and going to a tea party with a unicorn, with the same dragon in each image”.
Goalpost moving. DALL-E 2 can generate samples matching lots of complex descriptions which are not ‘noun phrases’, and GLIDE is even better at it (also covered in the paper). You said it can’t. It can. Even narrowly, your claim is poorly supported, and for the broader discussion this is in the context of, misleading. You also have not provided any sources or general reasons for this sweeping assertion to be true, or for the broader implications you claimed these are good support for.
What happened to ‘noun phrases’?
Those images are stylistically coherent in being clearly in a pastel style and matching the text input. That meets the demand, and this is only a quick throwaway project establishing a lower bound on what DALL-E 2 can do. “Attacks only get better.”
That they are not, in addition to this, perfectly consistent with each other is too bad, but increased similarity is well within the scope of a DALL-E architecture through, just off the top of my head, the variations functionality, direct optimization by backprop, or CLIP rejection sampling.
You also have not provided any sources or general reasons for this sweeping assertion to be true.
I don’t know why you look at that and say it’s not.
You also have not provided any sources or general reasons for this sweeping assertion to be true.
Don’t be a dick. As a moderator of my own post, I request that you change this to not be insulting.
This is an absurd moderation policy to use on a post which you wrote for the purpose of learning something. It strongly signals that you care much less about learning anything than you do about fussy politeness norms.
It isn’t absurd, and it doesn’t at all signal what you say it signals. Alex_Altair didn’t ask for the comment to be removed, but for it to be changed to something more kind. Learning something is important, kindness is as well.
Given that Alex not just asked, but in fact did remove gwern’s comment, it would seem that you’re mistaken. It turns out that learning something is not important to the OP, after all.
I should have checked that, I’ll admit. Still, the comment wasn’t immediately removed and gwern did get the chance to change it (and still can create a new, more polite version btw).
Also, learning something from a comment and deleting it aren’t mutually exclusive.
If, hypothetically, I were to make a post asking a question and soliciting information, and then someone replied to my question-post with a comment that contained a large amount of exactly the sort of information I had asked for, but then, rather than thanking that person for taking the time out of their day to contribute their knowledge and helping me, and anyone else reading the post, to learn precisely the things I was ostensibly trying to learn, I instead chastised the respondent for some slightly abrasive language and demanded[1] that they take more time to go back and edit their post to conform to my exacting standards of politeness…
… well, I find this scenario embarrassing even to contemplate. I hope that I never display such a degree of intellectual arrogance and close-mindedness.
As for then deleting the response in question, that is so egregiously foolish, petulant, and petty that I can’t even imagine doing it. Are we to believe that because a highly informative comment contains a single mildly sarcastic remark, it would therefore be better that the members of Less Wrong not even be exposed to it? Or are you suggesting that OP read the comment, learned from it, and then deleted it because he had gotten all the value he could from it, and as for the rest of us, we can go hang?
Why in the world would gwern want to do this? Do you think that writing comments on OP’s posts is such a singular privilege that obviously gwern is going to take the time to carefully rewrite his comments to OP’s standards? Is there any reason, at all, why the response to this sort of treatment should be anything other than a shrug and walking away? Do you think that allowing gwern (or anyone else) to comment on his posts is a favor that OP is doing him?
Phrasing the demand as a “request”, prior to enforcing said demand, does not actually make it a request—merely a lie as well.
I hadn’t thought of it like this. You’re right, deleting the comment was a bad action.
I would want to, or at least hope I would. Not that that is a reason why gwern should want it too, but it shines some light on why I think others may want to.
Good point. No, but kindness is still important. (Note I don’t think gwern was that unkind. I just think Alex had a point, and think your reaction to Alex was unfair.)
It seems we have reached partial agreement though; thank you for your time.
Meta: I disagree with Alex’s decision to delete Gwern’s comment on this answer. People can reasonably disagree about the optimal balance between ‘more dickish’ (leaves more room for candor, bluntness, and playfulness in discussions) and ‘less dickish’ (encourages calm and a focus on content) in an intellectual community. And on LW, relatively high-karma users like Alex are allowed to moderate discussion of their posts, so Alex is free to promote the balance he thinks is best here.
But regardless of where you fall on that spectrum, I think LW should have a soft norm that substance trumps style, content is king, argument will be taken seriously on its own terms even if it’s not optimally packaged and uses the wrong shibboleths or whatever.
Deleting substantive, relevant content entirely should mostly not be one of the ‘game moves’ people use in advancing their side of the Dickishness Debate—it’s not worth it on its own terms, it’s not worth it as a punishment for the other side (even if the other side is in fact wrong and you’re right), and it erodes an important thing about LW.
Gwern’s comment had tons of content beyond that one sentence that was phrased a bit rudely; and it spawned a bunch of discussion that’s now hard to follow, on a topic that actually matters. Deleting the whole comment, without copy-pasting all or most of it anywhere, seems bad to me.
I appreciate this comment!
I’m interested in responding to you, Rob, because I already know you to be an entirely reasonable person, and also because I think this is somewhat of a continuation of a difference between you and me in real life. I might bail at any time though, because the fact that posters can have their own custom moderation policy means that I don’t feel particularly obligated to justify myself.
(For context for the rest of this comment, the line I had a problem with was, “‘noun phrases’ is an odd typo for ‘sentences’. They’re not even close to each other on the keyboard.”)
I agree with this, and I think it’s already true. But I also think you worded it too softly to be in contradiction to my comment deletion (and more generally the implicit policy in my head). LW definitely does have said soft norm; I think allowing users to moderate their own posts, and users occasionally doing so, preserves that norm! Never deleting a comment, no matter where it lay on the dickish spectrum, would I think constitute a hard norm. The line I had a problem with was far beyond “not optimally packaged” and had nothing to do with using the “wrong shibboleths”.
I’ll note that I gave them plenty of time and opportunity to edit the comment; I requested it in the comments, I requested it in a PM, and I saw that they made other comments on the same post much later.
It spawned two comments (which you can still read, though I agree they’re harder to follow). I agree that the rest of said content was substantive and relevant. If there had actually been a whole lot more, I probably wouldn’t have deleted the parent. I just don’t think it was enough to tip the scale. And there’s tons of other similar discussion all over the comments, especially by Gwern. And like, everyone involved is free to reiterate the substantive and relevant content. (The option of copy-pasting someone’s comment feels weird to me, for reasons that I haven’t quite explicated and don’t feel super relevant.)
It really wasn’t that big of a comment. But also, I think our core disagreement might be here under “phrased a bit rudely”. I think the criterion I’m using implicitly here is that, if the entire purpose of the statement is to insult someone, then it’s out. That sentence did not have any other purpose. It’s not just phrasing that comes across a bit too harsh. It was equivalent to, “You’re an idiot”. This is far from the side of the spectrum that requires finicky social norms. I’m not asking that people make sure that no one could be offended by what they’re saying. I’m not even saying that people shouldn’t say true, useful negative things about others. If it’s important to figuring out AI alignment that we discuss how smart a particular person is, then so be it. That’s a far cry from statements whose only purpose is to insult.
I’d actually be curious to know where the line is for you. If someone literally said, “You’re an idiot”, would you call that too dickish? Or what if someone’s comment had insulting profanities?
It might be worth making sure that the author of a deleted comment can still read it so they can repost it on their shortform or a similar place.
Authors of deleted comments receive the text of the comment in a PM
Commenting to note that I agree (though I would put the matter in much stronger terms).
(These are all quantitative factors. If Gwern’s overall comment had sucked more, or his sentence had been way more egregious, I’d have objected a lot less to Alex’s call. But it does matter where we put rough quantitative thresholds.)
It seems like the applications of DL that have generated useful products so far have been in the areas in which a useful result is easy or economical to verify, safe to test, close to the research itself, and in areas where small failures are inconsequential. Gwern’s list of applications indicates that this lies mostly in the realm of software engineering infrastructure, particularly for consumer products.
Unfortunately, it seems that the technologies that would most impress us are not bottlenecked by the fast-and-facile intelligence of a GPT-3.
One area that I would have hoped GPT-3 could contribute to would be learning: an automated personal tutor could revolutionize education in a way that MOOCs cannot. Imagine a chatbot with GPT-3′s conversational abilities that could also draw diagrams like DALL-E.
Unfortunately, GPT-3 just isn’t reliable enough for that. Actually, it’s still deeply problematic, because its explanations and answers to technical questions seem plausible to a novice, but are incorrect and lack deep understanding. So it’s currently smart enough to mislead, but not smart enough to educate.
Seconded. AI is good at approximate answers, and bad at failing gracefully. This makes it very hard to apply to some problems, or requires specialized knowledge/implementation that there isn’t enough expertise or time for.
For most products to be useful, they must be (perhaps not perfectly, but near-perfectly) reliable. A fridge that works 90% of the time is useless, as is a car that breaks down 1 out of every 10 times you try to go to work. The problem with AI is inherently that it’s unreliable: we don’t know how the inner algorithm works, so it just breaks at random points, especially because most of the tasks it handles are really hard (which is why we can’t just use classical algorithms). This makes it really hard to integrate AI until it gets really good, to the point where it can actually be called reliable.
The things AI is already used for are things where reliability doesn’t matter as much. Advertisement algorithms just need to be as good as possible to make the company as much revenue as possible. People currently use machine translation just to get the message across and not for formal purposes, making AI algorithms sufficient (if they were better maybe we could use them for more formal purposes!). The list goes on.
I honestly think AI won’t become super practical until we reach AGI, at which point (if we ever get there) its usage will explode due to massive applicability and solid reliability (if it doesn’t take over the world, that is).
For all the hypothetical products I listed, I think this level of unreliability is totally fine! Even self-driving cars only need to beat the reliability of human drivers, which I don’t think is that far from achievable.
Mostly #6 - there is a LOT of deep learning (and other advanced modeling that’s not specifically DNN) out there, but it’s generally for commercial decisions, not as much in consumer products. And rarely is it very visible what mechanisms are being used—that sort of detail is lawsuit-bait.
I think the main thing is that the ML researchers with enough knowledge are in short supply. They are:
doing foundational ai research
being paid megabucks to do the data center cooling ai and the smartphone camera ai
freaking out about AGI
The money and/or lifestyle isn’t in procedural Spotify playlists.
DeepMind have delivered AlphaFold thereby solving a really important outstanding scientific problem. They have used it to generate 3D models of almost every human protein (and then some) which have been released to the community. This is, actually, a huge deal. It will save many many millions in research costs and speed up the generation of new therapeutics.
The US GDP is 21 trillion. Saving millions of research dollars is a rounding error and not significant economic value.
There’s no sign of Eroom’s law stopping and being reversed by discoveries like AlphaFold.
OK, the question asked for demonstration of economic value now and I grant you AlphaFold, which is solely a research enabler, has not demonstrated that to date. Whether AlphaFold will have a significant role in breaking Eroom’s law is a good question but cannot be answered for at least 10 years. I would still argue that the future economic benefits of what has already been done with AlphaFold and made open access, are likely to be substantial. Consider Alzheimer’s. The current global economic burden is reckoned to be $300 B, p.a. rising in future to $1T. If, say, an Alzheimer’s drug that halved the economic cost, was discovered 5 years earlier on account of AlphaFold the benefit would run to at least $0.75 T in total. This kind of possibility is not unreasonable (for Alzheimer’s replace with your favourite druggable high economic value medical condition)
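For anyone who wants to check the arithmetic behind the "$0.75 T" figure, here is a minimal back-of-envelope sketch. It only multiplies out the numbers given above ($300 B per year, cost halved, 5 years earlier); the function name and the flat, undiscounted annual burden are simplifying assumptions of the sketch, not a real health-economics model.

```python
# Back-of-envelope benefit of finding a drug earlier.
# All inputs come from the comment above; the flat (undiscounted)
# annual burden is a simplifying assumption of this sketch.

def early_discovery_benefit(annual_burden_usd, cost_reduction, years_earlier):
    """Total cost avoided by having the drug `years_earlier` years sooner."""
    return annual_burden_usd * cost_reduction * years_earlier

# Current burden: $300 B per year, halved, 5 years earlier.
low_estimate = early_discovery_benefit(300e9, 0.5, 5)

# Projected future burden: $1 T per year, same assumptions.
high_estimate = early_discovery_benefit(1e12, 0.5, 5)

print(f"low:  ${low_estimate / 1e12:.2f} T")
print(f"high: ${high_estimate / 1e12:.2f} T")
```

The "at least" in the comment corresponds to the low estimate, which uses today's burden rather than the projected $1 T per year.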
It’s unclear to me why we should expect protein-structure prediction to be the bottleneck for finding an Alzheimer cure.
Not a bottleneck so much as a numbers game. Difficult diseases require many shots on goal to maximise the chance of a successful program. That means trying to go after as many biological targets as there are rationales for, and a variety of different approaches (or chemical series) for each target. Success may even require knocking out two targets in a drug combination approach. You don’t absolutely need protein structures of a target to have a successful drug-design program but using them as a template for molecular design (Structure-Based Drug Design) is a successful and well established approach and can give rise to alternative chemical series to non-structure based methods. X-ray crystal derived protein structures are the usual way in but if you are unable to generate X-Ray structures, which is still true for many targets, AlphaFold structures can in principle provide the starting point for a program. They can also help generate experimental structures in cases where the X-ray data is difficult to interpret.
Most of the money spent in developing drugs is not about finding targets but about running clinical studies to validate targets.
The time when structure-based drug design became possible did not coincide with drug development getting cheaper.
I agree with you on both counts. So, I concede, saving millions in research costs may be small beer. But I don’t see that invalidates the argument in my previous comment, which is about getting good drugs discovered as fast as is feasible. Achieving this will still have significant economic and humanitarian benefit even if they are no cheaper to develop. There are worthwhile drugs we have today that we wouldn’t have without Structure-Based Design.
The solving of the protein folding problem will also help us to design artificial enzymes and molecular machines. That won’t be small potatoes either IMO.
AI tech seen in the wild: I’ve been writing C# in MS Visual Studio for my current job, and now have full-line, AI-driven code completion out of the box that I’m finding useful in practice. Much better than anything I’ve seen for smartphones or e.g. gmail text composition. In one instance it correctly inferred an entire ~120 character line including the entire anonymous function I was passing into the method call. It won’t do the tricky parts at all, but regardless does wonders for cutting through drudgery and general fatigue. Sure feels like living in the future!
VS has long had non-AI completion of the next token that is already very good (.NET/C# being strongly typed is a huge boon for these kinds of inferences). I imagine that extra context is why this is performing so much better than general text completion.
What code completion service are you using? Codex/Copilot?
Looks like it’s just whatever ships with VS 2022: https://devblogs.microsoft.com/visualstudio/type-less-code-more-with-intellicode-completions/ ; No idea if it’s actually first party, whitelabel/rebranded, or somewhere in between.
I’d guess it’s GPT3 running on Azure, as Microsoft has licensed the full version to resell on Azure. See also
Let me suggest an alternate answer: there is a lot of resistance to AI coming from the media and the general public. A lot of this is unconscious, so you rarely hear people say “I hate AI” or “I want AI to stop.” (You do hear this sometimes, if you listen closely.) This has the consequence that our standards for deploying AI in a consumer-facing way are absurdly high, leading to ML mostly being deployed behind the scenes. That’s why we see a lot of industrial and scientific use of deep learning, as well as some consumer-facing cases in risk-free contexts. (It’s hard to make the case that e.g. speech-to-text is going to kill anyone.)
If safety wasn’t (so much of) an issue, we could have deployed self driving cars as early as the 1990s. As a thought experiment, imagine that 2016-level self driving technology was available to the culture and society of 1900. 1900 was a pivotal year for the automobile, and at that time, our horse-based transportation system was causing a lot of problems for big cities like New York. If you live in a big city today, you might find yourself wondering how it came to be that we live with big, fast, noisy, polluting machines clogging up our cities and denying the streets to pedestrians. Well, horses were a lot worse, and people in 1900 saw the automobile as their savior. (Read the book Internal Combustion if you want the whole story. Great book.)
The society of 1900, or 1955 for that matter, would have embraced 2016-level self driving with a passion. Good transportation saves lives, so they would not have quibbled about it being slightly less safe than a sober driver or weird edge cases like the car getting stuck sometimes. But the society of 20XX has an extremely high standard for safety (some would say unreasonably high) and there are a lot of people who are afraid of AI, even if they won’t say as much explicitly. It’s a little like nuclear power, where the new vaguely scary technology is resisted by society.
I agree with what Gwern said about things being behind-the-scenes, but it’s also worth noting that there are many impactful consumer technologies that use DL. In fact, some of the things that you don’t think exist actually do exist!
Google Translate: https://www.washingtonpost.com/news/innovations/wp/2016/10/03/google-translate-is-getting-really-really-accurate/
Google Search: https://blog.google/products/search/search-language-understanding-bert/
PhotoShop: https://blog.adobe.com/en/publish/2020/10/20/photoshop-the-worlds-most-advanced-ai-application-for-creatives
Examples of other DL-powered consumer applications
Grammarly: https://www.grammarly.com/blog/how-grammarly-uses-ai/
Apple FaceID: https://support.apple.com/en-us/HT208108
JP Morgan Chase: https://www.jpmorgan.com/technology/applied-ai-and-ml
Google search gets less usable every year, even for Scholar, which has a much less adversarial search space. It’s better for very common searches like popular tv shows, but approaching worthlessness for long tail stuff. Maybe this is just “search is hard”, but improving the common case at the cost of the long tail is exactly what I’d expect AI search to do.
I wonder how we’d go about designing a reward signal for the long-tailed stuff.
One thing I’d really like to see is reward for diversity of results. Bringing me the same listicle with slight rewrites 10 times provides no value while pushing out better results.
A friend of mine doing an ML PhD claims it’s possible to train a search engine to identify the shitty pages that might as well have been written by GPT-3, even if that’s not literally true. I’m skeptical this can be done in a way that keeps up with the adversarial adaptation, but it would be cool if it did.
Just ran into the listicle problem myself; it effectively slew searching Google for anything where I don’t already know most of what I need. It feels weird that in the name of ad revenue the algorithm promotes junk whose sole purpose is also to generate ad revenue. Process seems to be cannibalizing itself somehow.
It would be cool to filter GPT-3-ish things. It seems like we could get most of the diversity without anything very sophisticated; something like negatively weighting hits according to how many other results have similar/very similar content. If all the pages containing some variation of “Top #INT #VERB #NOUN” could get kicked to the bottom of the rankings forever, I’d be a happy camper.
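To make the "negatively weight hits by how many other results look similar" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `rerank_for_diversity` name, word-set Jaccard similarity as the notion of "similar content", and the `threshold` and `penalty` values. A real search engine would use something much more robust than word overlap.

```python
from typing import List, Tuple

def jaccard(a: set, b: set) -> float:
    """Word-set overlap between two results, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def rerank_for_diversity(
    results: List[Tuple[str, float]],
    threshold: float = 0.6,   # assumed cutoff for "very similar"
    penalty: float = 0.5,     # assumed score cost per near-duplicate
) -> List[Tuple[str, float]]:
    """Subtract a penalty from each hit for every other near-duplicate hit."""
    token_sets = [set(text.lower().split()) for text, _ in results]
    adjusted = []
    for i, (text, score) in enumerate(results):
        near_dupes = sum(
            1
            for j in range(len(results))
            if j != i and jaccard(token_sets[i], token_sets[j]) >= threshold
        )
        adjusted.append((text, score - penalty * near_dupes))
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda r: r[1], reverse=True)

# Toy results: three rewrites of one listicle plus one distinct page.
hits = [
    ("top 10 ways to fix a leaky faucet", 1.0),
    ("top 10 ways to fix a leaky tap", 0.95),
    ("top 10 easy ways to fix a leaky faucet", 0.9),
    ("plumber's guide to washer replacement in compression faucets", 0.8),
]
ranked = rerank_for_diversity(hits)
for text, score in ranked:
    print(f"{score:+.2f}  {text}")
```

In this toy case the three listicle rewrites each get penalised for the other two, so the one distinct page moves to the top even though its raw score was lowest. Note that this punishes every member of a duplicate cluster equally; keeping the single best copy would need a greedy selection instead.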
If adversarial adaptation means that shitty pages need to appear as good pages with solid argumentation, it seems like a win to me.
Elon Musk said a few weeks ago that Tesla’s main strategy right now is to slash the cost of personal transportation by 4x by perfecting full-self-driving AI, and attempting to achieve that this year. (Relatedly, they’re not allocating resources to making an even cheaper version of the Model 3 because it wouldn’t be 4x cheaper anyway.)
Making good on Musk’s claim would probably add another $trillion to Tesla’s market cap in short order.
Even if Tesla’s self-driving technology freezes at its current level, it’s clearly added value to the cars. Maybe not $10,000 per car or whatever they are charging for it, but probably at least $1,000. Multiply that by the million or so cars they sell per year, and that’s a billion dollars of economic value due to recent deep learning advances.
Of course, a billion is not a trillion. Plausibly by “significant” the OP meant something more like a trillion.
I work at a large, not quite FAANG company, so I’ll offer my perspective. It’s getting there. Generally, the research results are good, but not as good as they sound in summary. Despite the very real and very concerning progress, most papers you take at face value are a bit hyped. The exceptions to some extent are the large language models. However, not everyone has access to these. The open source versions of them are good but not earth shattering. I think they might be if the goal is to generate fluent-sounding chatbots, but this is not the goal of most work I am aware of. Companies, at least mine, are hesitant on this because they are worried the bot will say something dumb, racist, or just made-up. Most internet applications are more to do with recommendation, ranking, and classification. In these settings large language models are helping, though they often need to be domain adapted. In those cases they are often only helping +1-2% over well trained classical models, e.g. logistic regression. Still a lot revenue-wise though. They are also big and slow and not suited for every application yet, at least not until the infrastructure (training and serving) catches up. A lot of applications are therefore comfortable iterating on smaller end-to-end trained models, though they are gradually adopting features from large models. They will get there, in time. Progress is also slower in big companies, since (a) you can’t simply plug in somebody’s huggingface model or code and be done with it, (b) there are so many meetings to be had to discuss ‘alignment’ (not that kind) before anything actually gets done.
For some of your examples:
* procedurally generated music. From what I’ve listened to, the end-to-end generated music is impressive but not impressive enough that I would listen to it for fun. They seem to have little large scale coherence. However this seems like someone could step in and introduce some inductive bias (for example, verse-bridge-chorus repeating song structure), and actually get something good. Maybe they should stick to instrumental and have a singer-songwriter riff on it. I just don’t think any big name record companies are funding this at the moment, probably they have little institutional AI expertise and think it’s a risk, especially to bring on teams of highly paid engineers.
* tools for writers to brainstorm. I think GPT-3 has this as an intended use case? At the moment there are few competitors to make such a large model, so we will see how their pilot users like it.
* photoshop with AI tools. That sounds like it should be a thing. Wonder why Adobe hasn’t picked that up (if they haven’t? if it’s still in development?). Could be an institutional thing.
* Widely available self driving cars. IMO real-world agents are still missing some breakthroughs. I think that’s one of the last hurdles that will be cleared on the way to AGI. It’ll happen but I would not be surprised if it is slower than expected.
* Physics simulators. Not sure really. I suspect this might be a case of overhyped research papers. Who knows? I actually used to work on this in grad school, using old fashioned finite difference / multistep / RK methods. Usually relying on Taylor series coefficients canceling out nicely, or doing Gaussian quadrature. On the one hand I can imagine it hard to beat such precisely defined models, but on the other hand, at the end of the day it’s sort of assuming nice properties of functions in a generic way, I can easily imagine a tuned DL stencil doing better for specific domains, e.g. fluids or something. Still, it’s hard to imagine it being a slam dunk rather than an iterative improvement.
* Paradigmatically different and better web search. I think we are actually getting there. When I say “hey google”, I actually get very real answers to my questions 90% of the time. It’s crazy to me. Kids love it. Though I may be in the minority. I always see reddit threads about people saying that google search has gotten worse. I think there’s a lot of people who are very used to keyword based searches and are not used to the model trying to anticipate them. This will slow adoption since metrics won’t be universally lifted across all users. Also, there’s something to be said for the goodness of old fashioned look up tables.
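On the physics simulator point above, here is a small sketch of why the classical methods are hard to beat. It compares forward Euler with the classical fourth-order Runge-Kutta (RK4) step on a toy problem, dy/dt = -y with y(0) = 1, whose exact solution is known. The toy ODE, the step count, and the helper names are assumptions chosen for illustration, not anything from a real simulator.

```python
import math

def euler_step(f, t, y, h):
    """Forward Euler: matches the Taylor series only to first order in h."""
    return y + h * f(t, y)

def rk4_step(f, t, y, h):
    """Classical RK4: stage weights are chosen so that Taylor terms
    cancel through fourth order in h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(step, f, y0, t_end, n_steps):
    """Advance y from t=0 to t=t_end using n_steps equal steps."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y = step(f, t, y, h)
        t += h
    return y

def decay(t, y):
    """Toy ODE: dy/dt = -y, exact solution y(t) = exp(-t)."""
    return -y

exact = math.exp(-1.0)
euler_err = abs(integrate(euler_step, decay, 1.0, 1.0, 10) - exact)
rk4_err = abs(integrate(rk4_step, decay, 1.0, 1.0, 10) - exact)

print(f"Euler error with 10 steps: {euler_err:.2e}")
print(f"RK4 error with 10 steps:   {rk4_err:.2e}")
```

RK4 costs four function evaluations per step instead of one, but on a smooth problem like this its error is several orders of magnitude smaller for the same step size. That is the "assuming nice properties of functions in a generic way" trade: it works extremely well when the smoothness assumption holds, which is the bar a learned stencil would have to clear.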
My take on your reasons—they are mostly spot on.
1. Yes | The research results are actually not all that applicable to products; more research is needed to refine them
2. Yes | They’re way too expensive to run to be profitable
3. Yes | Yeah, no, it just takes a really long time to convert innovation into profitable, popular product
4. No, but possibly institutional momentum | Something something regulation?
5. No | The AI companies are deliberately holding back for whatever reason
6. Yes, incrementally | The models are already integrated into the economy and you just don’t know it.
Given some of it is institutional slowness, there is room for disruption, which is probably why VCs are throwing money at people. Still though, in many cases a startup is going to have a hard time competing with the compute resources of larger companies.
To put it broadly, the answer is because AI is way, way more unreliable than is worth it to use for a large portion of jobs, and capabilities improvements have not generally fixed this issue.
An update I’ve made on AI is that a big reason it hasn’t impacted the job market or economy is that far more jobs require far more reliability and error correction than current AI provides. Just because an AI has a certain capability doesn’t mean it will exercise it reliably. This is one of the reasons I expect the first use of AI to be in AI research, which has lower reliability requirements, and also why I think progress can be both discontinuous and continuous at the same time, just on different axes.
Alyssa Vance got it very right here in Humans are very reliable agents:
https://www.lesswrong.com/posts/28zsuPaJpKAGSX4zq/humans-are-very-reliable-agents
Great list of RL use cases: https://mighty-melody-f4b.notion.site/RL-for-real-world-problems-0114c270e5d94894b3c4f227e24401db
Insofar as the answer isn’t what gwern already pointed out: bigger, more visible, and more ambitious software projects take longer to realize, you’re more likely to hear about their failures, and they may not be viable until more of the operational kinks get worked out on more manageable projects. As much novel stuff as DL has enabled, we’re still not quite mature enough that a generalist is wise to pull DL tools into a project that doesn’t clearly require them.
First, the powerful. Then, the rich. Finally, you. The illusion this community provides of an academic scientific establishment begrudgingly beholden to a capitalist economy serving a consumerist society is fake.
Faker than the medieval assumption of a clergy serving an aristocracy serving the peasantry. The economy is fake. It’s not hard to predict because it’s complex, it’s hard to predict because it literally doesn’t exist.
Reason is but the first step along the staircase of your ability to sense truth. Keep climbing!