Yeah, I’m not at all confident in this model, but I do suspect it’s underrated. I’m reminded of something Andrew Ng mentioned in his machine learning class: he would run into machine learning projects that didn’t have much data, and he would ask the team to do a back-of-the-napkin calculation of how much time it would take them to hand-label more data. He said that oftentimes, with just a week spent hand-labeling data, they’d dramatically increase the amount of data available and improve their algorithm’s performance. It’s not clever or “scalable”, but sometimes the solution that doesn’t scale is the best one.
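(To make the napkin concrete, here’s a sketch of that calculation with entirely made-up numbers; the seconds-per-label and hours-per-day figures are pure assumptions, not anything Ng said:)

```python
# Back-of-the-napkin labeling budget, with made-up numbers:
seconds_per_label = 10      # assumption: a simple keypress/click decision
hours_per_day = 4           # assumption: part-time, to avoid burnout
days = 5
labels = days * hours_per_day * 3600 // seconds_per_label
print(labels)               # 7200 labels from one week of half-days
```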
This has definitely been true in my experience with ML/DL so far. If you grit your teeth and put a bit of effort into a reasonably low-latency script for hand-labeling, you can often label a few hundred or thousand datapoints in a feasible amount of time, and that will be enough to work with for finetuning (eg. PaLM) or training an embedding (eg. StyleGAN latents).
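A hedged sketch of what such a throwaway labeling script can look like (stdlib only; the `unlabeled/` directory, the binary task, and the CSV output are all assumptions for illustration, not any particular tool):

```python
# Minimal low-latency hand-labeling loop. Assumptions (all hypothetical):
# files to label sit in ./unlabeled/, the task is binary, and labels append
# to labels.csv so you can quit and resume at will.
import csv
import os

LABELS = {"y": 1, "n": 0}

def already_labeled(path="labels.csv"):
    # Skip anything labeled in a previous session.
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row[0] for row in csv.reader(f) if row}

def main():
    done = already_labeled()
    with open("labels.csv", "a", newline="") as out:
        writer = csv.writer(out)
        for name in sorted(os.listdir("unlabeled")):
            if name in done:
                continue
            # For images, you would pop the file open in a viewer here;
            # for text, just print it.
            key = input(f"{name} [y/n, q to quit]: ").strip().lower()
            if key == "q":
                break
            if key in LABELS:
                writer.writerow([name, LABELS[key]])

if __name__ == "__main__":
    main()
```

The point is just that one keypress per item, with resume-on-restart, keeps the marginal cost of a label down to seconds, which is what makes a few thousand labels a weekend project rather than a program.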
And this is something that a lot of people have been learning with image generation models over the past 3 years: it is often way faster to just curate a small set of images to, eg. train a LoRA, than to try to train some sort of uber-model which fixes the problem out of the box, or to run some complicated fancy algorithm on top of the existing model, or to brute-force sample generation & selection. It’s not necessarily that the model doesn’t “know” the thing you want, it’s that you can’t tell it accurately. ‘A picture is worth a thousand words’, which is a lot of words to have to figure out! (Or, right now: we’ve been working with a guy on InvertOrNot.com, an API service which automatically classifies whether an image should be color-inverted for a website dark mode. While GPT-4-V can do the classification if it’s phrased as a pairwise comparison, one could never afford to offer that in bulk as the free public service we would like it to be. So… he finetuned a tiny cheap EfficientNet on ~1000 hand-labeled images, and it works great now and runs for free. You can’t easily “prompt” for this task AFAICT, but it’s easy to collect a few thousand examples to hand-label for a pretrained CNN.)
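In that spirit, a hedged sketch of finetuning a small pretrained CNN as a binary classifier on ~1000 hand-labeled images (EfficientNet-B0 via torchvision; the `labeled/` directory layout and all hyperparameters are my assumptions, not the actual InvertOrNot training code):

```python
# Finetune a pretrained EfficientNet-B0 as a 2-class image classifier.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard ImageNet preprocessing to match the pretrained backbone.
tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# Assumes labeled/invert/ and labeled/no-invert/ subdirectories.
data = datasets.ImageFolder("labeled/", transform=tf)
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

model = models.efficientnet_b0(weights="IMAGENET1K_V1")
# Swap the 1000-class ImageNet head for a 2-class head.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)
model = model.to(device)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(5):  # a few epochs is usually plenty at this data scale
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
```

At ~1000 images, inference with a model this size costs effectively nothing per image, which is the whole economic argument against the bulk-GPT-4-V approach.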
Of course, the models or services improve over time, and now you can often just zero-shot what you want; but one’s ambitions always grow to match… Whether it’s art or porn or business, once we can do the old thing we dreamed of, we soon sour on it and demand even more: something even more specific and precise and niche, and the only way to hit that ultra-niche will often be some sort of hand-labeled dataset.