This has definitely been true in my experience with ML/DL so far. If you grit your teeth and put a bit of effort into a reasonably low-latency script for hand-labeling, you can often label a few hundred or thousand datapoints in a feasible amount of time, and that will be enough to work with in finetuning (eg. PaLM) or training an embedding (eg. StyleGAN latents).
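(The labeling script itself can be trivial. As a sketch, not any particular tool: a keystroke-per-item loop that shows each datapoint, takes a one-character label, and appends immediately to a CSV so sessions are resumable. The directory names and label keys here are made up for illustration.)

```python
# Minimal keystroke-per-item hand-labeling loop: show an item, take one key,
# append to a CSV right away so the session is interruptible & resumable.
# Paths and label keys are illustrative, not from any particular project.
import csv, pathlib

ITEMS = sorted(pathlib.Path("unlabeled/").glob("*.txt"))  # hypothetical data dir
OUT = pathlib.Path("labels.csv")

# Skip anything already labeled, so re-running the script just resumes.
done = set()
if OUT.exists():
    with OUT.open() as f:
        done = {row[0] for row in csv.reader(f)}

with OUT.open("a", newline="") as f:
    writer = csv.writer(f)
    for item in ITEMS:
        if item.name in done:
            continue
        print("\n" + item.read_text()[:500])  # show (a prefix of) the datapoint
        label = input("label [g]ood/[b]ad/[s]kip/[q]uit> ").strip().lower()
        if label == "q":
            break
        if label == "s":
            continue
        writer.writerow([item.name, label])
        f.flush()  # never lose work to a crash
```

A few hundred labels an hour is realistic with a loop like this, which is why "a few thousand datapoints" is a weekend project rather than a data-collection campaign.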
And this is something that a lot of people have been learning with image generation models over the past 3 years: it is often far faster to curate a small set of images to, eg. train a LoRA, than to try to train some uber-model which fixes the problem out of the box, or to run some complicated fancy algorithm on top of the existing model, or to brute-force sample generation & selection. It's not necessarily that the model doesn't "know" the thing you want, it's that you can't tell it accurately. 'A picture is worth a thousand words', which is a lot of words to have to figure out! (Or right now: we've been working with a guy on InvertOrNot.com, an API service which automatically classifies images by whether they should be color-inverted for website dark modes. While GPT-4V can do this if phrased as a pairwise comparison, one could never afford to offer that in bulk as a free public service as we would like to. So… he finetuned a tiny cheap EfficientNet on ~1000 hand-labeled images, and it works great now and runs for free. Because you can't easily "prompt" for this task AFAICT, but it's easy to collect a few thousand examples to hand-label for a pretrained CNN.)
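(For concreteness, here is roughly what that kind of finetune looks like. This is a generic transfer-learning sketch using torchvision's pretrained EfficientNet-B0, not InvertOrNot's actual training code; the `data/invert/` vs `data/no_invert/` folder layout is an assumption for illustration.)

```python
# Generic transfer-learning sketch: finetune a pretrained EfficientNet-B0
# as a binary invert/don't-invert classifier on ~1000 hand-labeled images.
# Expects a hypothetical layout: data/invert/*.png, data/no_invert/*.png
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models

weights = models.EfficientNet_B0_Weights.DEFAULT
model = models.efficientnet_b0(weights=weights)
# Swap the 1000-class ImageNet head for a fresh 2-class head.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

# ImageFolder maps each subdirectory to a class label automatically;
# weights.transforms() applies the preprocessing the backbone expects.
dataset = datasets.ImageFolder("data/", transform=weights.transforms())
loader = DataLoader(dataset, batch_size=32, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # a few epochs is usually plenty at this scale
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        loss = loss_fn(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")

torch.save(model.state_dict(), "invertornot_effnet.pt")
```

The point of the example is the economics: once trained, inference on a model this small costs essentially nothing per image, where a GPT-4V pairwise comparison costs real money every single call.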
Of course, the models or services improve over time and now you can often just zero-shot what you want, but one's ambitions always grow to match… Whether it's art or porn or business, once we can do the old thing we dreamed of, we soon sour on it and demand more: something even more specific and precise and niche, and the only way to hit that ultra-niche will often be some sort of hand-labeled dataset.