it is painful to have to spend tokens on few-shot examples when I just know the model could 0-shot the task in principle
Is your situation that few-shot prompting would actually work but it is too expensive? If so, then possibly doing a general purpose few-shot prompt (e.g. a prompt which shows how to generally respond to instructions) and then doing context distillation on this with a model like GPT-3.5 could be pretty good.
I find that extremely long few-shot prompts (e.g. 5 long reasoning traces) help with ~all of the things you mentioned to a moderate extent. That said, I also think a big obstacle is just core intelligence: sometimes it is hard to know how you should reason and the models are quite dumb...
I can DM you some prompts if you are interested. See also the prompts I use for the recent ARC-AGI stuff I was doing, e.g. here.
I find that recent Anthropic models (e.g. Opus) are much better at learning from long few-shot examples than OpenAI models.
Is your situation that few-shot prompting would actually work but it is too expensive? If so, then possibly doing a general purpose few-shot prompt (e.g. a prompt which shows how to generally respond to instructions) and then doing context distillation on this with a model like GPT-3.5 could be pretty good.
Yeah, that is the situation. Unfortunately, there is a large inference markup for OpenAI finetuning: GPT-3.5 becomes 6x (input) / 4x (output) more expensive when finetuned. So you only break even if you are distilling a fairly large quantity of prompt text, and even then, this is vastly more expensive than my fantasy dream scenario where OpenAI has already done this tuning for me as part of the standard-pricing GPT-3.5 model.
For what I’m doing, this level of expense is typically prohibitive in practice. The stuff I do typically falls into one of two buckets: either we are so stingy that even the cost of GPT-3.5 zero-shot is enough to raise eyebrows and I get asked to reduce cost even further if at all possible, or we are doing a one-time job where quality matters a lot and so GPT-4 or the like is acceptable. I realize that this is not the shape of the tradeoff for everyone, but this is a situation that one can be in, and I suspect I am not the only one in it.
Anyway, I didn’t mean to say that today’s models can’t be cajoled into doing the right thing at all. Only that it requires time and effort and (more importantly) added expense, and I’m frustrated that I don’t get these generic capabilities out of the box. I shouldn’t have to do extra generic instruction tuning on top of GPT-3.5. That’s OpenAI’s job. That’s supposed to be the product.
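To make the break-even point concrete, here is a rough sketch of the arithmetic. The 6x/4x multipliers are the ones mentioned above; the function name, the token counts, and the base prices in the usage lines are illustrative assumptions, not quoted rates.

```python
def break_even_prompt_tokens(query_tokens, output_tokens,
                             base_in_price, base_out_price,
                             in_mult=6.0, out_mult=4.0):
    """Few-shot prompt length (in tokens) at which a finetuned model with no
    prompt costs the same per request as the base model with the prompt.

    Base cost:      base_in * (prompt + query) + base_out * output
    Finetuned cost: in_mult * base_in * query + out_mult * base_out * output
    Setting the two equal and solving for `prompt` gives the expression below.
    """
    price_ratio = base_out_price / base_in_price
    return (in_mult - 1) * query_tokens + (out_mult - 1) * price_ratio * output_tokens


# Hypothetical workload: 200 query tokens, 100 output tokens, and output
# tokens priced at 3x input tokens (placeholder prices per million tokens).
tokens = break_even_prompt_tokens(200, 100, base_in_price=0.5, base_out_price=1.5)
print(tokens)
```

Under these assumed numbers, distillation only pays off once the prompt being distilled is longer than about 1,900 tokens, i.e. nearly ten times the length of the query itself.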
EDIT: also, I basically agree with this
That said, I also think a big obstacle is just core intelligence: sometimes it is hard to know how you should reason and the models are quite dumb...
but I also wonder whether this framing is too quick to accept the limitations of the HHH Assistant paradigm.
For instance, on complex tasks, I find that GPT-4 has a much higher rate of “just doing it right on the first try” than GPT-3.5. But in the occasional cases where GPT-4 starts off on the wrong foot, it stays on the wrong foot, and ends up sounding as confused/dumb as GPT-3.5.
Which makes me suspect that, if GPT-3.5 were tuned to be more robust to its own mistakes, along the lines of my final bullet point in OP (I’m imagining things like “whoops, I notice I am confused, let me repeat back my understanding of the task in more detail”), it would be able to close some of the distance to GPT-4. The situation is less “GPT-4 is smart enough to do the task” and more “both models are extremely non-robust and one slight misstep is all it takes for them to fail catastrophically, but GPT-4 is smart enough to compensate for this by simply taking zero missteps in a large fraction of cases.”
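A toy model makes this point easier to see. Suppose, purely as an assumption for illustration, that a task has a fixed number of independent steps, the model missteps on each step with some probability, and it notices and recovers from a misstep with some other probability; any unrecovered misstep sinks the whole attempt. None of the numbers below are measurements of GPT-3.5 or GPT-4.

```python
def task_success_rate(misstep_prob, n_steps, recovery_prob=0.0):
    """Probability of finishing a task with no unrecovered misstep.

    Toy assumption: steps are independent, and a single unrecovered
    misstep causes the whole attempt to fail.
    """
    unrecovered = misstep_prob * (1.0 - recovery_prob)
    return (1.0 - unrecovered) ** n_steps


# Hypothetical numbers, chosen only to illustrate the shape of the argument.
weak_brittle = task_success_rate(0.05, 20)                    # frequent missteps, no recovery
strong_brittle = task_success_rate(0.01, 20)                  # rare missteps, no recovery
weak_robust = task_success_rate(0.05, 20, recovery_prob=0.8)  # frequent missteps, usually recovers

print(round(weak_brittle, 2), round(strong_brittle, 2), round(weak_robust, 2))
```

In this toy setup the brittle weak model succeeds about 36% of the time, while the other two both land near 82%: recovering from four out of five missteps buys as much as making five times fewer of them.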
Have you looked at the new Gemini ‘prompt caching’ feature, where it stores the hidden state for reuse to save the cost of recomputing multi-million token contexts? It seems like it might get you most of the benefit of finetuning. Although I don’t really understand their pricing (is that really $0.08 per completion...?) so maybe it works out worse than the OA finetuning service.
EDIT: also of some interest might be the new OA batching API, which is half off as long as you are willing to wait up to 24 hours (but probably a lot less). The obvious way to implement it would be to do something like prompt caching and exploit the fact that probably most of the requests to such an API will share a common prefix, in addition to the benefit of being able to fill up idle GPU-time and shed load.
Yeah, it’s on my radar and seems promising.
Given that Gemini 1.5 Flash already performs decently in my tests with relatively short prompts, and it’s even cheaper than GPT-3.5-Turbo, I could probably get a significant Pareto improvement (indeed, probably an improvement on all fronts) by switching from {GPT-3.5-Turbo + short prompt} to {Gemini 1.5 Flash + long cached prompt}. Just need to make the case it’s worth the hassle...
EDIT: oh, wait, I just found the catch.
The minimum input token count for context caching is 32,768
Obviously nice for truly long context stuff, but I’m not going to add tens of thousands of tokens to my prompt just for the privilege of using this feature.
Yeah, I was thinking that you might be able to fill the context adequately, because otherwise you would have to be in an awkward spot where you have too many examples to cheaply include them in the prompt to make the small cheap models work out, but also still not enough for finetuning to really shine by training a larger high-end model over millions of tokens to zero-shot it.
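For a sense of scale, the gap between an ordinary few-shot prompt and the caching minimum can be computed directly. The 32,768 figure is the minimum quoted above; the 3,000-token prompt and the helper name are hypothetical.

```python
MIN_CACHE_TOKENS = 32_768  # minimum input token count quoted above


def padding_needed(prompt_tokens, minimum=MIN_CACHE_TOKENS):
    """Extra tokens a prompt would need before it qualifies for caching."""
    return max(0, minimum - prompt_tokens)


# Hypothetical few-shot prompt of 3,000 tokens.
extra = padding_needed(3_000)
growth_factor = round(MIN_CACHE_TOKENS / 3_000, 1)
print(extra, growth_factor)
```

A 3,000-token prompt would need 29,768 more tokens to qualify, i.e. it would have to grow roughly elevenfold.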