I do a lot of “mundane utility” work with chat LLMs in my job[1], and I find there is a disconnect between the pain points that are most obstructive to me and the kinds of problems that frontier LLM providers are solving with new releases.
I would pay a premium for an LLM API that was better tailored to my needs, even if the LLM was behind the frontier in terms of raw intelligence. Raw intelligence is rarely the limiting factor on what I am trying to do. I am not asking for anything especially sophisticated, merely something very specific, which I need to be done exactly as specified.
It is demoralizing to watch the new releases come out, one after the next. The field is overcrowded with near-duplicate copies of the same very specific thing, the frontier “chat” LLM, modeled closely on ChatGPT (or perhaps I should say modeled closely on Anthropic’s HHH Assistant concept, since that came first). If you want this exact thing, you have many choices—more than you could ever need. But if its failure modes are problematic for you, you’re out of luck.
Some of the things I would pay a premium for:
Tuning that “bakes in” the idea that you are going to want to use CoT in places where it is obviously appropriate. The model does not jump to conclusions; it does not assert something in sentence #1 and then justify it in sentences #2 through #6; it moves systematically from narrow claims to broader ones. This really ought to happen automatically, but failing that, maybe some “CoT mode” that can be reliably triggered with a special string of words.
Getting chat LLMs to do 0-shot CoT is still (!) kind of a crapshoot. It’s ridiculous. I shouldn’t have to spend hours figuring out the magic phrasing that will make your model do the thing that you know I want, that everyone always wants, that your own “prompting guide” tells me I should want. (I sketch a concrete example of what I mean right after this wishlist.)
Really good instruction-following, including CoT-like reviewing and regurgitation of instruction text as needed.
Giving complex instructions to chat LLMs is still a crapshoot. Often what one gets is a sort of random mixture of “actually following the damn instructions” and “doing something that sounds, out of context, like a prototypical ‘good’ response from an HHH Assistant—even though in context it is not ‘helpful’ because it flagrantly conflicts with the instructions.”
“Smarter” chat LLMs are admittedly better at this than their weaker peers, but they are still strikingly bad at it, given how easy instruction-following seems in principle, and how capable they are at various things that seem more difficult.
It’s 2024, the whole selling point of these things is that you write verbal instructions and they follow them, and yet I have become resigned to the idea that they will just ignore half of my instructions half of the time. Something is wrong with this picture!
Quantified uncertainty. Some (reliable) way of reporting how confident they are about a given answer, and/or how likely it is that they will be able to perform a given task, and/or similar things.
Anthropic published some promising early work on this back in mid-2022, the P(IK) thing (training models to predict the probability that they know the answer to a given question). When I first read that paper, I was expecting something like that to get turned into a product fairly soon. Yet it’s mid-2024, and still nothing, from any LLM provider.
A more humanlike capacity to write/think in an exploratory, tentative manner without immediately trusting what they’ve just said as gospel truth. And the closely related capacity to look back at something they’ve written and think “hmm, actually no, that argument didn’t work,” and then try something else. A certain quality of looseness, of “slack,” of noncommittal just-trying-things-out, which is absolutely required for truly reliable reasoning and which is notably absent in today’s “chat” models.
I think this stuff is inherently kind of hard for LLMs, even ignoring the distortions introduced by instruction/chat tuning. LLMs are trained mostly on texts that capture the final products of human thought, without spelling out all the messy intermediate steps, all the failed attempts and doubling back on oneself and realizing one isn’t making any sense and so on. That tends to stay inside one’s head, if one is a human.
But, precisely because this is hard for LLMs (and so doesn’t come for free with better language modeling, or comes very slowly relative to other capabilities), it ought to be attacked as a research problem unto itself.
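(Coming back to the CoT item above: here is a minimal sketch of the kind of “reliably triggered CoT mode” I have in mind. The OpenAI Python client is used purely as an illustration, and the trigger phrase is a made-up placeholder, not something any provider actually guarantees will work.)

```python
# Minimal sketch: a fixed "trigger" suffix that should reliably switch the model
# into step-by-step reasoning. Model name and trigger phrase are placeholders.
from openai import OpenAI

client = OpenAI()

COT_TRIGGER = "Reason step by step, starting from narrow claims, before stating your conclusion."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # stand-in for whatever cheap model the pipeline uses
    messages=[
        {"role": "system", "content": "Follow the task instructions exactly. " + COT_TRIGGER},
        {"role": "user", "content": "<task instance goes here>"},
    ],
)
print(response.choices[0].message.content)
```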
I get the sense that much of this is downstream from the tension between the needs of “chat users”—people who are talking to these systems as chatbots, turn by turn—and the needs of people like me who are writing applications or data processing pipelines or whatever.
People like me are not “chatting” with the LLM. We don’t care about its personality or tone, or about “harmlessness” (typically we want to avoid refusals completely). We don’t mind verbosity, except insofar as it increases cost and latency (very often there is literally no one reading the text of the responses, except for a machine that extracts little pieces of them and ignores the rest).
But we do care about the LLM’s ability to perform fundamentally easy but “oddly shaped” and extremely specific business tasks. We need it to do such tasks reliably, in a setting where there is no option to write customized follow-up messages to clarify and explain things when it fails. We also care a lot about cost and latency, because we’re operating at scale. It is painful to have to spend tokens on few-shot examples when I just know the model could 0-shot the task in principle. And no, I can’t just switch to the most powerful available model the way all the chat users do; those costs really add up. Yes, GPT-4-insert-latest-suffix-here is much cheaper than GPT-4 at launch, but no, it is still not worth the extra money at scale, not for many things at least.
It seems like what happened was:
The runaway success of ChatGPT caused a lot of investment in optimizing inference for ChatGPT-like models
As a result, no matter what you want to do with an LLM, a ChatGPT-like model is the most cost-effective option
Social dynamics caused all providers “in the game” to converge around this kind of model—everyone wants to prove that, yes, they are “in the game,” meaning they have to compare their model to the competition in an apples-to-apples manner, meaning they have to build the same kind of thing that the competition has built
It’s mid-2024, there are a huge number of variants of the Anthropic HHH Assistant bot with many different options at each of several price/performance points, and absolutely nothing else
It turns out the Anthropic HHH Assistant bot is not ideal for a lot of use cases, there would be tremendous value in building better models for those use cases, but no one (?) is doing this because … ?? (Here I am at a loss. If your employer is doing this, or wants to, please let me know!)
[1] And I work on tools for businesses that are using LLMs themselves, so I am able to watch what various others are doing in this area as well, and where they tend to struggle most.
it is painful to have to spend tokens on few-shot examples when I just know the model could 0-shot the task in principle
Is your situation that few-shot prompting would actually work but it is too expensive? If so, then possibly doing a general purpose few-shot prompt (e.g. a prompt which shows how to generally respond to instructions) and then doing context distillation on this with a model like GPT-3.5 could be pretty good.
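Concretely, the setup could look something like the sketch below (rough and unvetted; it assumes the OpenAI fine-tuning API, and the file names and helper are made up):

```python
# Rough sketch of context distillation: generate outputs *with* the long
# general-purpose few-shot prompt, then fine-tune the cheap model to produce
# the same outputs *without* that prompt. File names are placeholders.
import json
from openai import OpenAI

client = OpenAI()
LONG_PROMPT = open("general_fewshot_prompt.txt").read()  # the expensive shared prefix

def distill_example(task_input: str) -> dict:
    """One training example: bare task input -> output produced with the long prompt."""
    teacher_output = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": LONG_PROMPT},
            {"role": "user", "content": task_input},
        ],
    ).choices[0].message.content
    return {"messages": [
        {"role": "user", "content": task_input},          # note: no long prompt here
        {"role": "assistant", "content": teacher_output},
    ]}

with open("distill_train.jsonl", "w") as f:
    for line in open("task_inputs.txt"):
        f.write(json.dumps(distill_example(line.strip())) + "\n")

train_file = client.files.create(file=open("distill_train.jsonl", "rb"), purpose="fine-tune")
client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")
```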
I find that extremely long few-shot prompts (e.g. 5 long reasoning traces) help with ~all of the things you mentioned to a moderate extent. That said, I also think a big obstacle is just core intelligence: sometimes it is hard to know how you should reason and the models are quite dumb...
I can DM you some prompts if you are interested. See also the prompts I use for the recent ARC-AGI stuff I was doing, e.g. here.
I find that recent Anthropic models (e.g. Opus) are much better at learning from long few-shot examples than OpenAI models.
Is your situation that few-shot prompting would actually work but it is too expensive? If so, then possibly doing a general purpose few-shot prompt (e.g. a prompt which shows how to generally respond to instructions) and then doing context distillation on this with a model like GPT-3.5 could be pretty good.
Yeah, that is the situation. Unfortunately, there is a large inference markup for OpenAI finetuning—GPT-3.5 becomes 6x (input) / 4x (output) more expensive when finetuned—so you only break even if you are distilling a fairly large quantity of prompt text, and even then, this is vastly more expensive than my fantasy dream scenario where OpenAI has already done this tuning for me as part of the standard-pricing GPT-3.5 model.
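(Back-of-the-envelope, counting only the 6x input markup and ignoring output tokens; all the numbers below are made up for illustration:)

```python
# With a 6x input markup, distillation only pays off when the distilled-away
# prefix is more than ~5x the per-request input (and the 4x output markup
# pushes the break-even point even further out). Illustrative numbers only.
base = 1.0            # arbitrary cost units per input token, base GPT-3.5
ft_markup = 6.0       # finetuned input tokens cost ~6x as much

prefix_tokens = 4000  # few-shot prompt you would like to distill away
task_tokens = 1000    # per-request input that varies

cost_with_prefix = base * (prefix_tokens + task_tokens)    # 5000 units
cost_distilled = base * ft_markup * task_tokens             # 6000 units: still worse!
print(cost_with_prefix, cost_distilled)
```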
For what I’m doing, this level of expense is typically prohibitive in practice. The stuff I do typically falls into one of two buckets: either we are so stingy that even the cost of GPT-3.5 zero-shot is enough to raise eyebrows and I get asked to reduce cost even further if at all possible, or we are doing a one-time job where quality matters a lot and so GPT-4 or the like is acceptable. I realize that this is not the shape of the tradeoff for everyone, but this is a situation that one can be in, and I suspect I am not the only one in it.
Anyway, I didn’t mean to say that today’s models can’t be cajoled into doing the right thing at all. Only that it requires time and effort and (more importantly) added expense, and I’m frustrated that I don’t get these generic capabilities out of the box. I shouldn’t have to do extra generic instruction tuning on top of GPT-3.5. That’s OpenAI’s job. That’s supposed to be the product.
EDIT: also, I basically agree with this
That said, I also think a big obstacle is just core intelligence: sometimes it is hard to know how you should reason and the models are quite dumb...
but I also wonder whether this framing is too quick to accept the limitations of the HHH Assistant paradigm.
For instance, on complex tasks, I find that GPT-4 has a much higher rate of “just doing it right on the first try” than GPT-3.5. But in the occasional cases where GPT-4 starts off on the wrong foot, it stays on the wrong foot, and ends up sounding as confused/dumb as GPT-3.5.
Which makes me suspect that, if GPT-3.5 were tuned to be more robust to its own mistakes, along the lines of my final bullet point in OP (I’m imagining things like “whoops, I notice I am confused, let me repeat back my understanding of the task in more detail”), it would be able to close some of the distance to GPT-4. The situation is less “GPT-4 is smart enough to do the task” and more “both models are extremely non-robust and one slight misstep is all it takes for them to fail catastrophically, but GPT-4 is smart enough to compensate for this by simply taking zero missteps in a large fraction of cases.”
Have you looked at the new Gemini ‘prompt caching’ feature, where it stores the hidden state for reuse to save the cost of recomputing multi-million token contexts? It seems like it might get you most of the benefit of finetuning. Although I don’t really understand their pricing (is that really $0.08 per completion...?) so maybe it works out worse than the OA finetuning service.
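For reference, usage is roughly the following (a sketch based on the google-generativeai SDK as of mid-2024; exact names and parameters may differ):

```python
# Sketch of Gemini context caching: pay once to cache a long shared prefix,
# then reuse it across many requests. Model and file names are placeholders.
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")

cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",   # caching requires a pinned model version
    system_instruction="Follow the task spec below exactly.",
    contents=[open("long_shared_prefix.txt").read()],
    ttl=datetime.timedelta(hours=1),
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("<one task instance here>").text)
```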
EDIT: also of some interest might be the new OA batching API, which is half off as long as you are willing to wait up to 24 hours (but probably a lot less). The obvious way would be to do something like prompt caching and exploit the fact that probably most of the requests to such an API will share a common prefix, in addition to the benefit of being able to fill up idle GPU-time and shed load.
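For the batch API, the flow is roughly as follows (sketch; each line of requests.jsonl is one self-contained request, contents elided here):

```python
# Sketch of the OpenAI Batch API: upload a JSONL file of requests and collect
# the results within the 24h completion window at the discounted rate.
from openai import OpenAI

client = OpenAI()

batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```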
Yeah, it’s on my radar and seems promising.

Given that Gemini 1.5 Flash already performs decently in my tests with relatively short prompts, and it’s even cheaper than GPT-3.5-Turbo, I could probably get a significant Pareto improvement (indeed, probably an improvement on all fronts) by switching from {GPT-3.5-Turbo + short prompt} to {Gemini 1.5 Flash + long cached prompt}. Just need to make the case it’s worth the hassle...

EDIT: oh, wait, I just found the catch.
The minimum input token count for context caching is 32,768
Obviously nice for truly long context stuff, but I’m not going to add tens of thousands of tokens to my prompt just for the privilege of using this feature.
Yeah, I was thinking that you might be able to fill the context adequately, because otherwise you would have to be in an awkward spot where you have too many examples to cheaply include them in the prompt to make the small cheap models work out, but also still not enough for finetuning to really shine by training a larger high-end model over millions of tokens to zero-shot it.