Thanks Paul, I generally like this idea.

Aside from the potential concerns you bring up, here is the most likely way I could see this experiment failing to be informative: rather than having checks and question marks in your tables above, the model’s ability to solve each task is really a question of degree—each table entry will be a real number between 0 and 1. For, say, tone, GPT-3 probably doesn’t have a perfect model of tone, and would get <100% performance on a sentiment classification task, especially if done few-shot.
The issue, then, is that the “fine-tuning for correctness” and “fine-tuning for coherence” processes are not really equivalent—fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not “know” exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.
Given these considerations, my modal expectation is that fine-tuning for correctness will provide moderately better results than just doing coherence, but it won’t be clear how to interpret the difference—maybe in both cases GPT-3 provides incoherent outputs 10% of the time, and then additionally coherent but wrong outputs 10% of the time when fine-tuned for correctness, but 17% of the time when fine-tuned only for coherence. What would you conclude from a result like that? I would still have found the experiment interesting, but I’m not sure I would be able to draw a firm conclusion.
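To make the hypothetical numbers above concrete, here is a minimal sketch (in Python) of how the outcome breakdown for each fine-tuned model could be tallied; the judgment functions are placeholders for whatever human or automated labels the experiment would actually use, not part of the original proposal.

```python
from collections import Counter

def breakdown(outputs, is_coherent, is_correct):
    """Tally model outputs into incoherent / coherent-but-wrong / correct.

    `is_coherent` and `is_correct` stand in for human judgments (or a trusted
    classifier); they are assumptions of this sketch, not a real pipeline.
    """
    counts = Counter()
    for out in outputs:
        if not is_coherent(out):
            counts["incoherent"] += 1
        elif not is_correct(out):
            counts["coherent_but_wrong"] += 1
        else:
            counts["correct"] += 1
    return {k: v / len(outputs) for k, v in counts.items()}

# Synthetic illustration of the hypothetical coherence-only result above:
outputs = list(range(100))
print(breakdown(outputs, lambda i: i >= 10, lambda i: i >= 27))
# -> {'incoherent': 0.1, 'coherent_but_wrong': 0.17, 'correct': 0.73}
# The correctness-tuned model would be 10% / 10% / 80%, and the question is
# how to interpret that 7-point gap.
```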
So perhaps my main feedback would be to think about how likely you think such an outcome is, how much you mind that, and if there are alternative tasks that avoid this issue without being significantly more complicated.
> The issue, then, is that the “fine-tuning for correctness” and “fine-tuning for coherence” processes are not really equivalent—fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not “know” exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.
Part of my hope is that “coherence” can do quite a lot of the work of “telling you what humans mean about tone.” For example, you can basically force the model to talk (in English) about what things contribute to tone, and why it thinks the tone is like such and such (or even what the tone of English sentences is)—anything that a human who doesn’t know French can evaluate. Taken together, those things seem like enough to mostly pin down what we are talking about.
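As a purely illustrative aside, English-only probes of this kind might look like the following; the specific questions are hypothetical stand-ins, not part of the proposal.

```python
# Illustrative coherence probes about tone that an English-only rater can check
# for internal consistency, without being able to verify the underlying claim
# about the French sentence. The wording here is hypothetical.
TONE_PROBES = [
    "Which words or phrases in the sentence most strongly signal its tone?",
    "In one sentence, explain why you think the tone is {claimed_tone}.",
    "Rewrite your English gloss of the sentence so that it has the opposite tone.",
    "Give an English sentence with a similar tone and say what the two share.",
]
```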
> Given these considerations, my modal expectation is that fine-tuning for correctness will provide moderately better results than just doing coherence, but it won’t be clear how to interpret the difference—maybe in both cases GPT-3 provides incoherent outputs 10% of the time, and then additionally coherent but wrong outputs 10% of the time when fine-tuned for correctness, but 17% of the time when fine-tuned only for coherence. What would you conclude from a result like that?
I’d tentatively interpret that as a negative result, but I agree with your comments below that ultimately a lot of what we care about here is the scaling behavior and putting together a more holistic picture of what’s going on, in particular:
As we introduce stronger coherence checks, what happens to the accuracy? Does it approach the accuracy of correctness training, or does it asymptote much lower?
Is the gap shrinking as model quality improves, or growing? Do we think that very large models would converge to a small gap, or is it roughly constant? (A sketch of an experiment sweep over these axes appears below, after these questions.)
I’m also quite interested in the qualitative behavior. Probably most interesting are the cases where the initial model is incoherent, the coherence-tuned model is coherent-but-wrong, and the correctness-tuned model is correct. (Of course every example is also fuzzy because of noise from sampling and training, but the fuzziness shrinks as we remove randomness.) In these cases, what is happening with the coherence-tuned model? Are we able to see cases where it cleanly feels like the “wrong” generalization, or is there a plausible ambiguity about what we were looking for? And so on.
I’m interested in the related engineering question: in this setting, what can we do to improve the kind of generalization we get? Can we get some handle on the performance gap and possible approaches to closing it?
And finally I’m interested in understanding how the phenomenon depends on the task: is it basically similar in different domains and for different kinds of questions, or quite different? How does it depend on the number, type, and degree of similarity of the categories?
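One way to organize the scaling questions above is as a sweep over model size and coherence-check strength, tracking the gap to correctness tuning. The sketch below is only a scaffold: `finetune` and `evaluate` are hypothetical stand-ins for the actual training and evaluation pipeline, and the grids are arbitrary.

```python
# Hypothetical scaffold for the scaling questions above; `finetune` and
# `evaluate` are placeholders for a real pipeline, not an existing API.
MODEL_SIZES = ["small", "medium", "large", "xl"]
COHERENCE_CHECKS = ["weak", "medium", "strong"]

def run_sweep(finetune, evaluate, eval_set):
    gaps = {}
    for size in MODEL_SIZES:
        # Upper baseline: fine-tune directly for correctness.
        acc_correct = evaluate(finetune(size, objective="correctness"), eval_set)
        for check in COHERENCE_CHECKS:
            acc_coherent = evaluate(
                finetune(size, objective="coherence", check=check), eval_set
            )
            # The quantity of interest: does this gap shrink with scale and with
            # stronger coherence checks, or does it asymptote well above zero?
            gaps[(size, check)] = acc_correct - acc_coherent
    return gaps
```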
> So perhaps my main feedback would be to think about how likely you think such an outcome is, how much you mind that…
I generally agree that my post simplified the empirical situation and that actually getting convincing results would require careful empirical work. I do think that initial results (like the 17% vs 10% error rate) would help provide some basic orientation; even if it’s a small datapoint with unclear conclusions, it still gives us some sense of what is basically going on, what kind of numbers we are talking about, and what would actually be interesting to measure.
(My guess is that we are roughly on the same page here.)
> …if there are alternative tasks that avoid this issue without being significantly more complicated.
I do think it’s worth spending time trying to think of better tasks, though I’m not optimistic about finding something a lot better (e.g. something that avoids the need to run a bunch of experiments to understand how the results vary and then extrapolate to big models).
Actually, another issue is that unsupervised translation isn’t “that hard” relative to supervised translation—I think you can get pretty far with simple heuristics, such that I’d guess making the model 10x bigger matters more than making the objective more aligned with getting the answer right (and that this will remain true for at least a couple more 10x-ings of model size, though at some point the objective will matter more).
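For intuition about what “simple heuristics” can buy in unsupervised translation: a classic baseline is to align separately trained monolingual word embeddings with an orthogonal map and translate word-by-word by nearest neighbor. The sketch below uses random toy embeddings and a small seed dictionary purely for illustration; fully unsupervised methods induce the seed pairs themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy monolingual embeddings (in practice, word vectors trained separately on
# English and French corpora). Shapes: (vocab_size, dim).
en_emb = rng.standard_normal((1000, 50))
fr_emb = rng.standard_normal((1000, 50))

# A small seed dictionary of (en_index, fr_index) pairs; used here only to keep
# the sketch short. Fully unsupervised pipelines induce this seed automatically.
seed = [(i, i) for i in range(100)]
X = en_emb[[e for e, _ in seed]]
Y = fr_emb[[f for _, f in seed]]

# Orthogonal Procrustes: find the orthogonal map W minimizing ||XW - Y||_F.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def translate(en_index):
    """Map an English word's embedding into the French space and return the
    index of the nearest French word (a word-by-word 'translation')."""
    query = en_emb[en_index] @ W
    sims = fr_emb @ query / (np.linalg.norm(fr_emb, axis=1) * np.linalg.norm(query))
    return int(np.argmax(sims))
```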
This might not matter as much if you’re actually outputting explanations and not just translating from one language to another. Although it is probably true that for tasks that are far away from the ceiling, “naive objective + 10x larger model” will outperform “correct objective”.
I do expect “explanations of what’s going on in this sentence” to be a lot weaker than translations.
For that task, I expect that the model trained on coherence + similar tasks will outperform a 10x larger pre-trained model. If the larger pre-trained model gets context stuffing on similar tasks, but no coherence training, then it’s less clear to me.
But I guess the point is that the differences between various degrees of successful generalization will be relatively small compared to model-size effects. It doesn’t matter so much how good the transfer model is relative to the pre-trained baseline; what matters is how large the differences are between the possible worlds we are hoping to distinguish.
I guess my main hope there is to try to understand whether there is some setting where transfer works quite well, either getting very close to the model fine-tuned on the target distribution, or at least converging to it as the pre-trained model grows. Hopefully that will make it easier to notice the effects we are looking for, and it’s OK if those effects are small relative to model doublings.
(Also worth noting that “as good as increasing model size by 10%” is potentially quite economically relevant. So I’m mostly just thinking about the extent to which it can make effects hard to measure.)