I really appreciated the degree of clarity and the organization of this post.
I wonder how much the slope of L(D) is a consequence of the structure of the dataset, and whether we have much power to meaningfully shift the nature of L(D) for large datasets. A lot of the structure of language is very repetitive, and once it is learned, the model doesn’t learn much from seeing more examples of the same sort of thing. But buried within the dataset are very rare instances of important concept classes. (In other words, the Common Crawl data has a certain perplexity, and that perplexity is a function of both how much of the dataset is easy/broad/repetitive/generic and how much is hard/narrow/unique/specific.) For example: I can’t, for the life of me, get GPT-3 to give correct answers on the following type of prompt:
You are facing north. There is a house straight ahead of you. To your left is a mountain. In what cardinal direction is the mountain?
No matter how much priming I give or how I reframe the question, GPT-3 tends to either give a basically random cardinal direction or just repeat whatever direction I mentioned in the prompt. If you can figure out how to do it, please let me know, but as far as I can tell, GPT-3 really doesn’t understand how to do this. I think this is just an example of the sort of thing that occurs so infrequently in the dataset that the model hasn’t learned the abstraction. However, I strongly suspect that if there were some corner of the Internet where people wrote a lot about the cardinal directions of things relative to a specified observer, GPT-3 would learn it.
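To put the parenthetical above into symbols (a toy framing of my own, not anything from the post): if you split the corpus into a common/repetitive slice and a rare/specific slice, the dataset loss is just a weighted mixture of the two, and once the common slice is near its floor, the slope of L(D) is set almost entirely by the rare slice.

```latex
% f_c, f_r: token fractions of the common and rare slices (f_c + f_r = 1)
\[
L(D) = f_c \, L_{\mathrm{common}}(D) + f_r \, L_{\mathrm{rare}}(D),
\qquad
\frac{dL}{dD} \approx f_r \, \frac{dL_{\mathrm{rare}}}{dD}
\quad \text{once } L_{\mathrm{common}}(D) \text{ has flattened out.}
\]
```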
It also seems that one of the important things that humans do but transformers do not is actively seek out more surprising subdomains of the learning space. The big breakthrough in transformers was attention, but currently the attention is only within-sequence, not across-dataset. What does L(D) look like if the model is empowered to notice, while training, that its loss on sequences involving words like “west” and “cardinal direction” is high, and then to search for and prioritize other sequences with those tokens, rather than simply churning through the next 1000 examples of sequences from which it has essentially already extracted the maximum amount of information? At a certain point, you don’t need to train it on “The man woke up and got out of {bed}”; it knew what the last token was going to be long ago.
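A rough sketch of the kind of loop I have in mind (all the names here are made up for illustration, and this is emphatically not how GPT-3 is actually trained): keep a running loss estimate per training sequence and sample the next batch in proportion to it, so saturated sequences fade out and high-loss ones get revisited.

```python
import random

class LossPrioritizedSampler:
    def __init__(self, num_examples, smoothing=0.9, floor=1e-3):
        self.est_loss = [1.0] * num_examples   # optimistic init: every sequence looks informative at first
        self.smoothing = smoothing             # EMA factor for the per-sequence loss estimates
        self.floor = floor                     # keep a nonzero chance of revisiting anything

    def sample_batch(self, batch_size):
        weights = [max(l, self.floor) for l in self.est_loss]
        return random.choices(range(len(self.est_loss)), weights=weights, k=batch_size)

    def update(self, indices, losses):
        for i, loss in zip(indices, losses):
            self.est_loss[i] = self.smoothing * self.est_loss[i] + (1 - self.smoothing) * loss

# Stand-in for "run the model and measure per-sequence loss" (hypothetical):
def fake_model_loss(i):
    return 0.1 if i % 10 else 3.0   # pretend every 10th sequence is a rare, hard one

sampler = LossPrioritizedSampler(num_examples=1000)
for step in range(100):
    batch = sampler.sample_batch(32)
    losses = [fake_model_loss(i) for i in batch]   # would be a forward pass in a real setup
    sampler.update(batch, losses)

# hard sequences should now be over-represented relative to uniform sampling
print(sum(i % 10 == 0 for i in sampler.sample_batch(1000)))
```

A real version would presumably also need importance-weighting corrections so the biased sampling doesn’t bias the gradient estimate, along the lines of prioritized experience replay.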
It would be good to know if I’m completely missing something here.
I don’t think you’re completely missing something. This is the active learning approach, which gwern also suggested; see that thread for more.

GPT-3 solves that prompt easily now. I tried it with no prompt tuning, with a simple structure (Q: and A:), and with “let’s think step by step”, and all of them gave west as the answer. The step-by-step version correctly enumerated the logic that led to the answer.
I assume you mean InstructGPT, specifically, solves that now? That’s worth noting since InstructGPT’s claim to fame is that it greatly reduces how much prompt engineering you need for various ‘tasks’ (even if it’s not too great at creative writing).
I think there’s something we could do even beyond choosing the best of the existing data points to study: we could create data generators that fill out missing domains of data using logical extrapolations. Your example is a great type of problem for such an approach.
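As a minimal sketch of what I mean, using the cardinal-direction case (the prompt template and all names here are my own invention): enumerate the facing direction and the relative position, compute the answer programmatically, and render it as text that could be mixed into the training data.

```python
# Toy data generator for the cardinal-direction example discussed above.
# Purely illustrative; the wording of the template is made up.

import itertools

COMPASS = ["north", "east", "south", "west"]             # clockwise order
OFFSET = {"straight ahead of": 0, "to the right of": 1,  # quarter-turns clockwise
          "behind": 2, "to the left of": 3}               # relative to the facing direction

def answer(facing, relation):
    return COMPASS[(COMPASS.index(facing) + OFFSET[relation]) % 4]

def make_example(facing, relation, thing="a mountain"):
    noun = thing.split()[-1]
    prompt = (f"You are facing {facing}. There is {thing} {relation} you. "
              f"In what cardinal direction is the {noun}?")
    return prompt, answer(facing, relation)

for facing, relation in itertools.product(COMPASS, OFFSET):
    prompt, ans = make_example(facing, relation)
    print(f"{prompt} -> {ans}")
```

Whether a model trained on a pile of these learns the underlying spatial relation rather than the surface template is, of course, exactly the open question.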