Regarding “thinking a problem over”—I have seen examples where GPT-3 can’t answer a question correctly off the bat, but does answer correctly when the prompt encourages it to talk through the problem, so that its own generations bias its later generations toward the right conclusion in the end. Might this undercut your argument that the limited number of layers prevents certain kinds of problem-solving that need more thought?
Yeah, maybe. Well, definitely to some extent.
I guess I would propose that “using the page as a scratchpad” doesn’t help with the operation “develop an idea, chunk it, and then build on it”. The problem is that the chunking and building-on-the-chunk have to happen sequentially. So maybe it can (barely) develop an idea, chunk it, and write it down. Then you turn the Transformer back on for the next word, with the previous writing as an additional input, and maybe it takes 30 Transformer layers just to get back to where it was, i.e. having re-internalized that concept from before. And then there aren’t enough layers left to build on it… Let alone build a giant hierarchy of new chunks-inside-chunks.
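To make that loop concrete, here is a minimal sketch of the mechanism, using GPT-2 (via the Hugging Face transformers library) as a stand-in for GPT-3 and greedy decoding for simplicity. The point is just that each new word comes from a fresh, fixed-depth forward pass whose only access to the earlier “scratchpad” is the text itself:

```python
# Sketch of the "page as scratchpad" loop: every new token comes from a fresh
# forward pass, so earlier conclusions survive only as text in the context.
# GPT-2 is a stand-in for GPT-3; greedy decoding keeps the sketch simple.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("Let's think this through step by step.",
                             return_tensors="pt")

for _ in range(50):                                  # 50 tokens, one pass each
    with torch.no_grad():
        logits = model(input_ids).logits             # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax().view(1, 1)      # greedy next-token choice
    # The model's own output is appended to the context; the next forward pass
    # must re-derive ("re-internalize") whatever that text meant, from scratch,
    # within its fixed stack of layers.
    input_ids = torch.cat([input_ids, next_id], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Nothing in this loop lets the pass that writes a chunk hand its internal state to the pass that has to build on it; only the words survive between passes.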
So I think that going more than 1 or 2 “steps” of inferential distance beyond the concepts represented in the training data requires that the new ideas get put into the weights, not just the activations.
I guess you could fine-tune the network on its own outputs, or something. I don’t think that would work, but who knows.
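For what it’s worth, the simplest version of that fine-tune-on-its-own-outputs idea might look like the sketch below (GPT-2 again as a stand-in; the sampling settings, learning rate, and step counts are arbitrary guesses, not anything from this thread):

```python
# Rough sketch of "fine-tune the network on its own outputs": sample text from
# the model, then take a few gradient steps treating that text as training data.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
prompt_ids = tokenizer.encode("Here is a new idea:", return_tensors="pt")

for cycle in range(3):
    # 1. Let the model "think on the page".
    model.eval()
    with torch.no_grad():
        page = model.generate(prompt_ids, max_length=100, do_sample=True,
                              pad_token_id=tokenizer.eos_token_id)
    # 2. Push that page back into the weights with ordinary LM training steps.
    model.train()
    for _ in range(5):
        loss = model(page, labels=page).loss   # standard next-token loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

One intuition for the skepticism above: the model already assigns fairly high probability to text it sampled itself, so it is unclear how much genuinely new structure these gradient steps would add.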
Are the remaining 66 layers (GPT-3’s 96 minus the 30 you mention) not enough to build on the concept? What if we’re talking about GPT-N rather than GPT-3, with T >> 96 total layers, such that it can use M layers to re-internalize the concept and the other T-M layers to build on it?
Aren’t our brains having to do something like that with our working memory?
I agree that, the more layers you have in the Transformer, the more steps you can take beyond the range of concepts and relations-between-concepts that are well-represented in the training data.
If you want your AGI to invent a new gadget, for example, there might be 500 insights involved in understanding and optimizing its operation—how the components relate to each other, what happens in different operating regimes, how the output depends on each component, what the edge cases are, etc. etc. And these insights are probably not particularly parallelizable; rather, you need to already understand lots of them before you can figure out more. I don’t know how many Transformer layers it takes to internalize a new concept, or how many Transformer layers you can train, so I don’t know what the limit is, only that I think there is one. Unless the Transformer has recurrence, I guess, in which case maybe all bets are off? I’d have to think about that more.
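To put toy numbers on that limit (every figure below is a made-up illustration; the 96 and 30 come from the earlier GPT-3 discussion, the rest is guesswork):

```python
# Toy version of the depth-limit argument. All numbers are illustrative only;
# nobody knows the real per-step layer cost.
TOTAL_LAYERS = 96       # GPT-3's depth (the T in the question above)
LAYERS_PER_STEP = 30    # guess: layers to internalize one new chunk that is
                        # one step of inferential distance beyond the training data

# If re-deriving a concept k steps beyond the training data costs roughly
# k * LAYERS_PER_STEP within a single forward pass, depth runs out quickly:
max_steps_per_pass = TOTAL_LAYERS // LAYERS_PER_STEP
print(max_steps_per_pass)        # 3 steps under these guesses

# ~500 sequentially dependent insights would then need ~15000 layers of depth
# in one pass -- beyond any fixed T, unless the insights move into the weights.
print(500 * LAYERS_PER_STEP)     # 15000
```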
And regarding working memory: yeah, definitely. We humans need to incorporate insights into our permanent memory / world-model before we can effectively build on them.
This is analogous to my claim that we need to somehow get new insights out of GPT-N’s activations and into its weights before it can effectively build on them.
Maybe the right model is a human and GPT-N working together. GPT-N has some glimmer of an insight, and the human “gets it”, and writes out 20 example paragraphs that rely on that insight, and then fine-tunes GPT-N on those paragraphs. Now GPT-N has that insight incorporated into its weights, and we go back, with the human trying to coax GPT-N into having more insights, and repeat.
I dunno, maybe. Just brainstorming. :-)
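The human-plus-GPT-N loop proposed above might be spelled out roughly as below; every helper function is a made-up stub standing in for a real component (a prompting setup, an actual human, a real fine-tuning run), not an existing API:

```python
# Hypothetical sketch of the human-in-the-loop cycle. Each helper is a stub.

def coax_out_glimmer(model, topic):
    # Stand-in: prompt GPT-N until it produces the germ of a new insight.
    return f"half-formed insight about {topic}"

def human_writes_examples(glimmer, n=20):
    # Stand-in: a human who "gets it" writes n paragraphs relying on the insight.
    return [f"paragraph {i} building on: {glimmer}" for i in range(n)]

def fine_tune(model, paragraphs):
    # Stand-in: a language-model fine-tuning run on the new paragraphs.
    return model + [paragraphs]   # pretend the "weights" just absorb the text

model = []                        # stand-in for GPT-N's weights
for iteration in range(5):
    glimmer = coax_out_glimmer(model, "gadget design")   # insight in activations
    paragraphs = human_writes_examples(glimmer)          # human chunks it into prose
    model = fine_tune(model, paragraphs)                 # insight moves into weights
    # Next iteration: the model can, hopefully, build on everything absorbed so far.
```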