esp. since GPT-3’s 0-shot learning looks like mesa-optimization
Could you provide more details on this?
Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they’d like to see from GPT-3 in response to those inputs (“few-shot learning”, right? I don’t know what 0-shot learning you’re referring to.) Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?
If that’s what you’re saying… that seems unlikely to me. GPT-3 is essentially a stack of 96 transformer layers, right? So if it were doing something like gradient descent internally, how many consecutive iterations would it be capable of? It seems more likely to me that GPT-3 simply learns sufficiently rich internal representations: when the input/output examples are within its context window, it picks up their input/output structure and forms a sophisticated enough conception of that structure that the word scoring highest under next-word prediction is one that comports with it.
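(To pin down what I mean by the examples being within its context window, here is a hypothetical illustration: the demonstrations only ever appear as tokens in the prompt, the weights stay frozen, and the model just continues the text. The toy task and the commented-out model call are made up.)

```python
# Hypothetical illustration: "in-context" examples are just text in the prompt.
prompt = (
    "english: cheese -> french: fromage\n"   # demonstration 1
    "english: house -> french: maison\n"     # demonstration 2
    "english: cat -> french:"                # new query; the model continues from here
)
# completion = language_model.complete(prompt)  # placeholder API: one forward pass per
#                                               # generated token, no weight updates
```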
96 transformer layers would appear to offer a very limited budget for any kind of serial computation, but there’s a lot of parallel computation going on there, and there are non-gradient-descent optimization algorithms (genetic algorithms, say) that can be parallelized. I guess the query matrix could be used to implement some kind of fitness function? It would be interesting to try some kind of layer-wise pretraining on transformer blocks, training them to compute steps of a parallelizable optimization algorithm (probably you’d want a deterministic parallelizable algorithm rather than a stochastic one like a genetic algorithm). Then you could look at the resulting network and try to figure out from it what the telltale signs of a mesa-optimizer are, since that network is almost certainly implementing a mesa-optimizer.
Still, my impression is that you need 1000+ generations to get interesting results with genetic algorithms, which seems like a lot of serial computation relative to GPT-3’s budget...
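To make that layer-wise pretraining idea concrete, here’s a rough sketch of the kind of experiment I have in mind (all sizes and hyperparameters are arbitrary, and gradient descent on a fixed quadratic is just a stand-in for the deterministic, parallelizable algorithm):

```python
# Sketch: train one transformer block to imitate a single step of a deterministic
# iterative optimizer (here, gradient descent on ||Ax - b||^2 for a fixed A).
# Stacking k trained copies would then amount to k optimization steps, and the
# trained weights give you a concrete object to inspect for telltale signs of a
# mesa-optimizer.
import torch
import torch.nn as nn

d, inner_lr = 8, 0.05
A = torch.randn(d, d) / d**0.5               # fixed problem instance

def gd_step(x, b):
    # the target computation: one exact gradient step on ||Ax - b||^2
    return x - inner_lr * 2 * (x @ A.T - b) @ A

block = nn.TransformerEncoderLayer(d_model=d, nhead=2, dim_feedforward=64,
                                   dropout=0.0, batch_first=True)
opt = torch.optim.Adam(block.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.randn(256, d)                  # batch of current iterates
    b = torch.randn(256, d)                  # batch of problem targets
    seq = torch.stack([x, b], dim=1)         # feed (x, b) as a 2-token "sequence"
    pred = block(seq)[:, 0, :]               # read the updated iterate off token 0
    loss = nn.functional.mse_loss(pred, gd_step(x, b))
    opt.zero_grad(); loss.backward(); opt.step()

print(float(loss))                           # final training loss
```

If the block learns the step reasonably well, you could then stare at its attention and MLP weights and ask what, if anything, distinguishes it from an ordinary pretrained block.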
Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they’d like to see from GPT-3 in response to those inputs (“few-shot learning”, right? I don’t know what 0-shot learning you’re referring to.)
No, that’s zero-shot. Few-shot is when you train on those examples instead of just stuffing them into the context.
It looks like mesa-optimization because it seems to be doing something like learning about new tasks or new prompts that are very different from anything it’s seen before, without any training, just based on the context (0-shot).
Is your claim that GPT-3 succeeds at this sort of task by doing something akin to training a model internally?
By “training a model”, I assume you mean an ML model (as opposed to, e.g., a world model). Yes, I am claiming something like that, but learning vs. inference is a blurry line.
I’m not saying it’s doing SGD; I don’t know what it’s doing to solve these new tasks. But to be clear, 96 steps of gradient descent could be a lot: MAML does meta-learning with a single inner gradient step.
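To illustrate the MAML point, here’s a minimal sketch of that single inner gradient step, on a sine-regression toy problem like the one in the MAML paper (the network size and learning rates are arbitrary):

```python
# MAML in miniature: meta-train an initialization such that ONE inner gradient step
# on a handful of examples from a new task already fits that task.
import torch

# tiny MLP with parameters as explicit tensors, so the inner update stays differentiable
params = [torch.randn(40, 1) * 0.1, torch.zeros(40),
          torch.randn(1, 40) * 0.1, torch.zeros(1)]
for p in params:
    p.requires_grad_(True)

def forward(params, x):
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1.T + b1) @ w2.T + b2

meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.01

for it in range(10000):
    # sample a task: regress y = amp * sin(x + phase)
    amp, phase = torch.rand(1) * 4 + 0.1, torch.rand(1) * 3.14
    x_s = torch.rand(10, 1) * 10 - 5; y_s = amp * torch.sin(x_s + phase)
    x_q = torch.rand(10, 1) * 10 - 5; y_q = amp * torch.sin(x_q + phase)

    # inner loop: exactly one gradient step on the support set, kept differentiable
    loss_s = ((forward(params, x_s) - y_s) ** 2).mean()
    grads = torch.autograd.grad(loss_s, params, create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(params, grads)]

    # outer loop: update the initialization so the one-step-adapted model fits the query set
    loss_q = ((forward(adapted, x_q) - y_q) ** 2).mean()
    meta_opt.zero_grad(); loss_q.backward(); meta_opt.step()
```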
Thanks!