I am left wondering whether, when GPT-3 does few-shot arithmetic, it is actually fitting something like a linear model on the in-context examples to predict the next token. I.e. the GPT-3 weights do not “know” arithmetic, but they know how to fit, and that is why they need a few examples before they can tell you the answer to 25+17: they need to work out which function of 25 and 17 to return.
This is not that crazy given my understanding of what a transformer does, which is, in some sense, to return a function of the most recent input that depends on the earlier inputs. Or am I confusing transformers with a different NN design?
Note that there could still be priors making some functions more probable than others, and some more complex cases could be plainly impossible to fit because there is no way to reach them from the meta-model that is the trained NN.
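To make the hypothesis concrete, here is a toy sketch (purely illustrative, nothing GPT-3 is actually known to do) of what “fitting a linear model on the few-shot examples” could mean: treat each prompt example as an (operand, operand, answer) triple, fit the answer as a linear function of the operands, and apply that fitted function to the new query. The example triples and the least-squares fit are my own illustrative choices.

```python
import numpy as np

# Hypothetical few-shot prompt: operand pairs and the answers shown in the prompt.
# These numbers are illustrative, not taken from any actual GPT-3 experiment.
examples = [
    (3, 4, 7),
    (10, 5, 15),
    (8, 6, 14),
]

# Design matrix [a, b, 1]: one weight per operand plus a bias term.
X = np.array([[a, b, 1.0] for a, b, _ in examples])
y = np.array([out for _, _, out in examples], dtype=float)

# Least-squares fit: the "fitter" only needs to discover w ~ [1, 1, 0]
# to reproduce addition on the examples it has seen.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Apply the fitted function to the new query 25 + 17.
prediction = np.array([25, 17, 1.0]) @ w
print(prediction)  # ~ 42.0 once the fit has locked onto addition
```

In this picture the “knowledge” lives in the fitting procedure, not in any stored fact about addition, which is why a handful of examples is needed before the query can be answered.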