As I understand it, OpenAI argue that GPT-3 is a mesa-optimizer (though not in those terms) in the announcement paper Language Models are Few-Shot Learners. (Search for “meta”.) (edit: Might have been in another paper. I’ve seen this argued somewhere, but I might have the wrong link :( ) Paraphrased: the model has been shown so many examples of the form “here are some examples that imply a class; is X an instance of the class? Yes/no” that, instead of memorizing the answers to all the questions, it has acquired a general skill for abstracting at runtime (over its context window). So while gradient descent is trying to teach the network a series of classes, the network might instead pick up feature learning itself as a skill, and start running its own learning algorithm over just the context window.
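To make the "examples that imply a class" prompt shape concrete, here is a minimal Python sketch. The helper name `build_few_shot_prompt` and the "Item: / In class?" layout are my own assumptions for illustration, not a format taken from the paper, and no model is actually called.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, bool]], query: str) -> str:
    """Format labeled examples plus one unlabeled query as a single prompt.

    Hypothetical helper for illustration; the layout is an assumption,
    not the format used in any particular paper.
    """
    blocks = []
    for item, is_member in examples:
        label = "Yes" if is_member else "No"
        blocks.append(f"Item: {item}\nIn class? {label}")
    # The query is left unanswered; a language model would be asked to continue it.
    blocks.append(f"Item: {query}\nIn class?")
    return "\n\n".join(blocks)


# The class ("fruit") is never named; it is only implied by the labels.
examples = [("apple", True), ("hammer", False), ("banana", True), ("chair", False)]
prompt = build_few_shot_prompt(examples, "cherry")
print(prompt)
```

Nothing in this snippet learns anything. The point is only that the weights stay fixed while the prompt changes, so whatever works out the implied class has to happen inside the forward pass over the context window.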