It seems to me like you would only need to fine-tune on a dataset of like 50k diverse samples with this type of error correction built in, or to RLHF this type of error correction?
This same problem exists in the behaviour cloning literature: if you have an expert agent behaving under some policy π_expert, and you want to train another policy to copy the expert, samples from the expert policy are not enough. You need a lot of data that shows your agent how to behave when it gets out of distribution; this was the point of the DAgger paper, and in practice the data that shows the agent how to get back into distribution is significantly larger than the pure expert dataset. There are very many ways that GPT might go out of distribution, and just showing it how to come back for a small fraction of them won't be enough.
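For anyone who hasn't seen DAgger, here is a minimal sketch of its training loop, with placeholder callables (rollout_fn, expert_fn, train_fn) that I'm introducing just for illustration, not anything from the paper. The key point is that data is collected under the *learner's* state distribution and then labelled by the expert, which is exactly the "how to get back into distribution" data that pure expert demonstrations don't contain.

```python
def dagger(rollout_fn, expert_fn, train_fn, init_policy, n_iters=10):
    """Sketch of the DAgger loop (dataset aggregation for imitation learning).

    rollout_fn(policy) -> list of states visited when running `policy`
    expert_fn(state)   -> the expert's action in that state
    train_fn(dataset)  -> a new policy fit to (state, expert_action) pairs
    """
    dataset = []            # aggregated (state, expert_action) pairs
    policy = init_policy
    for _ in range(n_iters):
        # Roll out the *current learner*, so we collect the states it
        # actually visits, including ones the expert alone never reaches.
        states = rollout_fn(policy)
        # Ask the expert what it would have done in each of those states.
        dataset += [(s, expert_fn(s)) for s in states]
        # Retrain the learner on the aggregated dataset.
        policy = train_fn(dataset)
    return policy
```

Note that most of the aggregated dataset ends up being correction data from off-expert states, which is why it grows so much larger than the original expert demonstrations.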
I have not read the paper you link, but my expectation is that the limitation of imitation learning is proved there in a context that lacks richness compared to imitating language.
My intuition: I have myself experienced failing to learn just by imitating an expert playing a game as well as possible. But if someone explains their actions to me, I can then learn something.
Language is flexible and recursive: you can in principle represent anything in the real world in language, including language itself and how to think. If the learner somehow manages to tap into that recursiveness, it can shortcut the levels. It will learn how to act meaningfully not because it has covered all possible examples of long-term sequences that lead to a goal, but because it has seen many schemes that map to how the expert thinks.
I cannot learn chess efficiently by observing a grandmaster play many matches and jotting down all the moves. I could do it if the grandmaster were a short program implemented in chess moves.