This was the very first thing I thought of when language models came to my attention as “hey this looks like it actually might be the thing that the future looks like” (years ago). Since I’m not particularly smart or particularly well-informed, I conclude that I was not the first person to come up with this idea (or the tenth, or even the ten-thousandth). I strongly suspect that the simplest possible approach of “just turn on backprop” was tried within the first couple of days of the weights of a GPT model being available. For context, nostalgebraist-autoresponder has been live on Tumblr since late 2019.
I do concur with you that this is an important thing to explore. However, I am quite confident that “do the thing that is obvious to someone with no background encountering the field for the first time” is not an effective approach.
When I briefly looked into the academic research on this, I picked up the following general impression:
This is a task a lot of people have poured a lot of time into. Search terms are “online learning”, “incremental learning”, “continual learning”, and “active learning”.
The primary problem with this approach is that as the model learns new stuff, it forgets old stuff that it is no longer being trained on. The search term here is “catastrophic forgetting”. There are also several less-critical problems which would still be blockers if catastrophic forgetting wasn’t an issue, mostly related to the model going off the rails more and more over time—search terms include “bias amplification”, “overfitting”, and “hallucination”. Some argue that this is also a problem in humans.
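To make concrete what “just turn on backprop” means here, a minimal sketch of the naive loop (the model name, learning rate, and the choice to train on the model’s own interaction are all stand-ins I’m making up, not anything from a real deployment):

```python
# Sketch of the naive approach: serve the model as usual, but take a
# gradient step on every interaction as it happens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute whatever model you actually serve
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def serve_and_learn(prompt: str) -> str:
    # Inference as usual.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out_ids = model.generate(**inputs, max_new_tokens=50)
    full_text = tok.decode(out_ids[0], skip_special_tokens=True)  # prompt + reply

    # "Turn on backprop": one causal-LM gradient step on the whole interaction.
    batch = tok(full_text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    # Nothing here anchors the weights to what the model already knew;
    # that is exactly where catastrophic forgetting shows up.
    return full_text
```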
There have been some clever attempts to get around this. One particularly clever idea, from French and Chater (2002), is “let’s use a metric of how important the old stuff is to the network to try to get it to forget less of it”. I notice that this technique is not in use despite there being a DeepMind publication about it in 2017. Search term: “elastic weight consolidation”; for the particular failure modes, I believe “task-agnostic/task-free continual learning” and “scalability”.
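For reference, the whole trick fits in a few lines. My sketch of the penalty term from that 2017 DeepMind paper, assuming `old_params` and `fisher` have already been computed from the old data (both names are mine, for illustration):

```python
import torch

def ewc_penalty(model: torch.nn.Module, old_params: dict, fisher: dict,
                lam: float = 1000.0) -> torch.Tensor:
    """Quadratic penalty pulling each weight back toward its old value,
    scaled by how important (Fisher information) that weight was for the
    previously learned behaviour."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return (lam / 2.0) * penalty

# New-task training then minimises: task_loss + ewc_penalty(model, old_params, fisher)
# Estimating the Fisher term without clean task boundaries, and at very large
# scale, is where the failure modes mentioned above come in.
```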
Lots of people have had the idea “well humans sleep and dream, maybe something like that?”. Search terms: “knowledge distillation”, “experience replay”.
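A toy sketch of the replay version of that idea, with the buffer, batch size, and mixing ratio invented purely for illustration:

```python
import random

replay_buffer: list = []  # earlier training examples, whatever form they take

def make_mixed_batch(new_examples: list, batch_size: int = 8,
                     replay_fraction: float = 0.5) -> list:
    """Mix old examples into every batch so the old distribution is never
    fully dropped from training."""
    n_old = min(int(batch_size * replay_fraction), len(replay_buffer))
    old = random.sample(replay_buffer, n_old)
    new = new_examples[: batch_size - n_old]
    replay_buffer.extend(new)  # today's data becomes tomorrow's "old stuff"
    return old + new           # train on the mixture, never only the new data

# The distillation variant replaces stored examples with the old model's own
# outputs used as soft targets: same spirit, but no raw data is kept around.
```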
Also lots of people have had the idea “well what if we hack on an external memory store in a completely unprincipled way”. And this seems to be how AutoGPT works. Also people have tried to do it in a principled way, search term “memory-augmented neural networks”.
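A sketch of what the unprincipled version looks like, with the `embed` function left as a placeholder for any sentence-embedding model; note that the weights never change, past interactions just get pasted back into the prompt:

```python
import numpy as np

memory_texts: list[str] = []
memory_vecs: list[np.ndarray] = []

def remember(text: str, embed) -> None:
    memory_texts.append(text)
    memory_vecs.append(np.asarray(embed(text), dtype=float))

def recall(query: str, embed, k: int = 3) -> list[str]:
    """Return the k stored texts most similar to the query (cosine similarity)."""
    if not memory_vecs:
        return []
    q = np.asarray(embed(query), dtype=float)
    sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
            for v in memory_vecs]
    top = np.argsort(sims)[-k:][::-1]
    return [memory_texts[i] for i in top]

# The augmented prompt is then just: "\n".join(recall(msg, embed)) + "\n" + msg
```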
Basically, the impression I’ve picked up is:
1. This key problem seems like it’s really hard to solve in a principled way.
2. Humans are complete disaster monkeys when it comes to ML research, and as such, if there were an obvious way to write an ML program on your desktop computer that rapidly bootstraps its way to godhood, someone would have already done it.
Per 1 and 2, we totally have already “explored (publicly?)” the potential of “switching on” backpropagation/training while in inference mode (if by “we” you include “the internet” and not just “LessWrong in particular”).
I have noted the problem of catastrophic forgetting in the section “why it might not work”. In general I agree continual learning is obviously a thing; otherwise I would not have used the established terminology. What I believe, however, is that the problems we face in continual learning in e.g. a 100M-parameter BERT model may not be the same as what we observe in models that can now meaningfully self-critique. We have explored this technique publicly, but have we tried it with GPT-4? The “publicly” part was really just a question of whether OpenAI actually did it on this model or not, and it would be an amazing data point if they could say “We couldn’t get it to work.”
Ah, so the point was whether that had been explored publicly on the very largest language models that exist, because of the whole “sometimes approaches that didn’t work at small scale start working when you throw enough compute at them” thing? Makes sense.
Essentially yes, heh. I take this as a learning experience for my writing; I don’t know what I was thinking, but it is obvious in hindsight that saying to just “switch on backprop” sounds very naive.
I also confess I haven’t done the due diligence to find out what the largest model this has actually been tried with is, or whether someone has tried it with Pythia or LLaMA. I’ll do some more googling tonight.
One intuition for why the largest models might be different is that part of the training/fine-tuning going on will have to do with the model’s own output. The largest models are the ones where the model’s own output is not essentially word salad.