LLMs are trained to write text that would be maximally unsurprising if found on the internet.
This claim is false. If you look at a random text on the internet, it would be very surprising if every word in it were the most likely word to follow, given the previous words.
Kahneman’s latest book is Noise: A Flaw in Human Judgment. In it, he describes errors in human decisions as a combination of bias and noise. If you take a large sample of human decisions and build your model on it, you remove all the noise.
While a large LLM trained on all internet text keeps all the bias of that text, it can remove all the noise.
In this section, I will give a brief summary of the view that these arguments oppose, as well as provide a standard justification for this view. In short, the view is that we can reach AGI by more or less simply scaling up existing methods (in terms of the size of the models, the amount of training data they are given, and/or the number of gradient steps they take, etc).
The question of whether scaling large language models is enough might have seemed relevant a year ago, but it isn’t really today, as the strategy of the top players isn’t just scaling large language models.
The step from GPT-3 to InstructGPT and ChatGPT was not one of scaling up the size of the models or substantially increasing the amount of training data.
It was rather one of learning from well-curated data. ChatGPT itself is a project to gather a lot of data, which in turn reveals many of the errors that ChatGPT makes, and there are likely people at OpenAI currently working on ways to learn from that data.
Over at DeepMind, they have GATO, an approach that combines a large language model with other problem sets.
LLMs are trained once, on a static set of data, and after their training phase, they cannot commit new knowledge to their long-term memory.
That’s just not true for ChatGPT. ChatGPT was very fast at learning how people tricked it into producing ToS-violating content.
The Language of Thought
This, in turn, suggests a data structure that is discrete and combinatorial, with syntax trees, etc, and neural networks (according to the argument) do not use such representations. We should therefore expect neural networks to at some point hit a wall or limit to what they are able to do.
If you ask ChatGPT to multiply two 4-digit numbers, it writes out the reasoning process in natural language and comes to the right answer. ChatGPT is already decent today at using language for its reasoning process.
If you ask ChatGPT to multiply two 4-digit numbers, it writes out the reasoning process in natural language and comes to the right answer.
People keep saying such things. Am I missing something? I asked it to calculate 1024 * 2047, and the answer isn’t even close. (Though to my surprise, the first two steps are at least correct steps, and not nonsense. And it is actually adding the right numbers together in step 3, again, to my surprise. I’ve seen it perform much, much worse.) Asking the same question again even gives a completely different (but again wrong) result.
I did ask it to multiply numbers at the beginning, and it seems to behave differently now than it did 5 weeks ago and isn’t doing correct multiplications anymore. Unfortunately, I can’t access the old chats.
Interesting. I’m having the opposite experience (due to timing, apparently), where at least it’s making some sense now. I’ve seen it using tricks only applicable to addition and pulling numbers out of its ass, so I was surprised what it did wasn’t completely wrong.
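For reference, here is a minimal sketch (not from the original thread) of the correct answer and the kind of step-by-step partial-product breakdown the model is being asked to carry out:

```python
def long_multiply(a: int, b: int):
    """Multiply a by b digit-by-digit, the way the written long-multiplication method does."""
    partials = []
    for position, digit in enumerate(reversed(str(b))):
        partials.append(a * int(digit) * 10 ** position)  # one partial product per digit of b
    return partials, sum(partials)

partials, result = long_multiply(1024, 2047)
print(partials)  # [7168, 40960, 0, 2048000] -- the intermediate steps
print(result)    # 2096128 -- the answer the reasoning chain should reach
```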
If you look at a random text on the internet, it would be very surprising if every word in it were the most likely word to follow, given the previous words.
I’m not completely sure what your point is here. Suppose you have a biased coin that comes up heads with p=0.6 and tails with p=0.4. Suppose you flip it 10 times. Would it be surprising if you then got heads 10 times in a row? Yes, in a sense. But that is still the most likely individual sequence.
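A quick numerical illustration of the coin example (a sketch added here for concreteness, not part of the original comment):

```python
from itertools import product

P_HEADS, P_TAILS = 0.6, 0.4

def sequence_probability(seq):
    """Probability of one specific sequence of 'H'/'T' flips under the biased coin."""
    prob = 1.0
    for flip in seq:
        prob *= P_HEADS if flip == "H" else P_TAILS
    return prob

all_heads = "H" * 10
typical = "HHTHHTHHTH"  # a more "typical-looking" sequence with 7 heads and 3 tails

print(sequence_probability(all_heads))  # ~0.006: rare in absolute terms...
print(sequence_probability(typical))    # ~0.0018: ...but every specific alternative is rarer

# All-heads really is the single most likely sequence of length 10.
best = max(("".join(s) for s in product("HT", repeat=10)), key=sequence_probability)
assert best == all_heads
```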
The step from GPT-3 to InstructGPT and ChatGPT was not one of scaling up the size of the models or substantially increasing the amount of training data. [...] Over at DeepMind, they have GATO, an approach that combines a large language model with other problem sets.
I would consider InstructGPT, ChatGPT, GATO, and similar systems, to all be in the general reference class of systems that are “mostly big transformers, trained in a self-supervised way, with some comparably minor things added on top”.
That’s just not true for ChatGPT. ChatGPT was very fast at learning how people tricked it into producing ToS-violating content.
I’m not sure if this has been made public, but I would be surprised if this was achieved by (substantial) retraining of the underlying foundation model. My guess is that this was achieved mainly by various filters put on top. But it is possible that fine-tuning was used. Regardless, catastrophic forgetting remains a fundamental issue. There are various benchmarks you can take a look at, if you want.
If you ask ChatGPT to multiply two 4-digit numbers, it writes out the reasoning process in natural language and comes to the right answer. ChatGPT is already decent today at using language for its reasoning process.
A system can multiply two 4-digit numbers and explain the reasoning process without exhibiting productivity and systematicity to the degree that an AGI would have to. Again, the point is not quite whether or not the system can use language to reason, the point is how it represents propositions, and what that tells us about its ability to generalise (the LoT hypothesis should really have been given a different name...).
I’m not sure if this has been made public, but I would be surprised if this was achieved by (substantial) retraining of the underlying foundation model. My guess is that this was achieved mainly by various filters put on top. But it is possible that fine-tuning was used. Regardless, catastrophic forgetting remains a fundamental issue. There are various benchmarks you can take a look at, if you want.
The benchmarks tell you about what the existing systems do. They don’t tell you about what’s possible.
One of OpenAI’s current projects is to figure out how to extract valuable data for fine-tuning from the conversations that ChatGPT has.
There’s no fundamental reason why it can’t extract all the relevant information from the conversations it has and use fine-tuning to add it to its long-term memory.
When it comes to ToS violations, it seems evident, based on my interactions with it, that such a system is working. ChatGPT has basically three ways to answer: with normal text, with red text, and with custom answers that explain to you why it won’t answer your query.
Both the red-text answers and the custom answers have increased across a variety of different prompts. When it gives a red-text answer, there’s a feedback button to tell them if you think it made a mistake.
To me, it seems obvious that those red-text answers get used as training material for fine-tuning and that this helps with detecting similar cases in the future.
I would consider InstructGPT, ChatGPT, GATO, and similar systems, to all be in the general reference class of systems that are “mostly big transformers, trained in a self-supervised way, with some comparably minor things added on top”.
You could summarize InstructGPT’s lesson as “You can get huge capability gains by comparably minor things added on top”.
You can talk about how they are minor at a technical level, but that doesn’t change the fact that these minor things produce huge capability gains.
In the future, there’s also a lot of additional room to get more clever about providing training data.
The benchmarks tell you about what the existing systems do. They don’t tell you about what’s possible.
Of course. It is almost certainly possible to solve the problem of catastrophic forgetting, and the solution might not be that complicated either. My point is that it is a fairly significant problem that has not yet been solved, and that solving it probably requires some insight or idea that does not yet exist. You can achieve some degree of lifelong learning through regularised fine-tuning, but you cannot get anywhere near what would be required for human-level cognition.
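For concreteness, here is a minimal sketch of one form of regularised fine-tuning: an L2 penalty that pulls the updated weights back toward the pretrained ones (a simplified relative of methods like EWC). The model and data below are placeholders chosen for illustration, not anything any lab actually uses:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for a pretrained model (illustration only).
model = nn.Linear(16, 4)
pretrained_params = [p.detach().clone() for p in model.parameters()]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
reg_strength = 1.0  # how strongly to resist drifting away from the old weights

def fine_tune_step(inputs, targets):
    optimizer.zero_grad()
    task_loss = loss_fn(model(inputs), targets)
    # Penalise movement away from the pretrained weights; this limits
    # (but does not eliminate) catastrophic forgetting of old knowledge.
    drift = sum(((p - p_old) ** 2).sum()
                for p, p_old in zip(model.parameters(), pretrained_params))
    loss = task_loss + reg_strength * drift
    loss.backward()
    optimizer.step()
    return loss.item()

# Stand-in data for the "new" task.
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
fine_tune_step(x, y)
```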
You could summarize InstructGPT’s lesson as “You can get huge capability gains by comparably minor things added on top”.
Yes, I think that lesson has been proven quite conclusively now. I also found systems like PaLM-SayCan very convincing for this point. But the question is not whether or not you can get huge capability gains (this is evidently true); the question is whether you get close to AGI without new theoretical breakthroughs. I want to know if we are now on (and close to) the end of the critical path, or whether we should expect unforeseeable breakthroughs to throw us off course a few more times before then.
Suppose you have a biased coin that comes up heads with p=0.6 and tails with p=0.4. Suppose you flip it 10 times.
That’s a different case.
If you have a text, you can calculate for every word in it the likelihood (L_text) that it follows the preceding words. You can also calculate the likelihood (L_ideal) of the most likely word that could follow the preceding text.
L_ideal − L_text is, in Kahneman’s words, the noise. If you look at a given text, you can calculate the average of this noise over its words.
The average noise in text produced by GPT-3 is less than that of the average text on the internet. It would be surprising to encounter texts with so little noise randomly on the internet.
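A minimal sketch of that per-word noise calculation, assuming a hypothetical next_word_probs(context) function that returns a probability distribution over candidate next words (this interface is made up for illustration):

```python
def average_noise(words, next_word_probs):
    """Average of (L_ideal - L_text) over a tokenised text.

    next_word_probs(context) is assumed to return a dict mapping candidate
    next words to probabilities; this is a hypothetical interface.
    """
    total_noise = 0.0
    for i in range(1, len(words)):
        probs = next_word_probs(words[:i])
        l_text = probs.get(words[i], 0.0)  # likelihood of the word actually used
        l_ideal = max(probs.values())      # likelihood of the single most likely word
        total_noise += l_ideal - l_text
    return total_noise / max(len(words) - 1, 1)
```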
Ah, now I get your point, sorry. Yes, it is true that GPTs are not incentivised to reproduce the full data distribution, but rather, are incentivised to reproduce something more similar to a maximum-likelihood estimate point distribution. This means that they have lower variance (at least in the limit), which may improve performance in some domains, as you point out. But individual samples from the model will still have a high likelihood under the data distribution.
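One toy way to picture the variance point: sampling from the modelled distribution reproduces its spread, while always taking the single most likely word (a point-distribution style of output) has no spread at all. The three-word vocabulary below is made up for illustration:

```python
import random

next_word_probs = {"the": 0.6, "a": 0.3, "an": 0.1}  # toy next-word distribution

def sample_word():
    """Draw one word according to the modelled probabilities."""
    r, acc = random.random(), 0.0
    for word, p in next_word_probs.items():
        acc += p
        if r < acc:
            return word
    return word  # fallback for floating-point edge cases

greedy_word = max(next_word_probs, key=next_word_probs.get)  # always "the": zero variance
sampled = [sample_word() for _ in range(10_000)]             # roughly a 60/30/10 split
```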
But individual samples from the model will still have a high likelihood under the data distribution.
That’s not true for maximum-likelihood distributions in general. It’s been more than a decade since I dealt with that topic at university while studying bioinformatics, but in that domain maximum-likelihood distributions can frequently produce results that cannot appear in reality, and there are a bunch of tricks to avoid that.
To get back to the actual case of large language models, imagine there’s a complex chain of verbal reasoning. The next correct word in that reasoning chain has a higher likelihood than each of the 200 different words that could be used instead and would lead to a wrong conclusion, yet the likelihood of the correct word might be only 0.01.
A large language model might pick the right word at every step of a 1000-word reasoning chain. The resulting text would be very unlikely to appear in the real world.
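A back-of-the-envelope check using the numbers above (added here for illustration):

```python
import math

p_correct_word = 0.01  # likelihood of the single correct word at each step
chain_length = 1000    # length of the reasoning chain in words

# Probability of the whole correct chain under the data distribution,
# treating the steps as independent for the sake of the estimate.
log10_prob = chain_length * math.log10(p_correct_word)
print(f"P(chain) = 10^{log10_prob:.0f}")  # 10^-2000: essentially never found in real text
```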