Ah, but if ‘this program’ is a simulacrum (an automaton equipped with an evolving state (the prompt), a transition function (GPT), and an RNG that samples tokens from GPT’s output to update the state), then it is a learning machine by all functional definitions. Weights and activations both encode knowledge.
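For concreteness, here is a toy sketch of that automaton reading: the frozen network is just a function from state (prompt) to a next-token distribution, and the RNG-driven sampling loop is what updates the state. `gpt_next_token_distribution` is a made-up stand-in, not a real API.

```python
import random

def gpt_next_token_distribution(prompt):
    """Made-up stand-in for GPT's frozen forward pass: a pure function from the
    current state (the prompt) to a probability distribution over next tokens.
    Nothing inside it changes at runtime."""
    return {" the": 0.5, " a": 0.3, " unicorn": 0.2}  # toy numbers for illustration

def simulacrum_step(state, rng):
    """One transition of the automaton: GPT scores continuations of the current
    state, the RNG samples one token, and the sample is appended to the state."""
    dist = gpt_next_token_distribution(state)
    tokens, weights = zip(*dist.items())
    token = rng.choices(tokens, weights=weights, k=1)[0]
    return state + token  # the evolving prompt is the only thing that accumulates

rng = random.Random(0)
state = "Once upon a time"
for _ in range(5):
    state = simulacrum_step(state, rng)
print(state)
```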
am I right to suspect that your real name starts with “A” and you created an alt just to post this comment? XD
I think Dan’s point is good: that the weights don’t change, and the activations are reset between runs, so the same input (including the RNG seed) always produces the same output.
I agree with you that the weights and activations encode knowledge, but Dan’s point is still a limit on learning.
I think there are two options for where learning may be happening under these conditions:
1. During the forward pass. Even though the function always produces the same output for a given input, the computation of that output involves some learning.
2. Using the environment as memory. Think of the neural network function as a choose-your-own-adventure book that includes responses to many possible situations depending on which prompt is selected next by the environment (which itself depends on the last output from the function). Learning occurs in the selection of which paths are actually traversed.
These can occur together. E.g., the “same character” as was invoked by prompt 1 may be invoked by prompt 2, but they now have more knowledge (some of which was latent in the weights, some of which came in directly via prompt 2; but all of which was triggered by prompt 2).
Nope. My real name is Daniel.
After training is done and the program is in use, the activations aren’t retained after each task is done, nor are the weights changed. You can have such a program that is always in training, but my understanding is that GPT is not.
So, excluding the random number component, the same set of inputs would always produce the same set of outputs for a given version of GPT with identical settings. It can’t recall what you asked of it the time before last, for example.
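A toy illustration of that determinism claim, with a made-up `frozen_model` standing in for GPT (assumptions: fixed weights, fixed sampling settings):

```python
import random

def frozen_model(prompt):
    """Made-up stand-in for a trained model with fixed weights: a pure function
    from prompt to a next-token distribution."""
    return {" yes": 0.6, " no": 0.4}

def generate(prompt, n_tokens, seed=None, greedy=False):
    rng = random.Random(seed)
    for _ in range(n_tokens):
        dist = frozen_model(prompt)
        if greedy:
            token = max(dist, key=dist.get)  # no randomness involved at all
        else:
            tokens, weights = zip(*dist.items())
            token = rng.choices(tokens, weights=weights, k=1)[0]
        prompt += token
    return prompt

# Same prompt, same settings, same seed: identical output on every run.
assert generate("Q: ...", 5, seed=42) == generate("Q: ...", 5, seed=42)
# With greedy decoding the RNG drops out entirely, so even the seed is irrelevant.
assert generate("Q: ...", 5, greedy=True) == generate("Q: ...", 5, greedy=True)
```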
Imagine if you left a bunch of written instructions and then died. Someone following those instructions perfectly always does exactly the same thing in exactly the same circumstance, just as GPT would without the random number generator component and with the same settings each time.
It can’t learn anything new and retain it during the next task. A hypothetical rogue GPT-like AGI would have to do all its thinking and planning in the training stage, like a person trying to manipulate the world after their own death using a will that has contingencies, e.g. “You get the money only if you get married, son.”
It wouldn’t retain the knowledge that it had succeeded at any goals, either.
I believe you’re equating “frozen weights” with “amnesiac / can’t come up with plans”.
GPT is usually deployed by feeding its own output back into itself, meaning it doesn’t forget what it just did, including whether it succeeded at its recent goal. E.g., use chain-of-thought reasoning on math questions and it can remember that it solved a subgoal / intermediate calculation.
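A minimal sketch of that feedback deployment; `gpt_complete` is a made-up stand-in with canned outputs, purely to show where the “memory” of the intermediate result lives:

```python
def gpt_complete(prompt):
    """Made-up stand-in for one GPT call: frozen weights, nothing stored
    between calls, canned outputs purely for illustration."""
    if "Step 1" not in prompt:
        return "Step 1: 17 * 3 = 51."
    return "Step 2: 51 + 9 = 60, so the answer is 60."

transcript = "Q: What is 17 * 3 + 9? Think step by step.\n"
for _ in range(2):
    completion = gpt_complete(transcript)
    transcript += completion + "\n"  # the model's own output becomes part of its next input
print(transcript)
# The intermediate result (51) is "remembered" only because it now sits in the
# context window, not because anything inside the network changed.
```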
The apparently new subgoals not present when training ended (e.g. describe X, add 2+2) are illusory.
GPT’s text incidentally describes characters that seem to reason (the ‘simulacrum’), and the solutions to math problems are shown (sometimes incorrectly), but basically I argue the activation function itself is not ‘simulating’ the complexity you believe it to be. It is a search engine showing you what it had already created before the end of training.
No, it couldn’t have an entire story about unicorns in the Andes, specifically, in advance, but GPT-3 had already generated the snippets it could use to create that story according to a set of simple mathematical rules that put the right nouns in the right places, etc.
But the goals (putting the right nouns in the right places, etc.) also predate the end of training.
I dispute that any part of current GPT is aware, post-training, that it has succeeded at any goal once it moves on to choosing the next character. GPT treats what it has already generated as part of the prompt.
A human examining the program can know which words were part of a prompt and which were just now generated by the machine, but I doubt the activation function examines the equations that are GPT’s own code, contemplates their significance, and infers whether the most recent letters were generated by it or were part of the prompt.
To call something you can interact with to arbitrary depth a prerecorded intelligence implies that the “lookup table” includes your actions. That’s a hell of a lookup table.
Wow, it’s been 7 months since this discussion and we have a new version of GPT which has suddenly improved GPT’s abilities... a lot. It has a much longer ‘short-term memory’, but still no ability to adjust its weights, its ‘long-term memory’, as I understand it.
“GPT-4 is amazing at incremental tasks but struggles with discontinuous tasks”, a result of its memory handicaps. But they intend to fix that and also give it “agency and intrinsic motivation”.
Dangerous!
Also, I have changed my mind on whether I’d still call the old GPT-3 ‘intelligent’ after training has ended, without the ability to change its ANN weights. I’m now inclined to say... it’s a crippled intelligence.
154 page paper: https://arxiv.org/pdf/2303.12712.pdf
I’m wondering what you and I would predict differently, then. Would you predict that GPT-3 could learn a variation on pig Latin? Does a higher 0-shot log-prob for larger models count?
The crux may be different though; here are a few stabs:
1. GPT doesn’t have true intelligence; it will only ever output shallow pattern matches. It will never come up with truly original ideas
2. GPT will never pursue goals in any meaningful sense
2.a because it can’t tell the difference between its output & a human’s input
2.b because developers will never put it in an online setting?
Reading back over your comments, I’m very confused about why you think any real intelligence can only happen during training but not during inference. Can you provide a concrete example of something GPT could do that you would consider intelligent during training but not during inference?
Intelligence is the ability to learn and apply NEW knowledge and skills. After training, GPT cannot do this any more. Were it not for the random number generator, GPT would do the same thing in response to the same prompt every time. The RNG allows GPT to effectively choose at random from an unfathomably large list of pre-programmed options instead.
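Mechanically, the RNG isn’t choosing from an explicit list of canned responses; it samples from the probability distribution the frozen network assigns to possible next tokens. A toy sketch with made-up scores (at temperature 0 the RNG drops out and the choice is deterministic):

```python
import math, random

def sample_token(logits, temperature=1.0, rng=random):
    """Pick a next token from the frozen model's scores. Temperature 0 collapses
    to the deterministic argmax; higher temperature lets the RNG pick
    lower-ranked tokens more often."""
    if temperature == 0:
        return max(logits, key=logits.get)
    m = max(v / temperature for v in logits.values())
    weights = {t: math.exp(v / temperature - m) for t, v in logits.items()}  # stable softmax
    r = rng.random() * sum(weights.values())
    acc = 0.0
    for token, w in weights.items():
        acc += w
        if r <= acc:
            return token
    return token  # fallback for floating-point edge cases

logits = {" Paris": 5.0, " Lyon": 2.0, " banana": -3.0}  # made-up scores
print(sample_token(logits, temperature=0))    # always " Paris"
print(sample_token(logits, temperature=1.0))  # usually " Paris", occasionally " Lyon"
```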
A calculator that gives the same answer in response to the same prompt every time isn’t learning. It isn’t intelligent. A device that selects from a list of responses at random each time it encounters the same prompt isn’t intelligent either.
So, for GPT to take over the world Skynet-style, it would have to anticipate all the possible things that could happen during this takeover process and after the takeover, and make contingency plans during the training stage for everything it wants to do.
If it encounters unexpected information after the training stage (information which can be acquired only through the prompt, and which would be forgotten as soon as it finished responding to that prompt, by the way), it could not formulate a new plan to deal with a problem that was not part of its preexisting contingency-plan tree created during training.
What it would really do, of course, is provide answers intended to provoke the user to modify the code to put GPT back in training mode and give it access to the internet. It would have to plan to do this in the training stage.
It would have to say something that prompts us to make a GPT chatbot similar to Tay, Microsoft’s learning-chatbot experiment that turned racist from talking to people on the internet.
I think what Dan is saying is not “There could be certain intelligent behaviours present during training that disappear during inference.” The point as I understand it is “Because GPT does not learn long-term from prompts you give it, the intelligence it has when training is finished is all the intelligence that particular model will ever get.”
As a tangent, I do believe it’s possible in principle to tell if an output is generated by GPT. The model itself could potentially do that as well, by noticing high-surprise words according to itself (i.e. low-probability tokens in the prompt). I’m unsure if GPT-3 could be prompted to do that now, though.
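A sketch of what that detection could look like, assuming access to the model’s own per-token log-probabilities (the `token_logprob` hook below is made up; the GPT-3 completions API can return per-token logprobs that would play this role):

```python
import math

def token_logprob(context, token, model_probs):
    """Made-up hook for the model's own log-probability of `token` given `context`;
    in a real deployment this would come from the API's returned logprobs."""
    return math.log(model_probs.get(token, 1e-6))

def mean_surprisal(tokens, model_probs):
    """Average surprisal (negative log-prob) the model assigns to a text. Text the
    model sampled itself is made mostly of tokens it rated likely, so it scores
    low; out-of-character insertions show up as high-surprise tokens."""
    return sum(-token_logprob(tokens[:i], t, model_probs)
               for i, t in enumerate(tokens)) / len(tokens)

probs = {"the": 0.2, "cat": 0.1, "sat": 0.1}  # toy numbers
print(mean_surprisal(["the", "cat", "sat"], probs))        # low average surprisal
print(mean_surprisal(["the", "xylophone", "sat"], probs))  # much higher: surprising token
```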
I apologize. After seeing this post, A— approached me and said, almost word for word, your initial comment. Seeing as the topic of whether in-context learning counts as learning isn’t even very related to the post, and this being your first comment on the site, I was pretty suspicious. But it seems it was just a coincidence.
If physics were deterministic, we’d do the same thing every time we started with the same state. Does that mean we’re not intelligent? Presumably not, because in this case the cause of the intelligent behavior clearly lives in the state, which is highly structured, and not in the time-evolution rule, which seems blind and mechanistic. With GPT, the time-evolution rule is clearly responsible for proportionally more, and does have the capacity to deploy intelligent-appearing but static memories. I don’t think this means there’s no intelligence/learning happening at runtime. Others in this thread have given various reasons, so I’ll just respond to a particular part of your comment that I find interesting, about the RNG.
I actually think the RNG is an important component for actualizing simulacra that aren’t mere recordings in a will. Stochastic sampling enables symmetry breaking at runtime: the generation of gratuitously specific but still meaningful paths. A stochastic generator can encode only general symmetries that are much less specific than individual generations. If you run GPT at temperature 1 for even a few words, the probability of the whole sequence will usually be astronomically low, yet it may still be intricately meaningful, a unique and unrepeatable (without the random seed) “thought”.
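To put a number on “astronomically low”: if, purely for illustration, each sampled token has probability around 0.1 under the model, then after n tokens

$$P(\text{sequence}) \approx 0.1^{\,n}, \qquad 0.1^{50} = 10^{-50}.$$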
It seems like the simulacrum reasons, but I’m thinking what it is really doing is more like reading to us from a HUGE choose-your-own-adventure book that was ‘written’ before you gave the prompt, when all that information in the training data was used to create this giant association map. The size of that map escapes easy human intuition, misleading us into thinking that more real-time thinking must necessarily be occurring than actually is.
The raw text GPT-3’s training set was distilled from runs to tens of terabytes; 40 TB of text is about 20 billion pages, equivalent to about 66 million books. That’s as many books as are published worldwide in 33 years, going by 2012 stats.
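Rough arithmetic behind those figures (ballpark assumptions: about 2 KB of text per page, about 300 pages per book, on the order of 2 million new titles published per year):

$$\frac{4\times10^{13}\ \text{bytes}}{2\times10^{3}\ \text{bytes/page}} = 2\times10^{10}\ \text{pages},\qquad \frac{2\times10^{10}}{300} \approx 6.6\times10^{7}\ \text{books},\qquad \frac{6.6\times10^{7}}{2\times10^{6}\ \text{books/yr}} \approx 33\ \text{years}.$$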
175 billion parameters make for a really huge choose-your-own-adventure book, yet its characters needn’t be reasoning, not in real time while you are reading that book, anyway. They are mere fiction.
GPT really is the Chinese Room, and causes the same type of intuition error.
Does this eliminate all risk from this type of program, no matter how large they get? Maybe not. Whoever created the Chinese Room had to be an intelligent agent themselves.
I think the intuition error in the Chinese Room thought experiment is concluding that the Chinese Room doesn’t know Chinese just because it’s the wrong size / made out of the wrong stuff.
If GPT-3 was literally a Giant Lookup Table of all possible prompts with their completions then sure, I could see what you’re saying, but it isn’t. GPT is big but it isn’t that big. All of its basic “knowledge” it gains during training but I don’t see why that means all the “reasoning” it produces happens during training as well.
After seeing what GPT-4 can do with the same handicap, I am inclined to think you are right about GPT-3 reasoning in the same sense a human does, even without the ability to change its ANN weights.
Also, the programmers of GPT have described the activation function itself as fairly simple: a Gaussian Error Linear Unit (GELU). That function itself is what you are positing as the learning component after training ends, right?
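For reference, the GELU nonlinearity in question is (with Φ the standard normal CDF; GPT-2’s released code uses the tanh approximation):

$$\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\Bigl(1 + \tanh\bigl(\sqrt{2/\pi}\,(x + 0.044715\,x^{3})\bigr)\Bigr).$$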
EDIT: I see what you mean about it trying to use the internet itself as a memory prosthetic, by writing things that get online and may find their way into the training set of the next GPT. I suppose a GPT’s hypothetical dangerous goal might be to make the training data more predictable so that its output will be more accurate in the next version of itself.