One question that occurred to me, reading the extended GPT-generated text. (Probably more a curiosity question than a contribution as such...)
To what extent does text generated by GPT-simulated ‘agents’, then published on the internet (where it may be used in a future dataset to train language models), create a feedback loop?
Two questions that I see as intuition pumps on this point:
Would it be a bad idea to recursively ask GPT-n “You’re a misaligned agent simulated by a language model and your name is [unique identifier]. What would you like to say, knowing that the text you generate will be used in training future GPT-n models, to try to influence that process?” then use a dataset including that output in the next training process? What if training got really cheap and this process occurred billions of times?
My understanding is that language models are drawing on the fact that the existing language corpus is shaped by the underlying reality—and this is why they seem to describe reality well, capture laws and logic, agentic behaviour etc. This works up until ~2015, when the corpus of internet text begins to include more text generated only by simulated writers. Does this potentially degrade the ability of future language models to model agents, perform logic etc? Since their reference pool of content is increasingly (and often unknowably) filled with text generated without (or with proportionally much less) reference to underlying reality? (Wow, who knew Baudrillard would come in handy one day?)
I think this is a legitimate problem which we might not be inclined to take as seriously as we should because it sounds absurd.
Would it be a bad idea to recursively ask GPT-n “You’re a misaligned agent simulated by a language model (...) if training got really cheap and this process occurred billions of times?
Yes. I think it’s likely this would be a very bad idea.
when the corpus of internet text begins to include more text generated only by simulated writers. Does this potentially degrade the ability of future language models to model agents, perform logic etc?
My concern with GPT-generated text appearing in future training corpora is not primarily that it will degrade the quality of its prior over language-in-the-wild (well-prompted GPT-3 is not worse than many humans at sound reasoning; near-future GPTs may be superhuman and actually raise the sanity waterline), but that
contact with reality is a concern if you’re relying on GPT to generate data, esp. recursively, for some OOD domain, esp. if the intent is to train GPT to do something where it’s important not to be deluded (like solve alignment)
GPT will learn what GPTs are like and become more likely to “pass the mirror test” and interpret its prompt as being written by a GPT and extrapolate that instead of / in conjunction with modeling possible humans, even if you don’t try to tell it it’s GPT-n.
For the moment, I’ll only address (2).
Current GPTs’ training data already includes text generated by more primitive bots, lies, and fiction. Future simulators will learn to model a distribution that includes humanlike or superhuman text generated by simulated writers. In a sense, what they learn will be no more disconnected from an underlying reality than what current GPTs learn; it’s just that the underlying reality now includes simulated writers. Not only will there be GPT-generated text, there will be discussions and predictions about GPTs. GPT-n will learn what text generated by GPT-<n is like, what people say about GPT-<n and expect of GPT-n+.
When GPT predicts next-token probabilities, it has indexical uncertainty over the process which has generated the text (e.g. the identity of the author, the author’s intentions, whether they’re smart or stupid, the previous context). The “dynamics” of GPT is not a simulation of just one process, like a particular human writing a blog post, but of a distribution over hypotheses. When a token is sampled, the evidence it provides shifts the distribution.
Now the hypothesis that a prompt was written by a GPT is suggested by the training prior. This hypothesis is consistent with pretty much any piece of text. It is especially consistent with text that is, in fact, written by a GPT.
Sometimes, GPT-3 outputs some characteristic-degenerate-LM-shenanigans like getting into a loop, and then concludes the above text was generated by GPT-2. (It’s lucky if it says it outright [and in doing so stops simulating GPT-2], instead of just latently updating on being GPT-2) This is a relatively benign case where the likely effect is for GPT-3 to act stupider.
If GPT-n rightly hypothesizes/concludes that the prompt was written by GPT-n, rather than GPT-[n-1]… then it’s predicting according to its extrapolation of GPT scaling. Undefined extrapolations are always in play with GPTs, but this concept is particularly concerning, because
it may convergently be in play regardless of the initial prompt, because GPT-as-an-author is a universally valid hypothesis for GPT-generated contexts, and as long as GPT is not a perfect simulator, it will tend to leak evidence of itself
it involves simulating the behavior of potential (unaligned) AGI
it’s true, and so may cause simulacra to become calibrated
I’m not sure what (2) is getting at here. It seems like if a simulator noticed that it was being asked to simulate an (equally smart or smarter) simulator, then “simulate even better” seems like a fixed point. In order for it to begin behaving like an unaligned agentic AGI (without e.g. being prompted to take optimal actions a la “Optimality is the Tiger and Agents are its Teeth”), it first needs to believe that limn→∞GPT-n is an agent, doesn’t it? Otherwise this simulating-fixed-point seems like it might cause this self-awareness to be benign.
One question that occurred to me, reading the extended GPT-generated text. (Probably more a curiosity question than a contribution as such...)
To what extent does text generated by GPT-simulated ‘agents’, then published on the internet (where it may be used in a future dataset to train language models), create a feedback loop?
Two questions that I see as intuition pumps on this point:
Would it be a bad idea to recursively ask GPT-n “You’re a misaligned agent simulated by a language model and your name is [unique identifier]. What would you like to say, knowing that the text you generate will be used in training future GPT-n models, to try to influence that process?” then use a dataset including that output in the next training process? What if training got really cheap and this process occurred billions of times?
My understanding is that language models are drawing on the fact that the existing language corpus is shaped by the underlying reality—and this is why they seem to describe reality well, capture laws and logic, agentic behaviour etc. This works up until ~2015, when the corpus of internet text begins to include more text generated only by simulated writers. Does this potentially degrade the ability of future language models to model agents, perform logic etc? Since their reference pool of content is increasingly (and often unknowably) filled with text generated without (or with proportionally much less) reference to underlying reality? (Wow, who knew Baudrillard would come in handy one day?)
I think this is a legitimate problem which we might not be inclined to take as seriously as we should because it sounds absurd.
Yes. I think it’s likely this would be a very bad idea.
My concern with GPT-generated text appearing in future training corpora is not primarily that it will degrade the quality of its prior over language-in-the-wild (well-prompted GPT-3 is not worse than many humans at sound reasoning; near-future GPTs may be superhuman and actually raise the sanity waterline), but that
contact with reality is a concern if you’re relying on GPT to generate data, esp. recursively, for some OOD domain, esp. if the intent is to train GPT to do something where it’s important not to be deluded (like solve alignment)
GPT will learn what GPTs are like and become more likely to “pass the mirror test” and interpret its prompt as being written by a GPT and extrapolate that instead of / in conjunction with modeling possible humans, even if you don’t try to tell it it’s GPT-n.
For the moment, I’ll only address (2).
Current GPTs’ training data already includes text generated by more primitive bots, lies, and fiction. Future simulators will learn to model a distribution that includes humanlike or superhuman text generated by simulated writers. In a sense, what they learn will be no more disconnected from an underlying reality than what current GPTs learn; it’s just that the underlying reality now includes simulated writers. Not only will there be GPT-generated text, there will be discussions and predictions about GPTs. GPT-n will learn what text generated by GPT-<n is like, what people say about GPT-<n and expect of GPT-n+.
When GPT predicts next-token probabilities, it has indexical uncertainty over the process which has generated the text (e.g. the identity of the author, the author’s intentions, whether they’re smart or stupid, the previous context). The “dynamics” of GPT is not a simulation of just one process, like a particular human writing a blog post, but of a distribution over hypotheses. When a token is sampled, the evidence it provides shifts the distribution.
Now the hypothesis that a prompt was written by a GPT is suggested by the training prior. This hypothesis is consistent with pretty much any piece of text. It is especially consistent with text that is, in fact, written by a GPT.
Sometimes, GPT-3 outputs some characteristic-degenerate-LM-shenanigans like getting into a loop, and then concludes the above text was generated by GPT-2. (It’s lucky if it says it outright [and in doing so stops simulating GPT-2], instead of just latently updating on being GPT-2) This is a relatively benign case where the likely effect is for GPT-3 to act stupider.
If GPT-n rightly hypothesizes/concludes that the prompt was written by GPT-n, rather than GPT-[n-1]… then it’s predicting according to its extrapolation of GPT scaling. Undefined extrapolations are always in play with GPTs, but this concept is particularly concerning, because
it may convergently be in play regardless of the initial prompt, because GPT-as-an-author is a universally valid hypothesis for GPT-generated contexts, and as long as GPT is not a perfect simulator, it will tend to leak evidence of itself
it involves simulating the behavior of potential (unaligned) AGI
it’s true, and so may cause simulacra to become calibrated
ikr?
I’m not sure what (2) is getting at here. It seems like if a simulator noticed that it was being asked to simulate an (equally smart or smarter) simulator, then “simulate even better” seems like a fixed point. In order for it to begin behaving like an unaligned agentic AGI (without e.g. being prompted to take optimal actions a la “Optimality is the Tiger and Agents are its Teeth”), it first needs to believe that limn→∞GPT-n is an agent, doesn’t it? Otherwise this simulating-fixed-point seems like it might cause this self-awareness to be benign.