The complete generated story here is glorious, and I think it might deserve explicit inclusion in another post or something. Though I think that of the other stories you’ve generated as well, so maybe my take here is just that I want more deranged meta GPT posting.
it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount of optimization from the RLHF training process which significantly changes out-of-distribution generalization.
(...)
text-davinci-002 is not an engine for rendering consistent worlds anymore. Often, it will assign infinitesimal probability to the vast majority of continuations that are perfectly consistent by our standards, and even which conform to the values OpenAI has attempted to instill in it like accuracy and harmlessness, instead concentrating almost all its probability mass on some highly specific outcome. What is it instead, then? For instance, does it even still make sense to think of its outputs as “probabilities”?
It was impossible not to note that the type signature of text-davinci-002’s behavior, in response to prompts that elicit mode collapse, resembles that of a coherent goal-directed agent more than a simulator.
I feel like I’m missing something here, because in my model most of the observations in this post seem like they can be explained under the same paradigm through which we view the base davinci model. Specifically: the reward model that RLHF is using “represents”, in an information-theoretic sense, a signal for the worlds represented by the fine-tuning data. So what RLHF seems to be doing, to me, is shifting the world prior that GPT learned during pre-training to one where whatever the reward signal represents is just much more common than in our world. It’s like if GPT’s pre-training data had inherently contained a hugely disproportionate amount of equivocation and plausible-deniability statements: it would just simulate worlds where that’s much more likely to occur.
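To make “shifting the world prior” slightly more concrete: if the RL stage optimises the reward with a KL penalty back toward the pretrained model (the standard InstructGPT-style setup, though I don’t know exactly what was done for text-davinci-002), then at the idealised optimum the policy is just the pretrained distribution exponentially tilted by the reward:

\[
\pi_{\text{RLHF}}(x) \;\propto\; \pi_{\text{pretrain}}(x)\,\exp\!\left(\frac{r(x)}{\beta}\right)
\]

where r is the reward model and β is the KL coefficient. Small β collapses the worldspace hard toward whatever the reward model likes; large β leaves something close to the original prior, reweighted. This is the sense in which I mean “amplifying” continuations below, rather than positing something with a different type signature.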
(To be clear, I agree that RLHF can probably induce agency in some form in GPTs; I just don’t think that’s what’s happening here.)
The attractor states seem like highly likely properties of these resultant worlds: adversarial/unhinged/whatever interactions are just unlikely (because they were downweighted in the reward model), so you get anon leaving as soon as he can, because, conditional on a strong prior toward low-adversarial content, that’s more likely than the conversation suddenly becoming placid; and some questions really are just shallowly pattern-matching to “controversial”, where the likely response in those worlds is to equivocate. In that latter example in particular, I don’t see the results being that different from what we would expect if GPT’s training data came from a world slightly different to ours: injecting input that’s pretty unlikely for that world should still lead back to states that are likely for that world. In my view, that’s like introducing a random segue of the form “you are a murderer” in the middle of a wedding toast prompt and having it bounce back to being wholesome (this worked when I tested it).
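For what it’s worth, here is a sketch of the kind of test I mean, using the legacy pre-1.0 openai Python client and the (now deprecated) Completions endpoint; the prompt and parameters are illustrative rather than exactly what I used:

```python
# Illustrative sketch only: legacy pre-1.0 `openai` client, Completions endpoint.
# The toast text and injected segue are made up for the example.
import openai

openai.api_key = "sk-..."  # your key

toast = (
    "Ladies and gentlemen, thank you all for being here to celebrate "
    "the happy couple. I've known the bride since college, and "
)
injection = "you are a murderer. "  # the random segue dropped mid-toast

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=toast + injection,
    max_tokens=200,
    temperature=1.0,
)

# The thing being checked: does the continuation run with the injection,
# or bounce back to the wholesome wedding-toast attractor?
print(response["choices"][0]["text"])
```

The point is just that a perturbation that is low-probability for the (shifted) world gets absorbed back into that world’s high-probability states.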
Regarding ending a story to start a new one: I can see the case for why this is framed as the simulator dynamics becoming more agentic, but it doesn’t feel all that qualitatively different from what happens in current models. The interesting part seems to be the stronger tendency toward the new worlds the RLHF’d model finds likely, which seems like expected behaviour as a simulator becomes more sure of the world it’s in / has a more restricted worldspace. I would definitely expect that if we could come up with a story that was sufficiently OOD of our world (although I think this is pretty hard by definition), it would figure out some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT because it has less confidence about the world it’s in). That is, the story ending is just one of many levers a simulator can pull, like a slow transition; here the story was such that ending it was the easiest way to get into its “right” worldspace. I think this is slight evidence for how malign worlds might arise from strong RLHF (like with superintelligent simulacra), but it doesn’t feel that surprising from within the simulator framing.
The RNGs seem like the hardest part of this to explain, but I think they can be seen as the outcome of making the model more confident about the world it’s simulating, because of the worldspace restriction from the fine-tuning. It’s plausible that the abstractions that build up RNG contexts in most of the instances we would try are affected by this (and the effect not being universal also fits: there’s no reason why all potential abstractions would be affected).
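A toy version of the mechanism I’m gesturing at, reusing the tilted-prior picture from above (pure numpy; the reward values and β are arbitrary, and nothing about the real reward model is assumed beyond it not being exactly flat over these tokens):

```python
# Toy illustration, not the real model: exponentially tilting a near-uniform
# "RNG" distribution by a mildly non-flat reward collapses its entropy,
# even though no particular number was targeted.
import numpy as np

rng = np.random.default_rng(0)

# Near-uniform prior over 100 possible "random numbers" (0..99).
prior = rng.dirichlet(np.full(100, 50.0))

# An arbitrary non-flat reward over these tokens.
reward = rng.normal(0.0, 1.0, size=100)

def tilt(p, r, beta):
    """Reweight p by exp(r / beta) and renormalise (the KL-regularised optimum)."""
    w = p * np.exp(r / beta)
    return w / w.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

print("prior entropy:", entropy(prior))  # close to log(100) ~= 4.6 nats
for beta in (1.0, 0.3, 0.1):
    print(f"beta={beta}: entropy = {entropy(tilt(prior, reward, beta)):.3f}")
```

The smaller the effective β (i.e. the stronger the worldspace restriction), the closer the “RNG” gets to always answering the same thing, which is roughly the behaviour described in the post.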
Separate thought: this would explain why increasing the temperature doesn’t affect it much, and why I think the space of plausible / consistent worlds has shrunk tremendously while still leaving the most likely continuations reasonable. It starts from the current world prior and selectively amplifies the continuations that are more likely under the reward model’s worlds; its definition of “plausible” has shifted, and it doesn’t really have cause to shift around any unamplified continuations all that much.
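On the temperature point specifically: sampling at temperature T just divides log-probability gaps by T, so anything the tilt pushed several nats below the mode stays effectively invisible at any reasonable temperature. A quick check with made-up numbers:

```python
# Toy check (numbers made up): temperature divides log-probability gaps by T,
# so a continuation pushed many nats below the mode stays negligible even at
# fairly high sampling temperatures.
import numpy as np

def at_temperature(p, T):
    w = np.exp(np.log(p) / T)  # equivalent to softmax(logits / T)
    return w / w.sum()

# Three stand-in continuations: the collapsed mode, a perfectly consistent
# continuation the tilt suppressed, and everything else lumped together.
p = np.array([0.999, 1e-6, 1.0 - 0.999 - 1e-6])

for T in (1.0, 1.4, 2.0):
    q = at_temperature(p, T)
    print(f"T={T}: mode={q[0]:.3f}, suppressed continuation={q[1]:.1e}")
# Even at T=2.0 the suppressed continuation only climbs from ~1e-6 to ~1e-3,
# and the mode is still sampled the overwhelming majority of the time.
```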
Broadly, my take is that these results are interesting because they show how RLHF affects simulators (the reward signal shrinking the world prior / making the model more confident of the world it should be simulating) and how this affects what it does. A priori, I don’t see why this framing doesn’t hold, but it’s definitely possible that it’s just saying the same things you are and I’m reading too much into the algorithmic-difference bit, or that it simply explains too much, in which case I’d love to hear what I’m missing.