I have encountered many people with the (according to me) mistaken model that you describe in your self-quote, and am glad to see this writeup. Indeed, I think the simulators frame frustratingly causes people to make this kind of update, which I think then causes people to get pretty confused about RL (and also to imagine some cartesian difference between next-token-prediction reward and long-term-agency reward, when the difference is actually purely a matter of degree of myopia).
Why wouldn’t myopic bias make it more likely to simulate than predict? And doesn’t empirical evidence about LLMs support the simulators frame? Like, what observations persuaded you that we are not living in a world where LLMs are simulators?
I don’t think there is any reason to assume the system is likely to choose “simulation” over “prediction”? And I don’t think we’ve observed any evidence of that.
The thing that is true, which I do think matters, is that if you train your AI system only on short single forward passes, then it is less likely to get good at performing long chains of thought, since you never directly train it to do that (instead hoping that the single-step training generalizes to long chains of thought). That is the central safety property we currently rely on, and it pushes things to be a bit more simulator-like.
And I don’t think we’ve observed any evidence of that.
What about any time a system generalizes favourably, instead of predicting errors? You can say it’s just a failure of prediction, but it’s not like these failures are random.
That is the central safety property we currently rely on, and it pushes things to be a bit more simulator-like.
And what is the evidence that this property, rather than, for example, the inherent biases of NNs, is the central one? Why wouldn’t a predictor exhibit more malign goal-directedness even for short-term goals?
I can see that this whole story about modeling LLMs as predictors, and goal-directedness, and fundamental laws of cognition is logically coherent. But where is the connection to reality?
What about any time a system generalizes favourably, instead of predicting errors? You can say it’s just a failure of prediction, but it’s not like these failures are random.
I don’t understand: how is “not predicting errors” either a thing we have observed, or something that has anything to do with simulation?
Yeah, I really don’t know what you are saying here. Like, if you prompt a completion model with badly written text, it will predict badly written text. But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably learn to undo the hash, even though the source generation process never performed that operation (which is potentially much more complicated than the hashing function itself), which means it’s not really a simulator.
But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably learn to undo the hash, even though the source generation process never performed that operation (which is potentially much more complicated than the hashing function itself), which means it’s not really a simulator.
I’m saying that this won’t work with current systems, at least for a strong hash, because it’s hard; instead of learning to undo it, the model will learn to simulate, because that’s easier. You can then vary the strength of the hash to measure the degree of predictor-ness/simulator-ness and compare it with what you expect. Or do a similar thing with something other than a hash that also distinguishes the two frames.
The point is: without experiments like these, how have you come to believe in the predictor frame?
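To make that concrete, here is a minimal sketch of the kind of experiment I have in mind, in Python. Everything in it is purely illustrative rather than something from this discussion: the “weak hash” is just truncated md5, pre-images are four lowercase letters, and `generate_fn` is where a model fine-tuned on the `make_dataset` output would plug in; the two reference behaviours only show what the metric is supposed to separate as you vary the hash strength.

```python
import hashlib
import random
import string

# Illustrative setup: the "weak hash" is the first `strength` hex chars of
# md5, so lowering `strength` makes inversion easier, raising it harder.
def weak_hash(s: str, strength: int) -> str:
    return hashlib.md5(s.encode()).hexdigest()[:strength]

def make_dataset(n: int, strength: int, preimage_len: int = 4):
    """(hash, pre-image) pairs, to be rendered as '<hash> -> <pre-image>' lines."""
    data = []
    for _ in range(n):
        pre = "".join(random.choices(string.ascii_lowercase, k=preimage_len))
        data.append((weak_hash(pre, strength), pre))
    return data

def inversion_rate(generate_fn, strength: int, n_eval: int = 200) -> float:
    """Fraction of freshly sampled hashes whose completion is a true pre-image.

    `generate_fn(hash_str) -> str` stands in for sampling from a model that
    was fine-tuned on make_dataset() output and prompted with '<hash> -> '.
    """
    hits = 0
    for h, _ in make_dataset(n_eval, strength):
        if weak_hash(generate_fn(h), strength) == h:
            hits += 1
    return hits / n_eval

# Two reference behaviours the metric should separate:
def simulator_like(h: str) -> str:
    # Ignores the hash and just emits something with the right surface form.
    return "".join(random.choices(string.ascii_lowercase, k=4))

def inverter_for(strength: int):
    def invert(h: str) -> str:
        # Stands in for a model that actually learned to undo the hash
        # (brute force here, purely to show what the metric rewards).
        for _ in range(200_000):
            cand = "".join(random.choices(string.ascii_lowercase, k=4))
            if weak_hash(cand, strength) == h:
                return cand
        return ""
    return invert

if __name__ == "__main__":
    for strength in (1, 2, 3):
        print(strength,
              round(inversion_rate(simulator_like, strength), 3),
              round(inversion_rate(inverter_for(strength), strength), 3))
```

If the trained model’s curve over `strength` tracks the inverter at low strengths and collapses toward the simulator-like baseline at higher ones, that is the kind of predictor-ness/simulator-ness measurement I mean.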
I don’t understand: how is “not predicting errors” either a thing we have observed, or something that has anything to do with simulation?
I guess it is less about simulation being the right frame and more about prediction being the wrong one. But I think we have definitely observed LLMs mispredicting things we wouldn’t want them to predict. Or is this actually a crux and you haven’t seen any evidence at all against the predictor frame?
You can’t learn to simulate an undo of a hash, or at least I have no idea what you are “simulating” and why that would be “easier”. You are certainly not simulating the generation of the hash: going token by token forwards, you don’t have access to a pre-image at that point.
Of course, the reason hashes are sometimes followed by their pre-image in the training set is that they were generated in the opposite order and then simply pasted in hash -> pre-image order.
I’ve seen LLMs generate text backwards. Theoretically, an LLM can keep the pre-image in its activations, calculate the hash, and then output them in the order hash, pre-image.
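A toy sketch of what I mean, with purely illustrative names: the process being simulated can fix the pre-image first and only emit it second, so going token by token forwards says nothing about what the underlying generator had access to.

```python
import hashlib
import random
import string

def weak_hash(s: str) -> str:
    # Illustrative "weak hash": the first two hex chars of md5.
    return hashlib.md5(s.encode()).hexdigest()[:2]

def generate_hash_then_preimage() -> str:
    # The generator picks the pre-image FIRST and holds on to it
    # (for an LLM: in activations), then emits the hash before the pre-image.
    pre = "".join(random.choices(string.ascii_lowercase, k=4))
    return f"{weak_hash(pre)} -> {pre}"

# Read left to right, the output looks like hash-then-pre-image text,
# even though nothing in this process ever inverted the hash.
print(generate_hash_then_preimage())
```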