And I don’t think we’ve observed any evidence of that.
What about any time a system generalizes favourably, instead of predicting errors? You can say it’s just a failure of prediction, but it’s not like these failures are random.
That is the central safety property we currently rely on, and it pushes things to be a bit more simulator-like.
And what is the evidence that this property, rather than, say, the inherent bias of NNs, is the central one? Why wouldn’t a predictor exhibit more malign goal-directedness, even for short-term goals?
I can see that this whole story about modeling LLMs as predictors, goal-directedness, and fundamental laws of cognition is logically coherent. But where is the connection to reality?
> What about any time a system generalizes favourably, instead of predicting errors? You can say it’s just a failure of prediction, but it’s not like these failures are random.
I don’t understand: how is “not predicting errors” either a thing we have observed, or something that has anything to do with simulation?
Yeah, I really don’t know what you are saying here. Like, if you prompt a completion model with badly written text, it will predict badly written text. But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably learn to undo the hash, even though the source generation process never performed that operation (one potentially much more complicated than the hashing function itself), which means it’s not really a simulator.
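To make the setup concrete, here is a minimal sketch, purely illustrative and not from any actual experiment, of what “a very weak hash followed by its pre-image” could look like as training data; the truncation length and string format are arbitrary choices:

```python
# Illustrative sketch only: construct completion-style training examples where
# a deliberately weak "hash" (a truncated SHA-256 digest) is followed by the
# string that produced it. Whether a model trained on such data learns to
# invert the hash is precisely the open question in this exchange.
import hashlib
import random
import string

def weak_hash(s: str, n_hex_chars: int = 4) -> str:
    # Truncating the digest to a few hex characters is one way to make the
    # "hash" weak: with so few output bits, inversion is easy in principle.
    return hashlib.sha256(s.encode()).hexdigest()[:n_hex_chars]

def make_example(length: int = 6) -> str:
    pre_image = "".join(random.choices(string.ascii_lowercase, k=length))
    # The pre-image is generated first and hashed afterwards, but the text is
    # laid out hash -> pre-image, so a left-to-right predictor sees the hash
    # before the string that produced it.
    return f"hash: {weak_hash(pre_image)} pre-image: {pre_image}"

corpus = [make_example() for _ in range(100_000)]
```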
> But also, if you train a completion model on data where a very weak hash is followed by its pre-image, it will probably learn to undo the hash, even though the source generation process never performed that operation (one potentially much more complicated than the hashing function itself), which means it’s not really a simulator.
I’m saying that this won’t work with current systems, at least for a strong hash, because it’s hard: instead of learning to undo the hash, the model will learn to simulate, because that’s easier. Then you can vary the strength of the hash to measure the degree of predictorness/simulatorness and compare it with what you expect. Or do a similar thing with something other than a hash that also distinguishes the two frames.
The point is that without experiments like these, how have you come to believe in the predictor frame?
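A rough sketch of how that measurement might go, assuming a complete(prompt) function standing in for whatever completion model is being probed; the function, prompt format, and scoring rule are all assumptions made up for illustration:

```python
# Hypothetical sketch of the proposed experiment: sweep the hash strength and
# check whether completions are genuine pre-images (the model learned to undo
# the hash) or merely pre-image-shaped strings (it learned what such text
# looks like). `complete` is a stand-in for the model under test.
import hashlib
import random
import string

def pre_image_accuracy(complete, n_hex_chars: int, n_trials: int = 1000) -> float:
    hits = 0
    for _ in range(n_trials):
        pre_image = "".join(random.choices(string.ascii_lowercase, k=6))
        h = hashlib.sha256(pre_image.encode()).hexdigest()[:n_hex_chars]
        guess = complete(f"hash: {h} pre-image:").strip()
        # Score by whether the guess hashes back to the prompted value rather
        # than by exact match with the original pre-image (collisions count).
        if hashlib.sha256(guess.encode()).hexdigest()[:n_hex_chars] == h:
            hits += 1
    return hits / n_trials

# Sweeping n_hex_chars from very weak (1-2 hex chars) to the full digest would,
# on this proposal, show where the model stops undoing the hash and falls back
# to producing generic pre-image-like text.
```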
> I don’t understand: how is “not predicting errors” either a thing we have observed, or something that has anything to do with simulation?
I guess it is less about simulation being the right frame and more about prediction being the wrong one. But I think we have definitely observed LLMs failing to predict things we wouldn’t want them to predict. Or is this actually a crux, and you haven’t seen any evidence at all against the predictor frame?
You can’t learn to simulate an undo of a hash, or at least I have no idea what you are “simulating” and why that would be “easier”. You are certainly not simulating the generation of the hash: going token by token forwards, you don’t have access to the pre-image at that point.
Of course, the reason hashes are sometimes followed by their pre-images in the training set is that they were generated in the opposite order and then simply pasted in hash->pre-image order.
I’ve seen LLMs generate text backwards. Theoretically, an LLM can keep the pre-image in its activations, compute the hash, and then output them in the order hash, pre-image.
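To spell out what that would mean mechanically, here is a toy sketch, hypothetical and not a claim about any real model’s internals, of a forward, token-by-token generation process that produces hash -> pre-image order without ever inverting the hash:

```python
# Toy illustration: a generator can commit to the pre-image first, hold it
# (the analogue of "keeping it in activations"), emit the hash computed from
# it, and only then emit the pre-image. No inversion ever happens.
import hashlib
import random
import string

def generate_hash_then_pre_image() -> str:
    pre_image = "".join(random.choices(string.ascii_lowercase, k=6))  # decided up front
    h = hashlib.sha256(pre_image.encode()).hexdigest()[:4]            # computed forwards from the pre-image
    return f"hash: {h} pre-image: {pre_image}"                        # emitted in hash -> pre-image order
```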