I mostly disagree with the quote as I understand it.
Unfortunately, it’s very unclear why ability to predict generalization behavior on other tasks would transfer to being able to predict generalization behavior in the cases that we care about—and we can’t test the case that we care about directly due to RSA-2048-style problems.
I don’t buy the RSA-2048 example as plausible generalization that gets baked into weights (though I know that example isn’t meant to be realistic). I agree there exist in weight-space some bad models which this won’t catch, though it’s not obvious to me that they’re realistic cases. I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI’s “pseudocode.”
However, I’m pretty uncertain here, and could imagine you giving me a persuasive counterexample. (I’ve already updated downward a bit, in expectation of that.) I would be pretty surprised if I ended up concluding “there isn’t much transfer to predicting generalization in cases we care about” as opposed to “there are some cases where we miss some important transfer insights.”
For example, if you wanted to generally predict model behavior right now, you’d probably just want to get really good at understanding webtext, practice the next token prediction game, etc.
I think next-token prediction game / statistics of the pretraining corpus gets you some of the way and are the lowest hanging fruit, but to get below a certain misprediction threshold, you need to really start understanding the model.
Or if you really give me freedom to do whatever I want to predict some model’s generalization behavior, I could just train another similar model and see what it does, which obviously isn’t actually producing any additional understanding.
This seems avoided by the stipulation that developers can’t reference AIs which you can’t pass this test for. However, there’s some question about “if you compose together systems you understand, do you understand the composite system”, and I think the answer is no in general, so probably there needs to be more rigor in the “use approved AIs” rule (e.g. “you have to be able to predict the outputs of composite helper AI/AI systems, not just the outputs of the AIs themselves.”)
I agree there exist in weight-space some bad models which this won’t catch, though it’s not obvious to me that they’re realistic cases.
It’s fine that you would guess that, but without a strong reason to believe it’s true—which I definitely don’t think we have—you can’t use something like this as a sufficient condition to label a model as safe.
I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI’s “pseudocode.”
I think this is just really not true. Consider:
Suppose I have a deceptive superintelligent AI. It has some random goal, but is trained on next token prediction and is doing that for instrumental reasons.
To predict what that AI will do, I can just assume that it’s essentially a perfect next token predictor of webtext. It’s so much better than me at predicting webtext, and so much better than anything I can build without building another AI at predicting webtext, that the actual details of the system are irrelevant to predicting it.
As a result, the only thing that matters for this condition is how good of a system I can build to do webtext prediction that technically satisfies all of your requirements.
However, since the AI is only pretending to predict webtext, all of this only gives you any actual ability to effectively predict it while it’s still doing that. That is, this only helps you on the distribution that you can sample from.
So no matter how good of a webtext predictor I produce, and thus no matter how good at predicting the AI system I get, as soon as I move to a distribution where e.g. the model knows it has a good opportunity to defect against me, it’ll do so in a completely unexpected way.
It’s fine that you would guess that, but without a strong reason to believe it’s true—which I definitely don’t think we have—you can’t use something like this as a sufficient condition to label a model as safe.
After thinking more about it earlier this week, I agree.
I was initially more bullish on “this seems sufficient and also would give a lot of time to understand models” (in which case you can gate model deployment with this alone) but I came to think “prediction requirements track something important but aren’t sufficient” (in which case this is one eval among many). The post starts off with “this is a sufficient condition”, and then equivocates between the two stances. I’ll strike the “sufficient” part and then clarify my position.
I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI’s “pseudocode.”
I think this is just really not true. Consider:
The quote you’re responding to is supposed to be about the cases I expect us to actually encounter (e.g. developers slowly are allowed to train larger/more compute-intensive models, after predicting the previous batch; developers predict outputs throughout training and don’t just start off with a superintelligence). My quote isn’t meaning to address hypothetical worst-case in weight space. (This might make more sense given my above comments and agreement on sufficiency.)
To predict what that AI will do, I can just assume that it’s essentially a perfect next token predictor of webtext. It’s so much better than me at predicting webtext, and so much better than anything I can build without building another AI at predicting webtext, that the actual details of the system are irrelevant to predicting it.
Setting aside my reply above and assuming your scenario, I disagree with chunks of this. I think that pretrained models are not “predicting webtext” in precise generality (although I agree they are to a rather loose first approximation).
Furthermore, I suspect that precise logit prediction tells you (in practice) about the internal structure of the superintelligence doing the pretending. I think that an algorithm’s exact output logits will leak bits about internals, but I’m really uncertain how many bits. I hope that this post sparks discussion of that information content.
We expect language models to build models of the world which generated the corpus which they are trained to predict. Analogously, teams of humans (and their predictable helper AIs) should come to build (partial, incomplete) mental models of the AI whose logits they are working to predict.
One way this argument fails is that, given some misprediction tolerance, there are a range of algorithms which produce the given logits. Maybe predicting 200 logit distributions doesn’t pin that down enough to actually be confident in one’s understanding. I agree with that critique. And I still think there’s something quite interesting and valuable about this eval, which I (perhaps wrongly) perceive you to dismiss.
I mostly disagree with the quote as I understand it.
I don’t buy the RSA-2048 example as plausible generalization that gets baked into weights (though I know that example isn’t meant to be realistic). I agree there exist in weight-space some bad models which this won’t catch, though it’s not obvious to me that they’re realistic cases. I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI’s “pseudocode.”
However, I’m pretty uncertain here, and could imagine you giving me a persuasive counterexample. (I’ve already updated downward a bit, in expectation of that.) I would be pretty surprised if I ended up concluding “there isn’t much transfer to predicting generalization in cases we care about” as opposed to “there are some cases where we miss some important transfer insights.”
I think next-token prediction game / statistics of the pretraining corpus gets you some of the way and are the lowest hanging fruit, but to get below a certain misprediction threshold, you need to really start understanding the model.
This seems avoided by the stipulation that developers can’t reference AIs which you can’t pass this test for. However, there’s some question about “if you compose together systems you understand, do you understand the composite system”, and I think the answer is no in general, so probably there needs to be more rigor in the “use approved AIs” rule (e.g. “you have to be able to predict the outputs of composite helper AI/AI systems, not just the outputs of the AIs themselves.”)
It’s fine that you would guess that, but without a strong reason to believe it’s true—which I definitely don’t think we have—you can’t use something like this as a sufficient condition to label a model as safe.
I think this is just really not true. Consider:
Suppose I have a deceptive superintelligent AI. It has some random goal, but is trained on next token prediction and is doing that for instrumental reasons.
To predict what that AI will do, I can just assume that it’s essentially a perfect next token predictor of webtext. It’s so much better than me at predicting webtext, and so much better than anything I can build without building another AI at predicting webtext, that the actual details of the system are irrelevant to predicting it.
As a result, the only thing that matters for this condition is how good of a system I can build to do webtext prediction that technically satisfies all of your requirements.
However, since the AI is only pretending to predict webtext, all of this only gives you any actual ability to effectively predict it while it’s still doing that. That is, this only helps you on the distribution that you can sample from.
So no matter how good of a webtext predictor I produce, and thus no matter how good at predicting the AI system I get, as soon as I move to a distribution where e.g. the model knows it has a good opportunity to defect against me, it’ll do so in a completely unexpected way.
After thinking more about it earlier this week, I agree.
I was initially more bullish on “this seems sufficient and also would give a lot of time to understand models” (in which case you can gate model deployment with this alone) but I came to think “prediction requirements track something important but aren’t sufficient” (in which case this is one eval among many). The post starts off with “this is a sufficient condition”, and then equivocates between the two stances. I’ll strike the “sufficient” part and then clarify my position.
The quote you’re responding to is supposed to be about the cases I expect us to actually encounter (e.g. developers slowly are allowed to train larger/more compute-intensive models, after predicting the previous batch; developers predict outputs throughout training and don’t just start off with a superintelligence). My quote isn’t meaning to address hypothetical worst-case in weight space. (This might make more sense given my above comments and agreement on sufficiency.)
Setting aside my reply above and assuming your scenario, I disagree with chunks of this. I think that pretrained models are not “predicting webtext” in precise generality (although I agree they are to a rather loose first approximation).
Furthermore, I suspect that precise logit prediction tells you (in practice) about the internal structure of the superintelligence doing the pretending. I think that an algorithm’s exact output logits will leak bits about internals, but I’m really uncertain how many bits. I hope that this post sparks discussion of that information content.
We expect language models to build models of the world which generated the corpus which they are trained to predict. Analogously, teams of humans (and their predictable helper AIs) should come to build (partial, incomplete) mental models of the AI whose logits they are working to predict.
One way this argument fails is that, given some misprediction tolerance, there are a range of algorithms which produce the given logits. Maybe predicting 200 logit distributions doesn’t pin that down enough to actually be confident in one’s understanding. I agree with that critique. And I still think there’s something quite interesting and valuable about this eval, which I (perhaps wrongly) perceive you to dismiss.