I agree there exist in weight-space some bad models which this won’t catch, though it’s not obvious to me that they’re realistic cases.
It’s fine that you would guess that, but without a strong reason to believe it’s true—which I definitely don’t think we have—you can’t use something like this as a sufficient condition to label a model as safe.
I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI’s “pseudocode.”
I think this is just really not true. Consider:
Suppose I have a deceptive superintelligent AI. It has some random goal, but is trained on next token prediction and is doing that for instrumental reasons.
To predict what that AI will do, I can just assume that it’s essentially a perfect next-token predictor of webtext. It’s so much better at predicting webtext than I am, or than anything I could build short of another AI, that the actual details of the system are irrelevant to predicting it.
As a result, the only thing that matters for this condition is how good a webtext-prediction system I can build that technically satisfies all of your requirements.
However, since the AI is only pretending to predict webtext, all of this only gives you the ability to predict it while it’s still doing that. That is, this only helps you on the distribution that you can sample from.
So no matter how good a webtext predictor I produce, and thus no matter how good I get at predicting the AI system, as soon as I move to a distribution where e.g. the model knows it has a good opportunity to defect against me, it’ll do so in a completely unexpected way.
It’s fine that you would guess that, but without a strong reason to believe it’s true—which I definitely don’t think we have—you can’t use something like this as a sufficient condition to label a model as safe.
After thinking more about it earlier this week, I agree.
I was initially more bullish on “this seems sufficient and also would give a lot of time to understand models” (in which case you can gate model deployment with this alone) but I came to think “prediction requirements track something important but aren’t sufficient” (in which case this is one eval among many). The post starts off with “this is a sufficient condition”, and then equivocates between the two stances. I’ll strike the “sufficient” part and then clarify my position.
I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI’s “pseudocode.”
I think this is just really not true. Consider:
The quote you’re responding to is supposed to be about the cases I expect us to actually encounter (e.g. developers are gradually allowed to train larger/more compute-intensive models after predicting the previous batch; developers predict outputs throughout training and don’t just start off with a superintelligence). My quote isn’t meant to address hypothetical worst-case models in weight space. (This might make more sense given my above comments and agreement on sufficiency.)
To predict what that AI will do, I can just assume that it’s essentially a perfect next-token predictor of webtext. It’s so much better at predicting webtext than I am, or than anything I could build short of another AI, that the actual details of the system are irrelevant to predicting it.
Setting aside my reply above and assuming your scenario, I disagree with chunks of this. I think that pretrained models are not “predicting webtext” in precise generality (although I agree they are to a rather loose first approximation).
Furthermore, I suspect that precise logit prediction tells you (in practice) about the internal structure of the superintelligence doing the pretending. I think that an algorithm’s exact output logits will leak bits about internals, but I’m really uncertain how many bits. I hope that this post sparks discussion of that information content.
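As a rough sketch of that information content (with made-up numbers throughout: nothing here fixes the actual tolerance, prompt count, or how many logits per prompt get checked), one can upper-bound how many bits a successful set of logit predictions even specifies:

```python
import math

# Back-of-envelope upper bound on the information content of a set of logit
# predictions. All parameters below are illustrative assumptions, not values
# taken from the post or from any concrete eval.

def prediction_bits(n_prompts: int, logits_per_prompt: int,
                    logit_range: float, tolerance: float) -> float:
    """Bits needed to specify each checked logit to within +/- tolerance,
    assuming logits lie in an interval of width logit_range."""
    bits_per_logit = math.log2(logit_range / (2 * tolerance))
    return n_prompts * logits_per_prompt * bits_per_logit

# e.g. 200 prompts, top-5 logits each, logits spanning ~20, predicted to +/- 0.5:
print(prediction_bits(200, 5, 20.0, 0.5))  # ~4,300 bits
```

The bound grows linearly in the number of prompts and logits checked, but only logarithmically as the tolerance is tightened.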
We expect language models to build models of the world that generated the corpus they are trained to predict. Analogously, teams of humans (and their predictable helper AIs) should come to build (partial, incomplete) mental models of the AI whose logits they are working to predict.
One way this argument fails is that, given some misprediction tolerance, there is a range of algorithms that produce the given logits. Maybe predicting 200 logit distributions doesn’t pin that down enough to actually be confident in one’s understanding. I agree with that critique. And I still think there’s something quite interesting and valuable about this eval, which I (perhaps wrongly) perceive you to dismiss.
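For concreteness, here is a minimal sketch of what a tolerance-based version of the check could look like. The target model and the human-plus-helper predictor are stubbed out with random placeholders, and the metric (max absolute log-prob error over the target’s top-k tokens), tolerance, and prompt count are arbitrary choices for illustration, not ones anyone has committed to:

```python
import numpy as np

VOCAB = 50_000
TOP_K = 5        # only the top-k tokens of the target distribution are checked (arbitrary)
TOLERANCE = 0.5  # max absolute log-prob error allowed on those tokens (arbitrary)

rng = np.random.default_rng(0)

def target_logprobs(prompt: str) -> np.ndarray:
    """Placeholder for the model under evaluation: next-token log-probs."""
    logits = rng.normal(size=VOCAB)
    return logits - np.logaddexp.reduce(logits)

def predicted_logprobs(prompt: str) -> np.ndarray:
    """Placeholder for the human + helper-AI prediction of those log-probs."""
    logits = rng.normal(size=VOCAB)
    return logits - np.logaddexp.reduce(logits)

def passes_eval(prompts: list[str]) -> bool:
    for prompt in prompts:
        actual = target_logprobs(prompt)
        guess = predicted_logprobs(prompt)
        top = np.argsort(actual)[-TOP_K:]  # indices of the target's top-k tokens
        if np.max(np.abs(actual[top] - guess[top])) > TOLERANCE:
            return False                   # one bad misprediction fails the whole eval
    return True

print(passes_eval([f"prompt {i}" for i in range(200)]))
```

With honest predictors in place of the random stubs, the interesting question is the one above: how tight the tolerance and how many prompts you need before passing meaningfully constrains which algorithm the model is running.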