Reactions to the paper were mostly positive, but discussion was minimal and the ideas largely failed to gain traction. I suspect the muted reception was partly due to the size of the paper, which tried both to establish the research area (predictive models) and to develop a novel contribution (conditioning them).
I think the proposed approach to safety doesn’t make much sense and seems unlikely to be a very useful direction. I haven’t written up a review because it didn’t seem like that many people were interested in pursuing this direction.
I think CPM (Conditioning Predictive Models) does do somewhat interesting conceptual work, with two main contributions:
It notes that “LLMs might generalize in a way which is reasonably well interpreted as conditioning, and this could be important and useful”. I think this is one of the obvious baseline hypotheses for how LLMs (or similarly trained models) generalize, and it seems good to point it out.
It notes various implications of this, with varying degrees of speculativeness.
But the actual safety proposal seems extremely dubious IMO.
I’d be very interested in hearing the reasons why you’re skeptical of the approach, even a bare-bones outline if that’s all you have time for.
For the proposed safety strategy (conditioning models to generate safety research based on alternative future worlds) to beat naive baselines (RLHF), you need:
The CPM abstraction to hold extremely strongly, in ways that seem unlikely. E.g., models need to actually generalize basically in this conditioning-like way.
The advantage has to come from understanding exactly what conditional you’re getting. In other words, the key property is an interpretability-type property where you have a more mechanistic understanding of what’s going on. Suppose you’re getting the conditional via prompting: if you just look at the output and then iterate on prompts until you get outputs that seem to perform better, such that most of the optimization isn’t understood, then you’re basically back in the RL case.
It seems genuinely hard to understand what conditional you’ll get from a prompt. This might also be limited by the model’s overall understanding.
I think it’s quite unlikely that extracting human-understandable conditionals is competitive with other training methods (RL, SFT), particularly because it will be hard to understand exactly what conditional you’re getting.
I think you probably get wrecked by models needing to understand that they are AIs to at least some extent.
I think you also plausibly get wrecked by models detecting that they are AIs and then degrading to GPT-3.5-level performance.
You could hope for substantial coordination to wait for even bigger models that you only use via CPM, but I think bigger models are much riskier than making transformatively useful AI via well-elicited smaller models, so this seems to just make the situation worse, putting aside coordination feasibility.
TBC, I think an insight like “models might generalize in a conditioning-ish sort of way even after RL, so maybe we should make some tweaks to our training process to improve safety based on this hypothesis” is a good one. But this isn’t really an overall safety proposal IMO, and a bunch of the other ideas in the Conditioning Predictive Models paper seem pretty dubious, or at least overconfident, to me.
Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives.
If we’re looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I have the expectation that we’re going to get them anyway. With those more powerful models come new potential issues, like predicting manipulated observations and performative prediction, that we don’t see in current systems. Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.
My motivation for this approach is to raise awareness of, and address, the risks that seem likely to arise in future predictive models, regardless of the ends to which they’re used. Success in avoiding the dangers of powerful predictive models would then open the possibility of using them to reduce all-cause existential risk.
I think I was a bit unclear. Suppose that by default GPT-6, if maximally elicited, would be transformatively useful (e.g., capable of speeding up AI safety R&D by 10x). Then I’m saying CPM would require coordinating not to use those models and instead waiting for GPT-8 to hit the same level of transformative usefulness. But GPT-8 is actually much riskier, by virtue of being much smarter.
(I also edited my comment to improve clarity.)