For the proposed safety strategy (conditioning models to generate safety research based on alternative future worlds) to beat naive baselines (RLHF), you need:
- The CPM abstraction to hold extremely strongly, in ways that seem unlikely. E.g., models need to generalize basically in this way.
- The advantage to come from understanding exactly what conditional you’re getting. In other words, the key property is an interpretability-type property: you have a more mechanistic understanding of what’s going on. Suppose you’re getting the conditional via prompting. If you just look at the output and then iterate on prompts until the outputs seem to perform better, where most of the optimization isn’t understood, then you’re basically back in the RL case.

It also seems genuinely hard to understand what conditional you’ll get from a prompt, and this might be limited by the model’s overall understanding.
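To make the contrast in the second point above concrete, here is a minimal, purely illustrative Python sketch (not from the paper; `sample_continuation` and `rate_output` are hypothetical stand-ins, not any real API). Setup (a) writes down an explicit conditional, so it is at least legible which conditional is being requested; setup (b) searches over prompts against an output score, which applies selection pressure that nobody understands.

```python
# Toy illustration only: `sample_continuation` and `rate_output` are
# hypothetical stand-ins for a predictive model and an output rater.
import random


def sample_continuation(prompt: str) -> str:
    """Stand-in for sampling a continuation from a predictive model."""
    rng = random.Random(prompt)  # deterministic per prompt, for the toy
    return f"[continuation; usefulness={rng.random():.2f}]"


def rate_output(output: str) -> float:
    """Stand-in for a human or automated rating of how useful the output looks."""
    return float(output.split("usefulness=")[1].rstrip("]"))


# (a) Explicit conditional: the operator writes down the world being
# conditioned on, so it is at least legible *which* conditional is intended.
explicit_conditional = (
    "The following is an alignment research report written in a "
    "counterfactual 2035 in which capabilities progress halted in 2024:\n"
)
report = sample_continuation(explicit_conditional)
print("explicit conditional ->", report)

# (b) Blind prompt search: the winning prompt is selected for scoring well,
# not for inducing a conditional anyone understands -- optimization pressure
# that looks more like RL than like a known conditional.
candidate_prompts = [
    "Write a brilliant alignment research report:\n",
    "You are the world's best alignment researcher. Report:\n",
    explicit_conditional,
]
best_prompt = max(
    candidate_prompts,
    key=lambda p: rate_output(sample_continuation(p)),
)
print("prompt-search winner ->", best_prompt)
```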
I think it’s quite unlikely that extracting human-understandable conditionals is competitive with other training methods (RL, SFT), particularly because it will be hard to understand exactly what conditional you’re getting.
I think you probably get wrecked by models needing to understand that they are AIs to at least some extent.
I think you also plausibly get wrecked by models detecting that they are AIs and then degrading to GPT-3.5 level performance.
You could hope for substantial coordination to wait for even bigger models that you only use via CPM, but I think bigger models are much riskier than making transformatively useful AI via well-elicited smaller models, so this seems to just make the situation worse, putting aside coordination feasibility.
TBC, I think that some insight like “models might generalize in a conditioning-ish sort of way even after RL; maybe we should make some tweaks to our training process to improve safety based on this hypothesis” seems like a good idea. But this isn’t really an overall safety proposal IMO, and a bunch of the other ideas in the Conditioning Predictive Models paper seem pretty dubious, or at least overconfident, to me.
Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives.
You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well elicited small models so this seems to just make the situation worse putting aside coordination feasibility.
If we’re looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I have the expectation that we’re going to get them anyway. With those more powerful models come new potential issues, like predicting manipulated observations and performative prediction, that we don’t see in current systems. Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.
My motivation for this approach is to raise awareness of, and address, the risks that seem likely to arise in future predictive models, regardless of the ends to which they’re used. Success in avoiding the dangers from powerful predictive models would then open the possibility of using them to reduce all-cause existential risk.
I also agree that bigger models are much riskier, but I have the expectation that we’re going to get them anyway
I think I was a bit unclear. Suppose that, by default, GPT-6 would be transformatively useful if maximally elicited (e.g., capable of speeding up AI safety R&D by 10x). Then I’m saying CPM would require coordinating to not use these models that way, and instead waiting for GPT-8, which would hit this same level of transformative usefulness when used only via CPM. But GPT-8 is actually much riskier, because it is much smarter.
(I also edited my comment to improve clarity.)