The OP doesn’t explicitly make this jump, but it’s dangerous to conflate the claim “specialized models seem most likely” with the claim “short-term-motivated safety research should be evaluated in terms of these specialized models”.
I agree with the former statement, but at the same time, the highest-x-risk / highest-EV short-term safety opportunity is probably different. For instance, consider a less likely but higher-impact scenario: a future code-generation LM either directly or indirectly* creates an unaligned, far-improved architecture. Researchers at the relevant org do not recognize this discontinuity and run the model, followed by disaster.
*E.g. a model proposing an improved Quoc Le-style architecture search seems quite plausible to me.
Great point. I agree and should have said something like that in the post.
To expand on this a bit more: studying these specialized models will be valuable for improving their robustness and performance. It’s possible that this research will also be useful for alignment in general, but it doesn’t seem like the most promising approach. That being said, I want to see alignment researchers working on diverse approaches.