It might also be worth thinking about additional ways to make models smarter safely, especially ways that don't increase the compute used in a single forward pass (to disincentivize deceptive alignment). Tool use is one example, if one thinks it is probably safe because it is interpretable (similar to CoT, combined with guarding against steganography); a minimal sketch of what "interpretable" means here follows below. This recent review seems like a good starting point: AI capabilities can be significantly improved without expensive retraining.
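To make the "interpretable because everything happens in plain text" intuition concrete, here is a minimal sketch of such a tool-use loop, where every tool call and result lands in a human-auditable transcript. The model stub, the `CALL`/`RESULT` convention, and the calculator whitelist are all hypothetical illustration, not any particular framework's API:

```python
# Minimal sketch: tool use where every step is legible plain text.
# A monitor (human or automated) can audit the full transcript, much
# like auditing a chain of thought. `model_step` is a stub standing in
# for any LM sampling call.
import re

def model_step(transcript: str) -> str:
    """Stub LM: in practice, a forward pass or API call."""
    if "RESULT:" not in transcript:
        return "CALL calculator: 17 * 23"
    return "ANSWER: 391"

def run_calculator(expr: str) -> str:
    # Whitelist arithmetic characters only, so the tool stays simple
    # and the eval below cannot touch names or attributes.
    if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
        raise ValueError(f"non-arithmetic expression: {expr!r}")
    return str(eval(expr))

def agent_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"QUESTION: {question}\n"
    for _ in range(max_steps):
        action = model_step(transcript)
        transcript += action + "\n"
        match = re.match(r"CALL calculator: (.+)", action)
        if match:
            transcript += f"RESULT: {run_calculator(match.group(1))}\n"
        elif action.startswith("ANSWER:"):
            print(transcript)  # the full, human-auditable trace
            return action.removeprefix("ANSWER:").strip()
    raise RuntimeError("no answer within step budget")

print(agent_loop("What is 17 * 23?"))  # -> 391
```

The point of the design is that the extra capability lives in the transcript and the tools, not in a bigger forward pass, so the usual CoT-monitoring story (plus steganography checks on the text) applies to the tool calls too.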
Speculatively, applying IDA-like schemes might also be useful here; see e.g. this recent paper: Chain-of-Thought Reasoning is a Policy Improvement Operator.
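For intuition, the IDA-flavored loop suggested by that paper can be caricatured in a few lines: amplify by sampling with CoT, filter for correctness, then distill the resulting answers back into the direct (no-CoT) policy. This is a rough sketch under those assumptions, not the paper's exact algorithm; `sample_with_cot`, `check_answer`, and `finetune_direct` are placeholders for project-specific components:

```python
# Caricature of CoT-as-policy-improvement: the CoT-augmented model acts
# as the amplified policy, and distillation trains the base (direct)
# policy to match it. All three callables are hypothetical placeholders.

def self_improvement_round(model, questions, sample_with_cot,
                           check_answer, finetune_direct):
    distill_set = []
    for q in questions:
        cot, answer = sample_with_cot(model, q)   # amplification step
        if check_answer(q, answer):               # keep verified answers only
            distill_set.append((q, answer))       # drop the CoT itself
    # Distillation step: train the model to produce the answer directly,
    # without the intermediate reasoning tokens.
    return finetune_direct(model, distill_set)

# Iterating the round is what makes CoT act like repeated policy
# improvement:
#   for _ in range(n_rounds):
#       model = self_improvement_round(model, questions, ...)
```

Note this keeps the safety-relevant property from above: each round's capability gain comes from more (visible) CoT sampling at training time, not from a larger single forward pass.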