I agree that this might be our crux, so I’ll try to briefly explain my side. My view is still more or less that training LLMs with RL enhances their capabilities and makes them more agentic, but that it is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. I think this is only true with open-ended RL because:
Regular CoT/prompt engineering is effective, but not so effective that I expect it to meaningfully change the incentive landscape for creating base models. For example, when people figured out that CoT improved benchmarks for GPT-3.5, I don’t think that disincentivized the development of GPT-4. In contrast, I do think the creation of the o-series models (with open-ended RL) is actively disincentivizing the development of GPT-5, which I see as a good thing.
Open-ended RL might make compositional AI safer, not less safe. Done right, it discourages models from learning to reason strongly in the forward pass, which is imo the most dangerous capability.
Again, I agree that you don’t need open-ended RL for CoT systems, but if you aren’t using RL on the entire output then you need a more capable forward pass, and this seems bad. In effect, your options are:
1. Build a regular model and then do CoT during inference (e.g. via prompt engineering)
2. Build a model and reward it based on its CoT during training with RL
Option 1 creates much more capable forward passes; Option 2 does not. I think we have a much better shot at aligning models built the second way.
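To make the contrast concrete, here is a minimal sketch of where the optimization pressure lands in each option. Every name in it (`generate`, `reward_model`, `rl_update`, `option_1`, `option_2_training_step`) is a hypothetical placeholder I made up for illustration, not any particular lab's or library's API; the point is only to show what gets rewarded, not how any real system is implemented.

```python
# Hypothetical sketch, not a real training pipeline.

def generate(model, prompt: str) -> str:
    """Placeholder for sampling a full completion (CoT + answer) from a model."""
    return model(prompt)

# Option 1: the model is frozen; CoT is elicited purely at inference time.
# Whatever reasoning happens has to already live in the forward pass of base_model.
def option_1(base_model, question: str) -> str:
    cot_prompt = f"{question}\nLet's think step by step."
    return generate(base_model, cot_prompt)

# Option 2: during training, reward is computed on the entire sampled output
# (the CoT plus the final answer) and the policy is updated with RL.
# The gradient pressure pushes reasoning out into the legible CoT rather than
# into a single, more capable forward pass.
def option_2_training_step(policy_model, reward_model, rl_update, question: str):
    full_output = generate(policy_model, question)    # CoT + answer
    reward = reward_model(question, full_output)      # scored on the whole trace
    rl_update(policy_model, question, full_output, reward)
```

In the first case the forward pass has to carry the reasoning; in the second, the reward signal attaches to the externalized CoT, which is the property I'm arguing is easier to align.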