I think our crux is that, for a reason I don’t fully understand yet, you equate “CoT systems”[1] with RL training. Again, I fully agree with you that systems based on CoT (or other hard-coded software and prompt engineering over weaker underlying models) are much safer than black-box end-to-end models. But why do you assume that this requires open-ended RL[2]? Wouldn’t you agree that it would be safer with simple LLMs that haven’t undergone open-ended RL? I also like your idea of a CoT system with two parts, but again, I’d argue that you don’t need open-ended RL for that.
Other points that I somewhat disagree with, but that I don't think are the core of the crux, so we probably shouldn't delve into them:
I still think that priority number one should be to not exceed a dangerous level of capabilities, even with robust safety features.
I find both “we shouldn’t build dangerous AI” and “we should reduce homelessness” to be useful statements. To come up with good plans, we first need to know what we’re aiming for.
[1] I used the term “compositional AI systems”, but for the sake of ease of communication, let’s use your terminology.
[2] See my definition above for open-ended RL.
I agree that this might be our crux, so I’ll try to briefly explain my side. My view is still, more or less, that training LLMs with RL enhances their capabilities and makes them more agentic, but that it is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. I think this is only true with open-ended RL because:
Regular CoT/prompt engineering is effective, but not so effective that I expect it to meaningfully change the incentive landscape for creating base models. For example, when people figured out that CoT improved benchmarks for GPT-3.5, I don’t think that disincentivized the development of GPT-4. In contrast, I do think the creation of the o-series models (with open-ended RL) is actively disincentivizing the development of GPT-5, which I see as a good thing.
Open-ended RL might make compositional AI safer, not less safe. Done right, it discourages models from learning to reason strongly in the forward pass, which is imo the most dangerous capability.
Again, I agree that you don’t need open-ended RL for CoT systems, but if you aren’t using RL on the entire output then you need a more capable forward pass, and this seems bad. In effect, your options are:
1. Build a regular model and then do CoT at inference time (e.g. via prompt engineering).
2. Build a model and reward it based on its CoT during training with RL.
Option 1 creates much more capable forward passes; Option 2 does not. I think we have a much better shot at aligning models built the second way.
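To make the contrast concrete, here is a minimal Python sketch of the two options. Everything in it is a hypothetical stand-in: `base_model`, the "think step by step" prompt, and the `reward` function are toy placeholders, and the actual RL update is reduced to a comment, since the point is only where the optimization pressure lands, not how a real pipeline is built.

```python
# Toy sketch (not a real training pipeline) contrasting the two options above.
# `base_model`, `reward`, and the update step are hypothetical placeholders.

def base_model(prompt: str) -> str:
    """Stand-in for one forward pass of a pretrained LLM (hypothetical)."""
    return "step 1: 6 * 7 means six sevens. step 2: 6 * 7 = 42. Answer: 42"

# Option 1: CoT elicited purely at inference time via prompt engineering.
# All of the reasoning ability has to already live in the forward pass.
def option_1_inference_cot(question: str) -> str:
    prompt = f"{question}\nLet's think step by step."
    return base_model(prompt)

# Option 2: CoT shaped during training. The reward is computed on the whole
# visible chain of thought, so the training pressure is on the legible chain
# rather than on making any single forward pass stronger.
def reward(transcript: str) -> float:
    """Placeholder reward: 1.0 if the final answer looks correct (hypothetical)."""
    return 1.0 if transcript.strip().endswith("42") else 0.0

def option_2_rl_on_cot(questions: list[str], steps: int = 2) -> None:
    for _ in range(steps):
        for q in questions:
            transcript = base_model(q)   # model emits reasoning + answer
            r = reward(transcript)       # score the entire chain of thought
            # A policy-gradient update on `transcript`, weighted by `r`, would go here.
            print(f"reward={r:.1f} for: {transcript!r}")

if __name__ == "__main__":
    print(option_1_inference_cot("What is 6 * 7?"))
    option_2_rl_on_cot(["What is 6 * 7?"])
```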