Nadav Brandes comments on AGI with RL is Bad News for Safety

Nadav Brandes 25 Dec 2024 16:17 UTC
1 point
Before I attempt to respond to your objections, I want to first make sure that I understand your reasoning.
I think you’re saying that in theory it would be better to have CoT systems based on pure LLMs, but you don’t expect these to be powerful enough without open-ended RL, so this approach won’t be incentivized and will die out in competition with AI labs that do use open-ended RL. Is that a faithful summary of (part of) your view?
You are also saying that if done right, open-ended RL discourages models from learning to reason strongly in the forward pass. Can you explain what you mean exactly and why you think that?
I think you are also saying that models trained with open-ended RL are easier to align than pure LLMs. Is it because you expect them to be overall more capable (and therefore easier to do anything with, including alignment), or for another reason?
In case it helps to clarify our crux, I’d like to add that I agree with you that AI systems without open-ended RL would likely be much weaker than those with it, so I’m definitely expecting incentives to push more and more AI labs to use this technique. I just wish we could somehow push against these incentives. Pure LLMs producing weaker AI systems is in my opinion a feature, not a bug. I think our society would benefit from slower progress in frontier AGI.
- purple fire 27 Dec 2024 4:33 UTC
  1 point
  These are close but not quite the claims I believe.
  I do think that CoT systems based on pure LLMs will never be that good at problem-solving because a webtext-trained assistant just isn’t that good at working with long chains of reasoning. I think any highly capable CoT system will require at least some RL (or be pre-trained on synthetic data from another CoT system that was trained with RL, but I’m not sure it makes a difference here). I’m a little less confident about whether pure LLMs will be disincentivized—for example, labs might stop developing CoT systems if inference-time compute requirements are too expensive—but I think labs will generally move more resources toward CoT systems.
  I think the second two points are best explained with an example, which might clarify how I’m approaching the question.
  Suppose I make two LLMs, GPT large (with more parameters) and GPT small (with fewer). I pre-train them on webtext and then I want to teach them how to do modular addition, so I create a bunch of synthetic data of input-output pairs like {6 + 2 mod 5, 3} and finetune the LLMs with the synthetic data to output a single answer, using the difference between their output and the answer as a loss function. GPT large becomes very good at this task, and GPT small does not.
  So I create a new dataset of input-output pairs like {Solve 6 + 2 mod 5 step-by-step, writing out your reasoning at each step. Plan your approach ahead of time and periodically reflect on your work., 3}. I train GPT small on this dataset, but when it gets the answer right I reward the entire chain of thought, not just the token with the answer. This approach incentivizes GPT small to use a CoT to solve the problem, and now it performs as well as GPT large did with regular finetuning.^[1]
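To make the two setups concrete, here is a minimal sketch of how the two datasets and the whole-chain reward could look. Everything in it is an illustrative assumption rather than an actual training recipe: the names `make_example`, `direct_dataset`, `cot_dataset`, and `outcome_reward`, the operand ranges, and the choice of a flat 0/1 reward are all hypothetical.

```python
import random

# Prompt template taken from the example above.
COT_INSTRUCTION = (
    "Solve {expr} step-by-step, writing out your reasoning at each step. "
    "Plan your approach ahead of time and periodically reflect on your work."
)


def make_example(rng, max_operand=20, max_modulus=10):
    """Return one (expression, answer) pair such as ("6 + 2 mod 5", 3)."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    m = rng.randint(2, max_modulus)
    return f"{a} + {b} mod {m}", (a + b) % m


def direct_dataset(n, seed=0):
    """Input-output pairs for plain finetuning (the GPT large setup)."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


def cot_dataset(n, seed=0):
    """The same problems, wrapped in the step-by-step prompt (the GPT small setup)."""
    return [
        (COT_INSTRUCTION.format(expr=expr), answer)
        for expr, answer in direct_dataset(n, seed)
    ]


def outcome_reward(cot_tokens, final_answer, target):
    """Assign one reward per CoT token, based only on the final answer.

    Assumption: a flat 1.0 for every token when the answer is right and
    0.0 otherwise. The reasoning steps themselves are never graded.
    """
    reward = 1.0 if final_answer == target else 0.0
    return [reward] * len(cot_tokens)
```

The point of `outcome_reward` is that the whole chain of thought is reinforced whenever the final answer is right, which is what pushes the smaller model toward reasoning in text rather than in the forward pass.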
  In the end, I have two equally capable (at modular arithmetic) systems—GPT large, which was trained only with finetuning, and GPT small, which was trained with finetuning + open-ended RL. I have a few claims here:
  - GPT small’s CoT is likely to reflect how it’s actually solving the problem. It couldn’t do the problems pre-RL, so we know it isn’t just solving them internally and backfilling a plausible explanation. We can prevent steganography by doing things like periodically paraphrasing the CoT or translating it between languages. We can also verify this by altering the CoT to plausible but incorrect explanations and ensuring that task performance is degraded.
  - For this reason, GPT small is much more interpretable, since we can look at the CoT to understand how it solved the problem. GPT large, on the other hand, is still a complete black box—we don’t know how it’s solving the problem. When we finetuned it, GPT large learned how to do these problems in a single forward pass, making it incredibly hard to understand its reasoning.
  - And for this reason, GPT small is also easier to align. We can monitor the CoT to make sure it’s actually doing modular arithmetic. In contrast, GPT large might be doing something that locally approximates modular arithmetic but behaves unpredictably outside the training distribution. In fact, if we deploy GPT small in out-of-distribution contexts (such as inputting negative numbers), the CoT will likely provide useful information about how it plans to adapt and what its problem-solving approach will be.
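The check in the first claim (alter the CoT and confirm that task performance degrades) can be sketched roughly as follows. This is a toy harness under stated assumptions, not an established evaluation: `answer_from_cot` is a hypothetical callable standing in for "run the model on this prompt with this chain of thought and read off its final answer", and `corrupt_last_number` is a deliberately crude stand-in for writing a plausible but incorrect explanation.

```python
import re


def accuracy(answer_from_cot, examples, edit=lambda cot: cot):
    """Fraction of examples answered correctly after applying `edit` to each CoT."""
    correct = sum(
        answer_from_cot(prompt, edit(cot)) == target
        for prompt, cot, target in examples
    )
    return correct / len(examples)


def corrupt_last_number(cot):
    """Replace the last integer in the CoT with that integer plus one."""
    matches = list(re.finditer(r"\d+", cot))
    if not matches:
        return cot
    last = matches[-1]
    return cot[: last.start()] + str(int(last.group()) + 1) + cot[last.end():]


def faithfulness_gap(answer_from_cot, examples):
    """Accuracy with the original CoT minus accuracy with the corrupted CoT."""
    clean = accuracy(answer_from_cot, examples)
    corrupted = accuracy(answer_from_cot, examples, corrupt_last_number)
    return clean - corrupted


# Two toy stand-ins for a model (hypothetical, for illustration only).
def reads_cot(prompt, cot):
    """Answers with whatever number the CoT ends on."""
    return int(re.findall(r"\d+", cot)[-1])


def ignores_cot(prompt, cot):
    """Solves the problem 'internally' and never looks at the CoT."""
    a, b, m = map(int, re.findall(r"\d+", prompt))
    return (a + b) % m


# Each example is (prompt, chain of thought, correct answer).
examples = [
    ("6 + 2 mod 5", "6 + 2 = 8. 8 mod 5 = 3", 3),
    ("4 + 9 mod 7", "4 + 9 = 13. 13 mod 7 = 6", 6),
]

print(faithfulness_gap(reads_cot, examples))    # large gap: the answer depends on the CoT
print(faithfulness_gap(ignores_cot, examples))  # no gap: the CoT is decorative
```

A large gap is the outcome I would hope to see for GPT small. A gap near zero would suggest the model is solving the problem internally and backfilling the explanation.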
  I am much more excited about building systems like GPT small than I am about building systems like GPT large. Do you disagree (or disagree about any subpoints, or about this example’s generality)?
  P.S. I am enjoying this discussion. I feel that you’ve been very reasonable, and I continue to be open to changing my mind about this if I see convincing evidence :)
  1. ^ Oversimplified obviously, but details shouldn’t matter here