I don’t disagree that there remains a lot of work to be done, I understand that CoT can be unfaithful, and I am generally against building very capable models that do CoT in latent space,[1] like the Meta paper does. Emphatically, I do not think “alignment is solved” just because o3 reasons out loud, or something.
But, in my view, the research that needs to happen between here and aligned AGI is much more tractable with a weak forward pass and RL-trained CoT than with a highly capable forward pass without RL. I can see an actual path forward to aligning an AGI that works like the o-series models, and, considering how recently this even became a research topic, I think the work that’s already been done is quite promising, including many of Daniel’s proposals.
This is a very general statement; there are lots of caveats and nuances, but I suspect we already agree on the broad strokes.
I think we are in agreement. I see a path to successful technical alignment through agentized LLMs/foundation models. The brief take is at Agentized LLMs will change the alignment landscape; there’s more on the several overlapping alignment techniques in Internal independent review for language model agent alignment, and on reasons to think they’ll advance rapidly at Capabilities and alignment of LLM cognitive architectures.
I think it’s very useful to have a faithful chain of thought, but I also think all isn’t lost without it.
Here’s my current framing: we are actually teaching LLMs mostly to answer questions correctly and to follow our instructions as we intended them. Optimizing those things too hard would probably cause our doom by creating interpretations of those pressures we didn’t intend or foresee; that’s the classical agent foundations worry, and I think it’s valid. But we won’t do infinite optimization all in one go; we’ll have a human-level agent that’s on our side to help us with its improvement or next generation, and so on. If we keep focusing on instruction-following as not only the training pressure but the explicit top-level goal we give it, we have the AGI on our side to anticipate unexpected consequences of optimization, and it will be “corrigible” as long as it’s approximately following instructions as intended. This gives us second chances and superhuman help maintaining alignment. So I think that alignment target is more important than the technical approach: Instruction-following AGI is easier and more likely than value aligned AGI