Avoiding RLHF at best introduces an important overhang
But it would be better if we then collectively decided not to rush forward anyway, right?
And I still don’t get why you expect the future environment, where somewhat-aligned superhuman AIs are available, to be better for alignment work. Like, sure, an automatic idea generator and verifier may be useful, but it’s also useful for reckless people. And, intuitively, the more advanced an AI is, the less I would trust it. So “let’s try as hard as we can to advance civilization, because a more advanced civilization will be better at alignment” seems like a very risky plan.
But it would be better if we then collectively decided not to rush forward anyway, right?
Yes, that seems consistent with my post.
And I still don’t get why you expect the future environment, where somewhat-aligned superhuman AIs are available, to be better for alignment work. Like, sure, an automatic idea generator and verifier may be useful, but it’s also useful for reckless people. And, intuitively, the more advanced an AI is, the less I would trust it.
I mostly think that AI doing research will accelerate both risk and alignment, so we’re aiming for it to be roughly a wash.
But having nearly-risky AI to study seems incredibly important for doing good alignment work. I think this is a pretty robust bet.
So “let’s try as hard as we can to advance civilization, because a more advanced civilization will be better at alignment” seems like a very risky plan.
That’s not the plan. I’m saying to do the work that seems most useful for alignment even if it has modest capability benefits, and that for some kinds of capability benefits the apparent cost is less than you’d think because of these overhang effects.
I mostly think that AI doing research will accelerate both risk and alignment, so we’re aiming for it to be roughly a wash.
Yeah, I don’t understand why it would be a wash, when destructive capabilities are easier than alignment (humans already figured out nukes, but not alignment) and alignment is expected to be harder for more advanced AI. Even without straight misalignment risk, giving superhuman AI to the current civilization doesn’t sound like a stability improvement. So without a specific plan to stop everyone from misusing AI, it still sounds safer to solve alignment without anyone building nearly-risky AI.