I’m pretty unconvinced by this. I do not think that any substantial fraction of AI x-risk comes from an alignment researcher who thinks carefully about x-risk deciding that a GPT-3-level system isn’t scary enough to warrant significant precautions around boxing.
I think taking frivolous risks is bad, but risk aversion to the point of not being able to pursue otherwise promising research directions seems pretty costly, while the benefit of averting a risk on the order of 1e-9 is pretty negligible in comparison.
(To be clear, this argument does not apply to more powerful systems! As systems get smarter we should be more careful, and try to be very conservative! But ultimately everything is a trade-off: letting GPT-3 talk to human contractors who give feedback is a way of letting it out of the box!)
I just want the trade-off to be made explicit. If it turns out that −7 people in expectation is better than thinking about utility functions and all other alternatives, fine. But that is an argument that depends on actual numbers (a toy version of the comparison is sketched below). Yes, it’s possible to think informally and correctly. But maybe “an alignment researcher who thinks carefully about x-risk” wasn’t what was happening.
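For concreteness, here is the sort of back-of-the-envelope expected-value comparison I have in mind. Every number below is a made-up placeholder (the escape probability, the value of the research, the value of the alternative), not an estimate anyone in this thread has actually defended; the point is only that the comparison can be written down.

```python
# Toy expected-value comparison for "run the RLHF experiment now" vs.
# "spend the time on theory (e.g. utility functions) instead".
# Every number is a hypothetical placeholder, for illustration only.

# Option A: run the experiment, letting the model interact with contractors.
p_escape = 1e-9          # assumed chance a GPT-3-level system gets out of the box
escape_cost = 8e9        # assumed lives at stake if it does
direct_harm = 7          # the "-7 people in expectation" from other channels of harm
research_value = 100     # assumed expected lives saved by the resulting alignment progress

ev_run = research_value - direct_harm - p_escape * escape_cost

# Option B: don't run it; do the alternative research instead.
alternative_value = 50   # assumed expected lives saved by the alternative work

ev_abstain = alternative_value

print(f"EV(run)     = {ev_run:.2f}")
print(f"EV(abstain) = {ev_abstain:.2f}")
print("run the experiment" if ev_run > ev_abstain else "do the alternative research")
```

With these particular placeholders the 1e-9 term contributes about 8 expected lives, the same order of magnitude as the −7 figure, which is exactly why the numbers need to be stated rather than assumed away.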
“To be clear, this argument does not apply to more powerful systems!”
Before running InstructGPT, what was the technical reason to think it wouldn’t be powerful?