Natural selection (NS) has not optimized for alignment, which is why it’s bad at alignment compared to what it has optimized for.
I don’t think the “which is why” claim here is true, if you mean ‘this is the only reason’. ‘Alignment is exactly as easy as capabilities if you’re not myopic’ seems like a claim that needs to be argued for positively.
My answer would be the same for NS and humans: alignment is simply not optimized for! People spend vastly more resources on capabilities than on alignment.
NS didn’t optimize for humans to be good at biochemistry, nuclear physics, or chess, either. NS produces many things that it wasn’t specifically optimizing for. One of the main things that Nate is pointing out in the OP is that alignment isn’t on that list, even though a huge number of other things are. “NS doesn’t produce things it didn’t optimize for” is an overly general response, because it would rule out things like ‘humans landing on the Moon’.
If the ratio of resources invested in capabilities vs. alignment were reversed, would you still expect alignment to fare so much worse than capabilities?
This would obviously be an incredibly positive development, and would increase our success odds a ton! Nate isn’t arguing ‘when you actually try to do alignment, you can never make any headway’.
But ‘alignment is tractable when you actually work on it’ doesn’t imply ‘the only reason capabilities outgeneralized alignment in our evolutionary history was that evolution was myopic and therefore not able to do long-term planning aimed at alignment desiderata’.
Evolution was also myopic with respect to capabilities, and not able to do long-term planning aimed at capabilities desiderata; and yet capabilities generalized amazingly well, far beyond evolution’s wildest dreams. If you’re myopically optimizing for two things (‘make the agent want to pursue the intended goal’ and ‘make the agent capable at pursuing the intended goal’) and one generalizes vastly better than the other, this points toward a difference between the two myopically-optimized targets.
But ‘alignment is tractable when you actually work on it’ doesn’t imply ‘the only reason capabilities outgeneralized alignment in our evolutionary history was that evolution was myopic and therefore not able to do long-term planning aimed at alignment desiderata’.
I am not claiming evolution is ‘not able to do long-term planning aimed at alignment desiderata’. I am claiming it did not even try.
If you’re myopically optimizing for two things (‘make the agent want to pursue the intended goal’ and ‘make the agent capable at pursuing the intended goal’) and one generalizes vastly better than the other, this points toward a difference between the two myopically-optimized targets.
This looks like a strong steelman of the post, which I gladly accept.
But it seemed to me that the post was arguing:
1. That alignment is hard (it mentions that technical alignment contains the hard bits, multiple specific problems in alignment, etc.)
2. That current approaches do not work
That you do not get alignment by default looks like a much weaker thesis than 1&2, one that I agree with.
This would obviously be an incredibly positive development, and would increase our success odds a ton! Nate isn’t arguing ‘when you actually try to do alignment, you can never make any headway’.
This unfortunately didn’t answer my question. We all agree that it would be a positive development; my question was how much. From my point of view, it might even be enough.
The question I was trying to ask was: “What difficulty ratio do you see between alignment and capabilities?” I understood the post as making the claim (among others) that “alignment is much more difficult than capabilities, as evidenced by Natural Selection”.