Nate’s view here seems similar to “To do cutting-edge alignment research, you need to do enough self-reflection that you might go crazy”. This seems really wrong to me. (I’m not sure if he means all scientific breakthroughs require this kind of reflection, or if alignment research is special).
I don’t think many top scientists are crazy, especially not in a POUDA way. I don’t think top scientists have done a huge amount of self-reflection/philosophy.
On the other hand, my understanding is that some rationalists have driven themselves crazy via too much self-reflection in an effort to become more productive. Perhaps Nate is overfitting to this experience?
“Just do normal incremental science; don’t try to do something crazy” still seems like a good default strategy to me (especially for an AI).
Thanks for this write up; it was unusually clear/productive IMO.
(I’m worried this comment comes off as mean or reductive. I’m not trying to be. Sorry)
In my experience, just realising how high the stakes are and how unprepared we are could be enough to put a strain on someone’s mental health.
Some top scientists are crazy enough that it would be disastrous to give them absolute power.
I mostly agree with Holden, but think he’s aiming to use AIs with more CIS than is needed or safe.
Regarding power differentials, one of my go-to examples of real-world horrific misalignment is humans’ relationship to the rest of the animal kingdom, and the unfortunate fact that as humans gained more power via science and capitalism, things turned massively negative for animals. Science and capitalism didn’t create these negative impacts (they’ve been around since the dawn of humanity), but they supercharged them into S-risks and X-risks for animals. The alignment mechanisms that imperfectly align intraspecies relations don’t exist at all in the interspecies case, which lends at least some support to the thesis that alignment will not happen by default.
Now, this part is less epistemically sound than the first part, but my own theory of why alignment fails in the interspecies case basically boils down to the following:
Alignment, at least right now, can only happen when capability differentials are fairly limited, and this is roughly what distinguishes intraspecies from interspecies differences: the capability gap that comes from being a different species is far more heavy-tailed and far larger than the differences within the same species.
Note that I haven’t made any claim about how difficult alignment turns out to be, only that it probably won’t be achieved by default.