Hmm, I actually don’t think this is uncontroversial if by ‘interpretability’ you mean mechanistic interpretability. I think there’s a pretty plausible argument that doing anything other than running your AI (and training it) will end up being irrelevant. And this argument could extend to thinking that the expected value of working on (mechanistic) interpretability is considerably lower than that of working in other domains.
Conditional on this argument being correct, my response is that I would start advocating for simply slowing down or even stopping AI, because this is a world where we have to get very lucky to succeed at aligning AI.
I guess I’m considerably more optimistic about avoiding AI takeover without humans understanding what the models are thinking. (Or possibly you’re more optimistic about slowing down AI.)
Basically this. I am a lot more pessimistic about black box alignment than I am about white box alignment.
FWIW, white box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I was using "white box alignment" as a synonym for alignment with interpretability, as opposed to alignment without it.
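(For concreteness, a minimal sketch of what "leveraging access to the internals without humans understanding what the model is thinking" could look like in practice: probe-based monitoring. Everything below is illustrative and synthetic; the dimensions, the planted "concept" direction, and the data are assumptions for the sake of the example, not anyone's actual proposal.)

```python
# Illustrative sketch: train a linear probe on hidden activations to flag a
# property of interest, then use the probe as a runtime monitor. No human ever
# has to interpret what the underlying features "mean"; we only exploit the
# fact that we can read the internals at all.
# The "activations" here are synthetic: random vectors with a planted direction
# standing in for a real model's residual stream.

import numpy as np

rng = np.random.default_rng(0)
d = 64           # activation dimensionality (stand-in for a model's hidden size)
n = 2000         # number of labeled examples

# Synthetic data: a hidden "concept" direction is planted in the positive examples.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
activations = rng.normal(size=(n, d)) + np.outer(labels, direction) * 2.0

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    logits = activations @ w + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    grad_logits = (probs - labels) / n     # gradient of mean cross-entropy w.r.t. logits
    w -= lr * (activations.T @ grad_logits)
    b -= lr * grad_logits.sum()

# Use the probe as a monitor: flag any activation whose score is high, even
# though no human has interpreted the circuit that produced it.
test_act = rng.normal(size=d) + direction * 2.0
score = 1.0 / (1.0 + np.exp(-(test_act @ w + b)))
print(f"probe score on a planted example: {score:.3f}")
```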