I guess I’m considerably more optimistic about avoiding AI takeover even without humans understanding what the models are thinking.
Basically this. I am a lot more pessimistic about black-box alignment than I am about white-box alignment.
FWIW, white-box alignment doesn’t imply humans understand what the models are thinking. There are other ways to leverage the fact that we have access to the internals.
I was using it as a synonym for alignment with interpretability, as opposed to alignment without it.