Self-modelling in neural networks: https://arxiv.org/pdf/2407.10188 Is this good news for mech interpretability? If the model makes itself easily predictable, that really seems to limit the possibilities for deceptive alignment.
It makes interpretability easier, but consider this: the human brain does the same thing. When we conform to expectations, we model ourselves and make ourselves more predictable, yet that doesn't prevent deception. People still lie, and some of the deception is simply pushed into the subconscious.
Sure, it doesn't prevent a deceptive model from being made, but if AI engineers built NNs with this kind of self-modeling at all levels from the ground up, that wouldn't happen in their models. The encouraging thing, if it holds up, is that there is little to no "alignment tax" for making the models understandable: they also perform better.
Indeed, engineering readability at multiple levels may solve this.
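For concreteness, here is a minimal sketch of what a self-modeling auxiliary objective can look like, in the spirit of the linked paper (this is not the authors' code; the architecture, layer sizes, and `aux_weight` are illustrative assumptions): the network gets an extra head trained to predict its own hidden activations, and that prediction error is added to the task loss.

```python
# Sketch of a self-modeling auxiliary objective, in the spirit of arXiv:2407.10188.
# Not the authors' implementation: sizes, loss weight, and layer choices are assumptions.
import torch
import torch.nn as nn


class SelfModelingMLP(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=128, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden_dim, n_classes)
        # Auxiliary head: the network predicts its own hidden activations.
        self.self_model = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        h = self.backbone(x)
        logits = self.classifier(h)
        predicted_h = self.self_model(h)
        return logits, predicted_h, h


def combined_loss(logits, predicted_h, h, targets, aux_weight=0.1):
    # Primary task loss plus a penalty for failing to predict one's own activations.
    # The target h is NOT detached here, so the auxiliary loss also reshapes the
    # activations themselves, nudging the network toward internal states that are
    # easier to predict (the "self-regularizing" effect the paper describes).
    task_loss = nn.functional.cross_entropy(logits, targets)
    self_model_loss = nn.functional.mse_loss(predicted_h, h)
    return task_loss + aux_weight * self_model_loss


# Usage sketch on dummy data:
model = SelfModelingMLP()
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
logits, pred_h, h = model(x)
loss = combined_loss(logits, pred_h, h, y)
loss.backward()
```

The "at all levels" idea above would presumably attach a self-model head like this to every layer rather than just one; whether to detach the target activations is a design choice that changes how strongly the predictability pressure feeds back into the representation.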