I’m frustrated with the meme that “mesa-optimization/pseudo-alignment is a robustness (i.e. out-of-distribution, OOD) problem”. IIUC, this is definitionally true in the mesa-optimization paper, but I think this misses the point.
In particular, this seems to exclude an important (maybe the most important) threat model: the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.
This is exactly how I imagine a treacherous turn by a boxed superintelligent AI agent occurring, for instance. It secretly begins breaking out of the box (e.g. via manipulating humans) and we don’t notice until it’s too late.
“the AI understands how to appear aligned, and does so, while covertly pursuing its own objective on-distribution, during training.”
Sure, but the fact that it defects in deployment and not in training is a consequence of distributional shift, specifically the shift from a situation where it can’t break out of the box to a situation where it can.
No, I’m talking about it breaking out during training. The only “shifts” here are:
1) the AI gets smarter
2) (perhaps) the AI covertly influences its external environment (i.e. breaks out of the box a bit).
We can imagine scenarios where it’s only (1) and not (2). I find them a bit more far-fetched, but this is the classic vision of the treacherous turn: the AI makes a plan, and then suddenly executes it to attain a decisive strategic advantage (DSA). Once it starts to execute, of course there is distributional shift, but:
A) it is auto-induced distributional shift
B) the developers never decided to deploy it.