Rohin Shah comments on Formal Solution to the Inner Alignment Problem

Rohin Shah 19 Feb 2021 1:51 UTC
LW: 20 AF: 15
While I share your position that this mostly isn’t addressing the things that make inner alignment hard / risky in practice, I agree with Vanessa that this does not assume the inner alignment problem away, unless you have a particularly contorted definition of “inner alignment”.
There’s an optimization procedure (Bayesian updating) that is selecting models (the model of the demonstrator) that can themselves be optimizers, and you could get the wrong one (e.g. the model that simulates an alien civilization that realizes it’s in a simulation and predicts well in order to be selected by the Bayesian updating, but eventually executes a treacherous turn). The algorithm presented prevents this from happening with some probability. We can debate the significance, but it seems to me like it is clearly doing something solution-like with respect to the inner alignment problem.
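To make the failure mode concrete, here is a minimal toy sketch, not the algorithm from the post. Everything in it is an assumption made for illustration: the two hypotheses `honest` and `treacherous`, the turn step `turn_at`, the threshold `alpha`, and the deferral check `plausible_models_disagree`. It shows that Bayesian updating cannot separate a hypothesis that mimics the demonstrator until it defects, and that one possible safeguard is to act only when all sufficiently plausible hypotheses agree.

```python
def honest(t):
    # Hypothetical model: probability the demonstrator takes action 1 at step t.
    return 0.1

def treacherous(t, turn_at=5):
    # Hypothetical model: copies `honest` until step `turn_at`, then defects.
    return 0.1 if t < turn_at else 0.9

def bayes_update(posterior, models, t, action):
    # Weight each hypothesis by the likelihood of the observed action, then renormalize.
    unnorm = {}
    for name, weight in posterior.items():
        p1 = models[name](t)
        unnorm[name] = weight * (p1 if action == 1 else 1.0 - p1)
    total = sum(unnorm.values())
    return {name: w / total for name, w in unnorm.items()}

def plausible_models_disagree(posterior, models, t, alpha=0.1, tol=1e-9):
    # Assumed safeguard: compare predictions of every hypothesis whose
    # posterior is at least `alpha` times the largest posterior.
    top = max(posterior.values())
    preds = [models[n](t) for n, w in posterior.items() if w >= alpha * top]
    return max(preds) - min(preds) > tol

models = {"honest": honest, "treacherous": treacherous}
posterior = {"honest": 0.5, "treacherous": 0.5}

# Observe the demonstrator taking action 0 for five steps.
for t in range(5):
    posterior = bayes_update(posterior, models, t, action=0)

# The two hypotheses made identical predictions, so the data did not separate them.
print(posterior)
# At the turn step they finally disagree, which is the signal to defer
# (for example, by asking the demonstrator instead of acting).
print(plausible_models_disagree(posterior, models, t=5))
```

In this toy setup the treacherous hypothesis keeps its full prior weight right up to the turn, so the protection comes entirely from the disagreement check rather than from the updating itself.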
- TurnTrout 19 Feb 2021 2:40 UTC
  LW: 5 AF: 5
  AF Parent
  “civilization that realizes it’s in a civilization”
  “in a simulation”, no?
  - Rohin Shah 19 Feb 2021 4:32 UTC
    LW: 4 AF: 4
    AF Parent
    Lol yes fixed
- evhub 19 Feb 2021 5:15 UTC
  LW: 4 AF: 3
  AF Parent
  Hmmm… I think I just misunderstood the setup then. It seemed to me like the Bayesian updating was supposed to represent the model rather than the training process, but under the framing you just described, I agree that it’s relevant. It’s worth noting that I do think most of the danger lies in the ways in which gradient descent is not a perfect Bayesian, but I agree that modeling gradient descent as Bayesian is certainly relevant.