While I share your position that this mostly isn’t addressing the things that make inner alignment hard / risky in practice, I agree with Vanessa that this does not assume the inner alignment problem away, unless you have a particularly contorted definition of “inner alignment”.
There’s an optimization procedure (Bayesian updating) that is selecting models (the model of the demonstrator) that can themselves be optimizers, and you could get the wrong one (e.g. the model that simulates an alien civilization that realizes it’s in a simulation and predicts well to be selected by the Bayesian updating but eventually executes a treacherous turn). The algorithm presented precludes this from happening with some probability. We can debate the significance, but it seems to me like it is clearly doing something solution-like with respect to the inner alignment problem.
Hmmm… I think I just misunderstood the setup then. It seemed to me like the Bayesian updating was supposed to represent the model rather than the training process, but under the framing that you just said, I agree that it’s relevant. It’s worth noting that I do think most of the danger lies in the ways in which gradient descent is not a perfect Bayesian, but I agree that modeling gradient descent as Bayesian is certainly relevant.
While I share your position that this mostly isn’t addressing the things that make inner alignment hard / risky in practice, I agree with Vanessa that this does not assume the inner alignment problem away, unless you have a particularly contorted definition of “inner alignment”.
There’s an optimization procedure (Bayesian updating) that is selecting models (the model of the demonstrator) that can themselves be optimizers, and you could get the wrong one (e.g. the model that simulates an alien civilization that realizes it’s in a simulation and predicts well to be selected by the Bayesian updating but eventually executes a treacherous turn). The algorithm presented precludes this from happening with some probability. We can debate the significance, but it seems to me like it is clearly doing something solution-like with respect to the inner alignment problem.
“in a simulation”, no?
Lol yes fixed
Hmmm… I think I just misunderstood the setup then. It seemed to me like the Bayesian updating was supposed to represent the model rather than the training process, but under the framing that you just said, I agree that it’s relevant. It’s worth noting that I do think most of the danger lies in the ways in which gradient descent is not a perfect Bayesian, but I agree that modeling gradient descent as Bayesian is certainly relevant.