Great post, thank you!
However, I think I don’t quite understand the distinction between inner alignment and outer alignment, as they’re being used here. In particular, why would the possible malignity of the universal prior be an example of outer alignment rather than inner?
I was thinking of outer alignment as being about whether, if a system achieves its objective, that is what you wanted, whereas inner alignment was about whether the system is secretly optimizing for something other than the stated objective in the first place.
From that perspective, wouldn’t malignity in the universal prior be a classic example of inner misalignment? You wanted unbiased prediction, and if you had gotten it, that would have been good (so it’s outer-aligned). But instead you got something that looked like a predictor up to a point and then turned out to be an optimizer (inner misalignment).
Have I misunderstood outer or inner alignment, or what malignity of the universal prior would mean?
The way I’m using outer alignment here is to refer to outer alignment at optimum. Under that definition, optimal loss on a predictive objective should require doing something like Bayesian inference on the universal prior, making the question of outer alignment in such a case basically just the question of whether Bayesian inference on the universal prior is aligned.
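For concreteness, here is a minimal sketch of what “Bayesian inference on the universal prior” standardly amounts to, using the Solomonoff formulation (the symbols M, U, and p are illustrative notation, not from the comments above). The universal prior weights each program p for a universal prefix machine U by 2^{-|p|}, summed over the programs whose output begins with the observed string x:

\[ M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|} \]

and prediction is just Bayesian conditioning on the observed prefix:

\[ M(x_{t+1} \mid x_{1:t}) = \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})} \]

The malign-prior concern is that the short programs dominating this sum may themselves simulate worlds containing consequentialist reasoners, so even a predictor achieving optimal loss under this objective can inherit their influence.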
I see. So, restating in my own terms: outer alignment is indeed about whether getting what you asked for is good, but in the case of prediction, the malign universal prior argument says that “perfect” prediction is actually malign. That would be a case of getting what you wanted / asked for / optimized for, and it nevertheless not being good, so it is an outer alignment failure.
Whereas an inner alignment failure would necessarily involve not hitting optimal performance at your objective. (Otherwise it would be an inner alignment success, and an outer alignment failure.)
Is that about right?
Yep—at least that’s how I’m generally thinking about it in this post.
Got it, thank you!