If I imagine each level A[n] as maximizing the expected value of some simple utility function, I agree that it would be surprising if the result were not one of your first three cases. Intuitively, either we already have all of the friendly utility function, in which case we didn’t need induction, or we don’t have it and bad things happen, which correspond to cases 1 and 3 respectively.
But it seems like one of the main points of iterated amplification is that at least the initial levels need not be maximizing the expected value of some simple utility. In that case, there seems to be a much wider space of possible designs.
For example, we could have a system that has the epistemic state of wanting to help humans but knowing that it doesn’t know how best to do that, and so asking humans for feedback and deferring to them when appropriate. Such a system with amplification might eventually learn the friendly utility function and start maximizing that, but it seems like there could be many iterations before that point, during which it is corrigible in the sense of deferring to humans and not maximizing its current conception of what is best.
I don’t have a strong sense at the moment what would happen, but it seems plausible that the induction will go through and will have “actually mattered”.