Reasoning about utility functions, i.e. restricting attention from the whole of mindspace to its consequentialist subspace, seems a misstep: slightly changing a utility function tends to change alignment a lot, while slightly changing a deontological injunction might not, which would make the broader mindspace easier for us to hill-climb.
Perhaps we should have some mathematical discussion of utility-function space, mindspace, its consequentialist subspace, the injection Turing machines → mindspace, the function mindspace → alignment, how well that function can be optimized, properties of the above that make for good lemmas (such as continuity), mindspace modulo equal utility functions, etc.
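To seed that discussion, here is a minimal formal sketch of the objects listed above. Every symbol, and the choice of [0,1] as the alignment codomain, is my own assumption rather than established notation:

```latex
% Hypothetical formalization -- all names below are my own notation.
\begin{align*}
  &\mathcal{M} \;\text{(mindspace)}, \qquad
   \mathcal{U} \;\text{(utility-function space)},\\
  &c : \mathcal{U} \to \mathcal{M}, \qquad
   C := \operatorname{im}(c) \subseteq \mathcal{M}
   \;\text{(the consequentialist subspace)},\\
  &\iota : \mathrm{TM} \hookrightarrow \mathcal{M}
   \;\text{(the injection from Turing machines)},\\
  &a : \mathcal{M} \to [0,1]
   \;\text{(alignment; the codomain is an assumption)},\\
  &m \sim m' \iff U_m = U_{m'}, \qquad
   \mathcal{M}/{\sim}
   \;\text{(mindspace modulo equal utility functions,}\\
  &\qquad\qquad\text{where } U_m \text{ is the utility function mind } m
   \text{ implements, if any)}.
\end{align*}
% The fragility worry is then a continuity claim: the restriction a|_C
% (equivalently a composed with c) has a large or unbounded modulus of
% continuity with respect to a metric on U, whereas along "deontological"
% directions in M the alignment function may be closer to Lipschitz,
% which is what would make hill-climbing viable there.
```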
Aaand I’ve started it. What shape has mindspace?
My gut response is that hill-climbing is itself consequentialist, so this doesn't really help with fragility of value; if you get the hill-climbing direction slightly wrong, you'll still end up somewhere very wrong. On the other hand, Paul's approach rests on something we could call a deontological approach to the hill-climbing part (i.e., amplification steps do not rely on throwing more optimization power at a pre-specified function).
We are the ones doing the hill-climbing, and implementing other object-level strategies does not help. Paul proposes something, we estimate the design's alignment, he tweaks the design to improve it. That's the hill-climbing I mean.
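For concreteness, a toy sketch of that loop. Everything here is hypothetical (the names `propose_tweak` and `estimate_alignment`, the quadratic stand-in landscape, the noise model); the only point it makes is that the thing being optimized is our estimate of alignment, not alignment itself:

```python
# Toy sketch of the propose / estimate / tweak loop described above.
# All names and the stand-in landscape are hypothetical illustrations,
# not anyone's actual procedure.
import random

def estimate_alignment(design: list[float]) -> float:
    """Stand-in for our fallible judgment of a design's alignment."""
    true_score = -sum(x * x for x in design)    # unknown true landscape
    return true_score + random.gauss(0.0, 0.1)  # our estimate is noisy

def propose_tweak(design: list[float]) -> list[float]:
    """Stand-in for Paul proposing a small modification to the design."""
    return [x + random.gauss(0.0, 0.05) for x in design]

def hill_climb(design: list[float], steps: int = 1000) -> list[float]:
    best_estimate = estimate_alignment(design)
    for _ in range(steps):
        candidate = propose_tweak(design)
        candidate_estimate = estimate_alignment(candidate)
        # We keep whichever tweak *looks* better to us; to whatever extent
        # our estimates are off, we are climbing the wrong hill.
        if candidate_estimate > best_estimate:
            design, best_estimate = candidate, candidate_estimate
    return design

print(hill_climb([1.0, -0.5, 0.25]))
```

Nothing depends on the specifics; the sketch just makes explicit where the gap between estimated and actual alignment enters the loop.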