My gut response is that hill-climbing is itself consequentialist, so this doesn’t really help with fragility of value: if you get the hill-climbing direction slightly wrong, you’ll still end up somewhere very wrong. On the other hand, Paul’s approach rests on what we could call a deontological approach to the hill-climbing part (i.e., amplification steps do not rely on throwing more optimization power at a pre-specified function).
We are the ones doing the hill-climbing, and implementing other object-level strategies does not help. Paul proposes a design, we estimate its alignment, and he tweaks the design to improve it. That’s the hill-climbing I mean.