Eliezer’s 2004 “coherent extrapolated volition” (CEV) proposal is probably immune to Goodharting because immunity to Goodharting was one of the main criteria for its creation: Eliezer arrived at the design by deliberately searching for something immune to it. It may well be that every other published proposal for aligning superintelligent AI is vulnerable to Goodharting.
Goodhart’s law says, roughly, that if we put too much optimization pressure on criterion X, then as a side effect the optimization process drives other criteria we also care about, Y and Z, far outside the range we consider reasonable. But that failure mode doesn’t apply when criterion X is “everything we value” or “the reflective equilibrium of everything we value”.
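To make the mechanism concrete, here is a minimal toy sketch of my own (not part of the original comment or of CEV): an agent chooses actions to maximize a single proxy X, but the same actions also move Y and Z, so increasing the optimization pressure on X drags Y and Z arbitrarily far from any reasonable range.

```python
import numpy as np

# Toy illustration of Goodhart's law (hypothetical setup, not from the source):
# an agent picks an action vector to maximize one proxy metric X, but the same
# actions also move other criteria Y and Z that we care about.
rng = np.random.default_rng(0)
w_x = rng.normal(size=10)  # how actions affect the proxy X
w_y = rng.normal(size=10)  # how actions affect criterion Y
w_z = rng.normal(size=10)  # how actions affect criterion Z

def optimize_proxy(pressure):
    """Pick the action that maximizes X alone, given an effort budget ("pressure")."""
    a = pressure * w_x / np.linalg.norm(w_x)  # best action for X under a norm constraint
    return w_x @ a, w_y @ a, w_z @ a

for pressure in [1, 10, 100]:
    x, y, z = optimize_proxy(pressure)
    print(f"optimization pressure {pressure:>3}: X={x:8.1f}  Y={y:8.1f}  Z={z:8.1f}")

# As the pressure on X grows, Y and Z are dragged arbitrarily far from whatever
# range we'd consider reasonable -- Goodhart's law in miniature. If the objective
# instead included everything we value, there would be no neglected Y or Z to wreck.
```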
The problem, of course, is that although the CEV plan is probably within human capability to implement (and IMHO Scott Garrabrant’s work is probably a step forward), unaligned AI is probably significantly easier to build, so it will likely arrive first.