Hmm, I somehow never saw this reply, sorry about that.
> you get something like Paul’s going out with a whimper where our easy-to-specify values win out over our other values [...] it’s very important that your AGI not be better at optimizing some of your values over others, as that will shift the distribution of value/resources/etc. away from the real human preference distribution that we want.
Why can’t we tell it not to overoptimize the aspects that it understands until it figures out the other aspects?
> value-neutrality verification isn’t just about strategy-stealing: it’s also about inner alignment, since it could help you separate optimization processes from objectives in a natural way that makes it easier to verify alignment properties (such as compatibility with strategy-stealing, but also possibly corrigibility) on those objects.
As you (now) know, my main crux is that I don’t expect optimization and objectives to be cleanly separable. I’m also unsure whether value-neutral optimization is even a sensible concept when taken separately from the environment in which the agent is acting (see this comment).