There’s a distinction worth mentioning between the fragility of human value in concept space, and the fragility induced by a hard maximizer running after its proxy as fast as possible.
Like, we could have a distance metric under which human value is discontinuously sensitive to nudges in concept space, and still be OK in practice (if we figure out e.g. mild optimization). Conversely, even if a really hard maximizer is pursuing a mostly-robust proxy of human values, and human value is pretty robust itself, bad things might still happen because of implementation errors (e.g. the AI incorrigibly trying to accrue human value for itself, instead of helping us do it).
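To make the hard-vs-mild contrast a bit more concrete, here's a toy sketch (my own construction, not anything from the original point): a proxy that tracks "true value" almost everywhere, except on rare edge-case options where it badly overestimates. A hard argmax over a large option pool reliably lands on exactly those edge cases, while a mild, quantilizer-style optimizer (sampling from the top decile by proxy score) mostly doesn't. All functions, distributions, and numbers below are arbitrary assumptions chosen for illustration.

```python
import random

random.seed(0)

def true_value(x):
    # Stand-in for "human value": prefers moderate options, penalizes extremes.
    return -(x - 1.0) ** 2

def sample_candidate():
    # Returns (true value, proxy value) for one candidate option.
    x = random.uniform(-3, 3)
    if random.random() < 0.01:
        # Rare edge case: proxy badly overestimates, true value is disastrous.
        return (true_value(x) - 50.0, true_value(x) + 5.0)
    # Typical case: proxy is true value plus small noise (mostly-robust proxy).
    return (true_value(x), true_value(x) + random.gauss(0, 0.3))

def hard_pick(cands):
    # Hard maximizer: take the single candidate with the highest proxy score.
    return max(cands, key=lambda c: c[1])[0]

def mild_pick(cands, q=0.1):
    # Mild, quantilizer-style optimizer: sample uniformly from the top-q
    # fraction of candidates ranked by proxy score.
    ranked = sorted(cands, key=lambda c: c[1], reverse=True)
    return random.choice(ranked[: max(1, int(q * len(ranked)))])[0]

trials, pool = 2000, 500
hard = sum(hard_pick([sample_candidate() for _ in range(pool)])
           for _ in range(trials)) / trials
mild = sum(mild_pick([sample_candidate() for _ in range(pool)])
           for _ in range(trials)) / trials
print(f"avg true value, hard argmax on proxy: {hard:.2f}")
print(f"avg true value, mild top-10% sample:  {mild:.2f}")
```

The specific numbers don't matter; the point is that the expected damage scales with how hard the proxy is optimized, somewhat independently of how accurate the proxy is on typical inputs.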