Good point. I can see a couple main ways to give an AI slack in its pursuit of optimization:
Set the level of precision on the various features of the world that it optimizes for. Lower precision on a feature would then mean the AI puts forth less effort toward optimizing that feature and puts up less of a fight against other agents (i.e., humans) who try to steer it away from the optimum. The AI might also have a goal-generating module that predicts what level of utility each candidate goal would produce. Lower precision would then mean it samples goal states from a broader region of latent space and is therefore willing to settle for the suboptimal goals that result (first sketch below).
Seek Pareto improvements. When the AI is optimizing simultaneously for an entire ensemble of human preferences (or heuristics of those preferences) with no strict ordering among them, it would only fight for changes that improve along at least one dimension without worsening any other. World states on the same Pareto frontier would be equivalent from its perspective, and it could then work with humans who want to improve things according to their hidden preferences by just constraining any changes to lie on that same manifold (second sketch below).
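To make the first idea concrete, here's a minimal sketch of precision-scaled goal sampling. Everything in it is a hypothetical stand-in: `decode` for the goal-generating module, `estimate_utility` for its utility predictor, and the particular way `precision` sets both the sampling spread and the acceptable utility shortfall is just one arbitrary choice, not a worked-out proposal.

```python
import numpy as np

def sample_acceptable_goal(decode, estimate_utility, z_opt, precision,
                           n_samples=256, seed=0):
    """Sample a 'good enough' goal instead of insisting on the optimum.

    Lower `precision` both widens the latent region sampled from and
    loosens the utility threshold that counts as acceptable, so the
    agent settles for (and defends) weaker optima. `decode` and
    `estimate_utility` are hypothetical stand-ins for the
    goal-generating module and its utility predictor.
    """
    rng = np.random.default_rng(seed)
    u_opt = estimate_utility(decode(z_opt))  # predicted utility of the nominal optimum
    spread = 1.0 / precision                 # broader sampling region at low precision
    slack = 1.0 / precision                  # larger acceptable utility shortfall

    for _ in range(n_samples):
        z = z_opt + rng.normal(scale=spread, size=z_opt.shape)
        goal = decode(z)
        if estimate_utility(goal) >= u_opt - slack:
            return goal                      # good enough; no need to push harder
    return decode(z_opt)                     # nothing in the region cleared the bar
```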
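And a minimal sketch of the second idea: the agent only treats a change as worth fighting when the status quo strictly Pareto-dominates it, so human-proposed trade-offs along the frontier pass through unopposed. The score vectors and function names are hypothetical, not any particular system's API.

```python
import numpy as np

def dominates(a, b, eps=1e-9):
    """True if outcome `a` Pareto-dominates `b`: at least as good on every
    preference dimension and strictly better on at least one."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a >= b - eps) and np.any(a > b + eps))

def agent_accepts(current, proposal):
    """Resist a change only if the status quo Pareto-dominates it.

    Proposals that dominate the status quo, or that merely trade one
    preference dimension off against another (equivalent from the
    agent's perspective), go through unopposed."""
    return not dominates(current, proposal)

# Hypothetical preference scores over three dimensions.
status_quo = [0.6, 0.4, 0.7]
print(dominates([0.6, 0.5, 0.7], status_quo))      # True: the agent would push for this itself
print(agent_accepts(status_quo, [0.5, 0.6, 0.7]))  # True: a trade-off, so no resistance
print(agent_accepts(status_quo, [0.5, 0.4, 0.7]))  # False: strictly worse, so resisted
```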