My impression is that in full generality it is unsolvable, but something like starting with an imprecise model of approval / a utility function learned via ambitious value learning, and restricting explanations/questions/manipulation by that, may work.
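To make the idea a bit more concrete, here is a minimal toy sketch of "restricting by a learned approval model": candidate explanations/questions are filtered by a rough, imprecise approval scorer rather than by the full human value function. All names here (`approval_model`, `filter_by_approval`, the threshold) are hypothetical illustrations, not an existing API or a worked-out proposal.

```python
from typing import Callable, List


def filter_by_approval(
    candidates: List[str],
    approval_model: Callable[[str], float],  # imprecise scorer learned via value learning (assumed)
    threshold: float = 0.0,
) -> List[str]:
    """Keep only explanations/questions the learned approval model rates above a threshold.

    The point is that the filter does not need full human values to be defined;
    a rough approval signal only has to be good enough to rule out manipulative outputs.
    """
    return [c for c in candidates if approval_model(c) >= threshold]


if __name__ == "__main__":
    # Stand-in approval model for illustration: penalize obviously manipulative candidates.
    def toy_approval(text: str) -> float:
        return -1.0 if "manipulate" in text else 1.0

    candidates = [
        "Here is the evidence for and against option A.",
        "I will manipulate you into picking option A.",
    ]
    print(filter_by_approval(candidates, toy_approval))
```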
Yep. As is so often the case, I think these things are not fully value-agnostic, but they don't need full human values to be defined either.