For example, if human values are determined by human feedback, then we can forbid the AI from coercing the human in any way, or restrict it to using only certain methods (such as relaxed conversation).
It seems to me the natural way to do this is by looking for coherence among all the possible ways the AI could ask the relevant questions. This seems like something you’d want anyway, so if there’s a meta-level step at the beginning where you consult humans on how to consult their true selves, you’d get it for free. Maybe the meta-level step itself can be hacked through some sort of manipulation, but it seems at least harder (and probably there’d be multiple levels of meta).
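To make the "coherence among phrasings" idea a bit more concrete, here's a minimal sketch of one way it could work. Everything here is illustrative: the `ask_human` query channel (e.g. the relaxed-conversation method above) and the agreement threshold are assumptions I'm making up for the example, not anyone's actual proposal.

```python
from typing import Callable, Optional

def coherent_answer(
    phrasings: list[str],
    ask_human: Callable[[str], str],  # hypothetical query channel, e.g. relaxed conversation
    agreement_threshold: float = 0.9,  # illustrative cutoff, not a principled value
) -> Optional[str]:
    """Ask the same underlying question in many different phrasings and
    only accept an answer if the responses largely agree.

    Returns the majority answer when agreement exceeds the threshold,
    otherwise None (i.e. "no coherent answer emerged from this question").
    """
    answers = [ask_human(p) for p in phrasings]
    majority = max(set(answers), key=answers.count)
    agreement = answers.count(majority) / len(answers)
    return majority if agreement >= agreement_threshold else None
```

The meta-level step would then just be running the same kind of check over questions about how the questions themselves should be asked.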
See my next post—human values are contradictory, meta-values especially so.
It seems to me that the very fact that we're having conversations like this implies there's some meta level where we agree on the "rules of the game".