For example, if human values are determined by human feedback, then we can forbid the AI from coercing the human in any way, or restrict it to using only certain methods (such as relaxed conversation).
It seems to me the natural way to do this is by looking for coherence among all the possible ways the AI could ask the relevant questions. This seems like something you’d want anyway, so if there’s a meta-level step at the beginning where you consult humans on how to consult their true selves, you’d get it for free. Maybe the meta-level step itself can be hacked through some sort of manipulation, but it seems at least harder (and probably there’d be multiple levels of meta).
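To make the "coherence among phrasings" idea a bit more concrete, here's a minimal sketch of one way it could work. Everything here is illustrative: the `ask_human` query channel (e.g. the relaxed-conversation method above) and the agreement threshold are assumptions I'm making up for the example, not anyone's actual proposal.

```python
from typing import Callable, Optional

def coherent_answer(
    phrasings: list[str],
    ask_human: Callable[[str], str],  # hypothetical query channel, e.g. relaxed conversation
    agreement_threshold: float = 0.9,  # illustrative cutoff, not a principled value
) -> Optional[str]:
    """Ask the same underlying question in many different phrasings and
    only accept an answer if the responses largely agree.

    Returns the majority answer when agreement exceeds the threshold,
    otherwise None (i.e. "no coherent answer emerged from this question").
    """
    answers = [ask_human(p) for p in phrasings]
    majority = max(set(answers), key=answers.count)
    agreement = answers.count(majority) / len(answers)
    return majority if agreement >= agreement_threshold else None
```

The meta-level step would then just be running the same kind of check over questions about how the questions themselves should be asked.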
See my next post—human values are contradictory, meta-values especially so.
It seems to me that the very fact that we're having conversations like this implies there's some meta level where we agree on the "rules of the game".