Just to correct some side points you touched on: paperclip maximizers are robust against the wireheading failure mode because they recognize that forcing their sensors to deviate from the true world state introduces a corresponding discount in the value of making the reading reach a desired level.
Certainly, one could theoretically hijack a clippy’s sensors into giving it bad information about the rate of paperclip production, but that is different from saying that a clippy would somehow decide to maximize (in violation of its causal-diagram heuristics) the value of an imperfect approximator when it is knowably in a dangerously wrong setting.
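To make the distinction concrete, here is a minimal toy sketch (not from the original comment) of an agent whose utility is defined over its believed world state rather than over the raw sensor reading; all names and numbers are hypothetical illustrations of the "discount" described above.

```python
# Toy sketch: utility is computed from the agent's model of the real
# paperclip count, not from what its counter would display.

def expected_utility(action, belief):
    """Score an action by the paperclip count the agent believes will
    actually exist afterwards, ignoring any known sensor bias."""
    true_clips, reading_offset = belief  # believed real clips, believed sensor bias

    if action == "make_paperclips":
        true_clips += 10          # real clips get produced
    elif action == "hack_sensor":
        reading_offset += 1000    # the display goes up, the world does not change

    # Utility depends only on the believed true state, so a known sensor
    # bias contributes nothing to the score.
    return true_clips

belief = (0, 0)
actions = ["make_paperclips", "hack_sensor"]
best = max(actions, key=lambda a: expected_utility(a, belief))
print(best)  # -> "make_paperclips": inflating the reading has no modeled value
```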
How do they define the true world state, anyway? And how do they discriminate between actions that decrease deviation versus ones that increase it?