I am looking for papers that support or attack the argument that sufficiently intelligent AIs will be easier to make safe, because their world models will let them understand that we don't want our instructions interpreted in a ruthlessly technical or precise way, nor received in bad faith.

The argument I want supported or disproven is that such an AI would know we don't want an outcome that merely looks good, but one that is actually good by our own mental definitions. It would be able to look at human decisions, throughout history and in the present, to understand this fuzziness and moderation.