That’s not a simple problem. First you have to specify “not killing everyone” robustly (outer alignment), and then you have to train the AI to have this goal rather than an approximation of it (inner alignment).
Anyway, the rest of your response is spent on the case where the AI cares about its perception of the paperclips rather than the paperclips themselves. I’m not sure how severity level 1 would come about, given that the AI should only care about its reward score. Once you admit that the AI cares about worldly things like “am I turned on”, it seems pretty natural that the AI would care about the paperclips themselves rather than its perception of the paperclips. Nevertheless, even in severity level 1, there is still no incentive for the AI to care about future AIs, which contradicts concerns that non-superintelligent AIs would fake alignment during training so that future superintelligent AIs would be unaligned.
See my other comment for the response.