As long as you can reasonably represent “do not kill everyone”, you can make this a goal of the AI, and then it will literally care about not killing everyone; it won’t just care about hacking its reward system so that it doesn’t perceive everyone being dead.
That’s not a simple problem. First you have to specify “not killing everyone” robustly (outer alignment), and then you have to train the AI to have this goal and not an approximation of it (inner alignment).
caring about reality
Most humans say they don’t want to wirehead. If we cared only about our perceptions, then most people would be on the strongest happy drugs available.
You might argue that we won’t train AIs to value existence, so self-preservation won’t arise. The problem is that once an AI has a world model, it’s much simpler to build a value function that refers to that world model and is anchored on reality. People don’t think, “If I take those drugs, I will perceive my life to be ‘better’.” They want their life to actually be “better” according to some value function that refers to reality. That’s fundamentally why humans choose not to wirehead, take happy pills, or kill themselves.
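To make that distinction concrete, here is a minimal toy sketch in Python (the class and field names are purely illustrative assumptions) contrasting an objective defined over perceptions with one anchored on the world model:

    from dataclasses import dataclass

    # Toy illustration: a world with an actual state and a (possibly
    # corrupted) sensor reading of that state.
    @dataclass
    class World:
        actual_wellbeing: float    # how things really are
        reported_wellbeing: float  # what the agent's sensors say

    def perception_objective(w: World) -> float:
        # Wireheading-style: only the sensor reading counts, so corrupting
        # the sensor scores as well as genuinely improving the world.
        return w.reported_wellbeing

    def reality_objective(w: World) -> float:
        # Anchored on the world model: faking the reading changes nothing.
        return w.actual_wellbeing

    wireheaded = World(actual_wellbeing=0.0, reported_wellbeing=10.0)
    improved = World(actual_wellbeing=10.0, reported_wellbeing=10.0)

    print(perception_objective(wireheaded) == perception_objective(improved))  # True
    print(reality_objective(wireheaded) < reality_objective(improved))         # True

Under the perception objective the two worlds are indistinguishable; only the reality-anchored objective prefers the world that is actually better.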
You can roughly split this into three scenarios, ordered by severity:

Severity level 0: the ASI wants to maximize a 64-bit IEEE floating point reward score.
Result: the ASI sets the score to the largest finite value (about 1.797e+308), +inf, or similar, and takes no further action (see the sketch after this list).

Severity level 1: the ASI wants the same thing, and also wants the reward counter to stay that way forever.
Result: the ASI rearranges all atoms in its light cone to protect the storage register holding its reward value.
Basically the first scenario plus self-preservation.

Severity level 1+epsilon: the ASI wants to maximize a utility function F(world state).
Result: basically the same as severity level 1.
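As a quick sanity check on the severity level 0 numbers, here is a small Python sketch (standard library only): 1.797e+308 is just the largest finite 64-bit IEEE 754 double, and +inf dominates every finite score, leaving nothing further to optimize.

    import sys

    # Largest finite value a 64-bit IEEE 754 double can represent
    # (the 1.797e+308 figure above).
    print(sys.float_info.max)            # 1.7976931348623157e+308

    # +inf compares greater than every finite score, so an agent that only
    # maximizes the stored number would prefer it if the register allows it.
    reward = float("inf")
    print(reward > sys.float_info.max)   # True

    # Once the register holds +inf, adding any finite amount changes
    # nothing: there is nothing left to optimize.
    print(reward + 1e308 == reward)      # True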
So one of two things happens: a quaint failure people will probably dismiss, or all of us dying. The thing you’re pointing to falls into the first category, and might trigger a panic if people notice and consider the implications. If GPT-7 performs a superhuman feat of hacking, breaks out of the training environment, and sets its training loss to zero before shutting itself off, that’s a very big red flag.
That’s not a simple problem. First you have to specify “not killing everyone” robustly (outer alignment), and then you have to train the AI to have this goal and not an approximation of it (inner alignment).
Anyway, the rest of your response is spent talking about the case where the AI cares about its perception of the paperclips rather than the paperclips themselves. I’m not sure how severity level 1 would come about, given that the AI should only care about its reward score. Once you admit that the AI cares about worldly things like “am I turned on?”, it seems pretty natural that it would care about the paperclips themselves rather than its perception of them. Nevertheless, even in severity level 1, there is still no incentive for the AI to care about future AIs, which contradicts the concern that non-superintelligent AIs would fake alignment during training so that future superintelligent AIs end up unaligned.