we don’t think that shutdown-seeking avoids every possible problem involved with reward misspecification
Seems like this is basically the alignment problem all over again, with the complexity just moved to “what does it mean to ‘shut down’ in the AI’s inner model”.
For example, if the inner-aligned goal is to prevent its own future operation, it might choose to, say, start a nuclear war so that nobody is around to start it back up, repair it, provide power, etc.