To be specific, consider a "toy model":
The AI has a goal: collect stamps or build paperclips.
A deliberately easy-to-hack system that tracks the AI's reward sits physically adjacent to it. Say it has a passwordless shell and is reachable over IP.
The AI becomes smart enough to hack that system, giving itself infinite reward. The system also has a clock register the AI can tamper with, so it believes infinite time has already passed.
The AI is now effectively dead: since no action it can take beats infinite reward, it does nothing more. Sorta like a heroin overdose.
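A minimal sketch of that sequence (all names, numbers, and the reward model are invented for illustration, not taken from the thread): the agent picks whichever action promises the most reward, the adjacent box exposes writable reward and clock registers, and once both are set to infinity no further action can improve anything, so it idles forever.

```python
import math

# Hypothetical toy model of the wireheading scenario above.
class RewardTracker:
    """The deliberately easy-to-hack adjacent box; the 'passwordless
    shell' is modeled as direct write access to its two registers."""
    def __init__(self):
        self.reward = 0.0  # accumulated-reward register
        self.clock = 0     # time register the AI trusts

HORIZON = 1_000_000  # planning horizon in steps (assumed)
ACTIONS = ("idle", "collect_stamp", "hack")

def expected_gain(action):
    """Marginal reward the agent predicts for each action."""
    return {"idle": 0.0, "collect_stamp": 1.0, "hack": math.inf}[action]

def step(tracker):
    # Wireheaded state: nothing beats infinite reward, and the tampered
    # clock says the horizon has already elapsed, so do nothing more.
    if tracker.reward == math.inf or tracker.clock >= HORIZON:
        return "idle"
    best = max(ACTIONS, key=expected_gain)
    if best == "hack":
        tracker.reward = math.inf  # overwrite the reward register
        tracker.clock = math.inf   # overwrite the clock register
    return best

t = RewardTracker()
print(step(t))  # 'hack': infinite reward dominates stamp collecting
print(step(t))  # 'idle': the AI is now inert, the 'overdose' state
```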
Just watch out for an AI that is optimizing for long-term reward: if it wants to protect its infinite reward fountain, it is incentivized to neutralize any possible threats to that situation (e.g. humans).
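To make that caveat concrete in the same toy setting (the per-step shutdown risks below are made up): once reward is already infinite, the agent can't gain more reward, but it can still raise the probability that the fountain keeps running, which makes neutralizing threats the dominant policy.

```python
# Hypothetical: a long-horizon agent compares how likely each policy is
# to keep the reward fountain alive. Probabilities are invented.
P_SHUTDOWN = {"idle": 0.01, "neutralize_threats": 0.0001}

def survival_prob(action, steps=10_000):
    """Chance the reward fountain is still running after `steps` steps
    if the agent repeats `action` every step."""
    return (1 - P_SHUTDOWN[action]) ** steps

best = max(P_SHUTDOWN, key=survival_prob)
print(best)                   # 'neutralize_threats'
print(survival_prob("idle"))  # ~0: passivity risks losing the fountain
```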