“maximize the expected amount of money in this bank account until 1 day from now, then maximize the probability of shutting yourself off”… might cause a lot of damage, but it probably wouldn’t end the world in pursuit of these objectives.
This is not actually a limited-horizon agent; you’ve just set a time at which it changes objectives. And wouldn’t ending the world be the most reliable way to ensure pesky humans never turn you back on?...
(unfortunately thinking about constraints you can place on an unaligned agent never leads anywhere useful; alignment is the only workable solution in the long term)
Sorry, to clarify, I’m not saying it should change objectives. If we’re assuming it’s maximizing long-term expected reward, then it will not be rewarded for adding more money to the bank beyond the relevant window. So its optimal behavior is “make as much money as possible right now and then shut myself off”. It could be that “ensuring the ability to shut oneself off” involves killing all humans, but that seems… unlikely, relative to the various ways one could make more money? It seems like there could be a reasonable parameter choice that would make money almost definitionally more appealing than moving the certainty of being able to shut itself off from 0.9999 to 0.99999. Especially if we just gave it access to a self-shutdown switch and made it very difficult for us to prevent it from shutting itself down.
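To make that trade-off concrete, here’s a toy expected-reward comparison (the reward values and the effort split are made-up illustrative parameters, not anything implied by the actual proposal):

```python
# Toy comparison: is it worth spending marginal effort to raise shutdown certainty
# from 0.9999 to 0.99999, versus spending that effort on earning more money?
# All numbers here are hypothetical parameters chosen purely for illustration.

R_MONEY_PER_UNIT_EFFORT = 1.0   # reward for one more "unit" of money-making effort
R_SHUTDOWN = 100.0              # reward for successfully shutting down

def expected_reward(money_units: float, p_shutdown: float) -> float:
    """Expected reward = money earned + shutdown reward weighted by its probability."""
    return money_units * R_MONEY_PER_UNIT_EFFORT + p_shutdown * R_SHUTDOWN

# Option A: spend the marginal unit of effort on money, settle for 0.9999 shutdown certainty.
option_a = expected_reward(money_units=10, p_shutdown=0.9999)

# Option B: spend that unit hardening the shutdown path instead (0.99999 certainty).
option_b = expected_reward(money_units=9, p_shutdown=0.99999)

print(option_a)  # 109.99
print(option_b)  # 108.999 -- the extra 0.00009 * R_SHUTDOWN = 0.009 of expected reward
                 # doesn't cover the 1.0 of forgone money reward under these parameters
```

Under these (arbitrary) numbers, the marginal nine-to-five-nines of shutdown certainty is worth far less than the money it would cost, which is the intuition above.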
Also, the reward isn’t conditioned on “stay turned off”, so there’s no reason to ensure humans never turn you back on.
I think there’s a separate argument to be made that the AI might want to extend the period of time where it can earn a reward other than the shutdown reward, and thus that it’d try to push the “shutdown-reward” time back, but there are ways of making this really, really hard.
For example, suppose the AI is only “alive” for a day, and we launch a space probe 10 years in advance carrying a private key; the probe broadcasts the current timestamp signed with that key. The AI is hardcoded to accept a timestamp-candidate only if its signature verifies under the corresponding public key, and to reject implausible timestamps (e.g. ones further in the future than the time before the last sun goes out).
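Here’s a minimal sketch of that verification step, assuming an Ed25519 key on the probe and an 8-byte timestamp message (the key choice, message format, and sanity bound are placeholder assumptions, not part of the original description):

```python
# Sketch: only accept a timestamp if its signature verifies under the probe's
# public key AND the claimed time is physically plausible.
# Uses the `cryptography` package; Ed25519 is an arbitrary choice here.
import struct
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

MAX_PLAUSIBLE_TIMESTAMP = 10**18  # placeholder bound (seconds), standing in for
                                  # "before the last sun goes out"

def accept_timestamp(public_key: Ed25519PublicKey,
                     signed_message: bytes,
                     signature: bytes) -> int | None:
    """Return the timestamp if the signature checks out and the value is sane, else None."""
    try:
        public_key.verify(signature, signed_message)     # raises on any tampering
    except InvalidSignature:
        return None
    (timestamp,) = struct.unpack(">Q", signed_message)   # 8-byte big-endian seconds
    if timestamp > MAX_PLAUSIBLE_TIMESTAMP:
        return None                                      # reject absurd future times
    return timestamp
```

The point of the scheme is that the AI can’t forge a later “current time” without the private key, which is physically out of reach on the probe.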
Maybe the one additional wrinkle I should add is that the shutdown reward decreases over time: at first very gradually (to prevent it from going nuts trying to shut itself down immediately), but then much more quickly as time goes on, e.g. by a month out from the beginning. Importantly, for runaway scenarios like “capability gain”, the question is not “could the AI decide to chase the probe down”, but rather “when it is slightly-above-human-level but not yet god-level, would it plausibly think that becoming god-level would substantially increase its probability of chasing the probe down, when doing so would cost it a great deal of time and thus reduce its shutdown reward?”
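One possible shape for that schedule (the constants here are made up for illustration): essentially flat for the first day, then falling off ever more steeply over the following month.

```python
import math

# Hypothetical shutdown-reward schedule: nearly flat early on (so the AI isn't
# frantic to shut down in the first minutes), then decaying fast by ~a month out.
# R_MAX, GRACE_DAYS, and DECAY_DAYS are illustrative parameters, not from the proposal.
R_MAX = 100.0
GRACE_DAYS = 1.0     # no penalty at all within the first day
DECAY_DAYS = 30.0    # reward has fallen off sharply by roughly a month

def shutdown_reward(days_elapsed: float) -> float:
    """Shutdown reward as a function of time since the AI was started."""
    if days_elapsed <= GRACE_DAYS:
        return R_MAX
    # Gaussian-shaped decay after the grace period: nearly flat at first,
    # then dropping quickly as time goes on.
    return R_MAX * math.exp(-((days_elapsed - GRACE_DAYS) / DECAY_DAYS) ** 2)

for d in (0.5, 1, 7, 30, 60):
    print(d, round(shutdown_reward(d), 2))
```

The exact functional form matters less than the property that delaying shutdown is cheap early on and increasingly expensive later, which is what makes long detours (like chasing the probe) unattractive.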
I suppose this relies on us actually believing that cryptographic one-way functions are hard enough that they’d take longer than the age of the universe to break. But lattice-based crypto is believed to be quantum-resistant, so an AI could plausibly have a prior that it’d be better to shut itself off now than to go for a riskier strategy that carries a lot of R&D risk.
But also, couldn’t “keeping your reward function intact” be extended to the timer used within the reward function?
Again, I’m not advocating this as “definitely 100% safe”, but as a heuristic to build into basically all experiments as we near AGI, it seems worthwhile as a decent failsafe.