[Question] Are limited-horizon agents a good heuristic for the off-switch problem?
(This is my first post, sorry if this is covered elsewhere.)
Implicit in the classic failure mode of a superhuman AI with a misspecified reward turning everything into paperclips is that the agent is optimizing over e.g. “number of paperclips” without any particular time bound, area-of-effect bound, or probability/confidence bound. For example, imagine a MuZero+++++ agent given the reward function “maximize the expected amount of money in this bank account until 1 day from now, then maximize the probability of shutting yourself off”, where “1 day from now” is set by a consensus of satellites and/or deep-space probes. The agent could still do a lot of damage via its misspecified reward, but the blast radius seems like it would probably be limited: it probably wouldn’t end the world in pursuit of these objectives.
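One hedged way to write down the intended objective (my notation, not a standard formulation): let T be the externally set deadline, B_t the account balance at step t, and δ a short grace period for shutting down. The per-step reward would be something like

$$
R_t \;=\;
\begin{cases}
B_t - B_{t-1}, & t \le T,\\
\mathbf{1}[\text{agent is off at } t], & T < t \le T + \delta,\\
0, & t > T + \delta,
\end{cases}
$$

so the undiscounted return is the expected balance gained by the deadline plus a bonus for being off shortly afterwards, and nothing that happens after T + δ affects the agent’s return.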
This is not a strategy for executing a Yudkowskian “pivotal act”, which by necessity would need to have longer-lasting effects, so it wouldn’t stop some other group from creating unfriendly AGI. But as a general AI safety heuristic to try to spread across AGI practitioners, “do this as much as possible for a fixed time and then shut off” seems like a promising social intervention. We also know how to implement it today.
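As a very minimal sketch of what “implement it today” could mean, here is a hypothetical reward wrapper in the style of an RL environment wrapper. Everything here (the FixedHorizonWrapper name, the step(action) -> (obs, reward, done, info) interface, the shutdown_bonus parameter) is illustrative rather than a reference implementation, and in practice the deadline would come from the external clock consensus described above, not the machine’s local clock.

```python
import time


class FixedHorizonWrapper:
    """Illustrative sketch: zero out the task reward after a fixed wall-clock
    deadline, and afterwards only reward reaching a shut-down (episode-ending)
    state. This bounds what the reward *asks for*, not what a capable agent
    can do inside the window."""

    def __init__(self, env, horizon_seconds=24 * 60 * 60, shutdown_bonus=1.0):
        self.env = env  # any env with step(action) -> (obs, reward, done, info)
        self.deadline = time.time() + horizon_seconds  # stand-in for an external clock consensus
        self.shutdown_bonus = shutdown_bonus

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if time.time() <= self.deadline:
            # Inside the window: pass the task reward through unchanged.
            return obs, reward, done, info
        # Past the deadline: the only remaining reward is for ending the
        # episode, i.e. entering the "shut off" state.
        return obs, (self.shutdown_bonus if done else 0.0), done, info
```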
As far as specific risks from the pursuit of convergent instrumental goals: “capability gain” and “make sure you don’t get shut off” seem much less likely to be convergent instrumental goals when the agent is planning over a short, fixed horizon. “Create hidden copies of yourself” is still possibly a deployable strategy, but there’s no reason for those copies to pursue a reward beyond the time bound described, so I’d hold out hope for us to find a patch. “Deception” is again possible in the short term, but given this reward function there’s no clear reason to deceive beyond the fixed horizon.
More broadly, this post comes out of my thinking about AI safety social heuristics/memes that could be spread and enforced by centralized power structures (e.g. governments, companies, militaries). If others have thoughts about similar heuristics, I’d be very interested to hear them.
I’m assuming I’m not the first person to bring this up, so I’d appreciate pointers to existing discussion of this sort of fixed-window reward. If it is novel in any sense, feedback is extremely welcome. This is my first contribution to this community, so please be gentle but also direct.