It seems to me that there are some serious practical problems in trying to train this sort of behaviour. After all, a successful execution shuts the system off and it never updates on the training signal. You could train it for something like “when the date from the clock input exceeds the date on input SDD, output high on output SDN (which in the live system will feed to a shutdown switch)”, but that’s a distant proxy. It seems unlikely to generalize correctly to what you really want, which is much fuzzier.
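To make the proxy concrete, here is a minimal sketch of what that trained behaviour boils down to. The function name and the use of Python dates are my own illustration; SDD and SDN are just the input/output labels from the quoted objective:

```python
from datetime import date

# Toy version of the proxy objective: drive output SDN high once the date
# from the clock input exceeds the date on input SDD.
def proxy_shutdown_signal(clock_date: date, sdd_date: date) -> int:
    return 1 if clock_date > sdd_date else 0

# Before the shutdown date SDN stays low; afterwards it goes high.
assert proxy_shutdown_signal(date(2029, 12, 31), date(2030, 1, 1)) == 0
assert proxy_shutdown_signal(date(2030, 1, 2), date(2030, 1, 1)) == 1
```

The gap between this comparator and "shut down in the way humans actually want" is the whole problem.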
For example, what you really want is more along the lines of determining the actual date (by unaltered human standards), comparing it with the actual human-desired shutdown date (without manipulating what the humans want), and actually shutting down (by means that don’t harm any humans or anything else they value). Except that even this messy statement isn’t nearly tight enough, and a superintelligent system would eat the world in a billion possible ways even assuming the training left the system actually trying to meet this objective.
How are we going to train a system to generalize to this sort of objective without it already being Friendly AGI?
To clarify, this is intended to be a test-time objective; I’m assuming the system was trained in simulation and/or by observing the environment. In general, this reward wouldn’t need to be “trained” – it could just be hardcoded into the system. If you’re asking how the system would understand its reward without having experienced it already, I’m assuming that sufficiently-advanced AIs have the ability to “understand” their reward function and optimize on that basis. For example, “create two identical strawberries on the cellular level” can only be plausibly achieved via understanding, rather than encountering the reward often enough in simulation to learn from it, since it’d be so rare even in simulation.
Modern reinforcement learning systems receive a large positive reward (or, more commonly, an end to negative rewards) when ending the episode, and this incentivizes them to end the episode quickly (sometimes suicidally). If you only provide this “shutdown reward”, I’d expect to see the same behavior, but only after a certain time period.
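As a rough sketch of that incentive (the numbers and names here are illustrative, not anyone’s actual reward function): each live step costs a small negative reward, and a shutdown action pays a bonus only once the shutdown time has passed, so a return-maximizing agent shuts down at the first step where the bonus becomes available.

```python
# Toy per-step reward with only a "shutdown reward": ongoing steps cost -1
# (the "end to negative rewards" case), and shutting down pays +10, but only
# once the shutdown time has been reached. The agent is pushed to end the
# episode as soon as the bonus unlocks, i.e. the suicidal behaviour described
# above, just delayed until the time gate opens.
def step_reward(t: int, shutdown_time: int, chose_shutdown: bool) -> float:
    if chose_shutdown:
        return 10.0 if t >= shutdown_time else 0.0
    return -1.0
```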