How long do maximisers maximise for?
For example, if an ASI is told to “make as many paperclips as possible”, how long is it maximising for? The next second? The next year? Indefinitely?
If a paperclip maximiser only cared about making as many paperclips as possible over the next hour, say, and this goal restarted every hour, maybe it would never be optimal for it to spend time on things like disempowering humanity, because it only ever cares about the next hour and disempowering humanity would take too long.
Would a paperclip maximiser rather make 1 thousand paperclips today, or disempower humanity, take over, and make 1 billion paperclips tomorrow?
Is there perhaps some way an ASI could be given something to maximise up to a set point in the future, with that time gradually increased, so that it might be easier to spot when it starts heading towards undesirable actions?
For example, if a paperclip maximiser is told to “make as many paperclips as possible in the next hour”, it might just use the tools it has available, without bothering with extreme actions like human extinction, because those would take too long. We could gradually increase the time, even by the second if necessary. If, in this hypothetical, 10 hours is the point at which human disempowerment, extinction and so on become optimal, perhaps 9.5 hours is the point at which actions that are bad, but less bad than extinction, become optimal. This might mean we get a kind of warning shot.
There are problems I see with this. Just because it wasn’t optimal to wipe out humanity when maximising for the next 5 hours on one day doesn’t mean it won’t be optimal when maximising for the next 5 hours on some other day. Also, there might be a point at which the optimal action goes from completely safe to terrible simply by adding another minute to the time limit, with few or no shades of grey in between.
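To make that last worry concrete, here is a minimal toy model (entirely made-up rates and times, not anything implied by the original thought experiment) where the optimal plan of a horizon-limited maximiser flips abruptly as the horizon grows:

```python
# Toy model of a horizon-limited paperclip maximiser. All rates and times are
# invented purely for illustration.

TOOL_RATE = 1_000          # paperclips per hour using the tools it already has
TAKEOVER_SETUP = 10.0      # hours of zero production spent disempowering humanity
TAKEOVER_RATE = 1_000_000  # paperclips per hour once it has taken over

def best_action(horizon_hours: float) -> str:
    """Return whichever action yields more paperclips by the deadline."""
    clips_with_tools = TOOL_RATE * horizon_hours
    clips_with_takeover = max(0.0, horizon_hours - TAKEOVER_SETUP) * TAKEOVER_RATE
    return "use existing tools" if clips_with_tools >= clips_with_takeover else "take over first"

for h in (1, 5, 9.5, 10.0, 10.01, 10.5, 24):
    print(f"horizon = {h:>5} hours -> {best_action(h)}")
```

With these invented numbers the plan flips from “use existing tools” to “take over first” just after the 10-hour mark, and nothing in between ever becomes optimal, so there is no intermediate “less bad” behaviour to serve as a warning shot.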
Good questions.
I think the assumption is that unless a limit is specified, it maximises without limit. If you said “make me as much money as you can”, you probably don’t want it stopping any time soon. The same applies to the colour of the paperclips: seeing as you didn’t say they should be red, you shouldn’t assume they will be.
The issue with maximisers is precisely that they maximise. They were introduced to illustrate the problems with just trying to get a number as high as possible. At some point they’ll sacrifice something else of value, just to get a tiny advantage. You could try to provide a perfect utility function, which always gives exactly correct weights for every possible action, but then you’d have solved alignment.
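As a sketch of that “sacrifice something else of value for a tiny advantage” point (a toy example of my own, with invented numbers, not a description of any real system):

```python
# Toy maximiser: anything the utility function does not score gets sacrificed
# for an arbitrarily small gain in the thing it does score.

# Each outcome is (paperclips made, value of everything else that survives).
actions = {
    "run the factory normally":           (1_000_000, 1.0),
    "also strip-mine the nature reserve": (1_000_001, 0.0),  # one extra clip
}

def utility(outcome):
    paperclips, everything_else = outcome
    return paperclips  # 'everything_else' simply isn't part of the objective

best = max(actions, key=lambda name: utility(actions[name]))
print(best)  # -> "also strip-mine the nature reserve", for a single extra paperclip
```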
Does “make as many paperclips as possible in the next hour” mean “undergo such actions that in one hour’s time will result in as many paperclips as possible”, or “for the next hour, do whatever will result in the most paperclips overall, including in the far future”?
When I say “make as many paperclips as possible in the next hour” I basically mean “undergo such actions that in one hour’s time will result in as many paperclips as possible”. So if you tell the AI to do this at 12:00, it only cares about how many paperclips it has made when the clock hits 13:00, and does not care at all about any time past 13:00.
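One rough way to write the difference down (entirely my own notation; `clips(t)` is a hypothetical count of paperclips in existence at time t, in hours):

```python
# My own rough formalisation of the two readings of the one-hour instruction.

DEADLINE = 1.0  # the one-hour mark

def reading_1(clips):
    """Only the count at the deadline is scored; nothing after it matters."""
    return clips(DEADLINE)

def reading_2(clips, far_future=1_000_000.0):
    """The agent only acts until the deadline, but is scored on the long run."""
    return clips(far_future)

# A trivial stand-in trajectory: a steady 100 paperclips per hour, forever.
steady = lambda t: 100 * t
print(reading_1(steady), reading_2(steady))  # 100.0 vs 100000000.0
```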
If you make a paperclip maximiser and you don’t specify any time limit, how much does it care about WHEN the paperclips are made? I assume it would rather have 20 now than 20 in a month’s time, but would it rather have 20 now or 40 in a month’s time?
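How it trades 20 now against 40 later depends entirely on whether and how its utility function discounts the future. A back-of-the-envelope sketch, assuming exponential discounting (just one common modelling choice, not something inherent to an ASI):

```python
# Assumed exponential discounting with a hypothetical per-day factor.

gamma = 0.99               # value of a paperclip tomorrow relative to today
now = 20
later = 40 * gamma ** 30   # 40 paperclips, 30 days from now

print(f"20 now is worth {now}; 40 in a month is worth {later:.1f} today")
# With gamma = 0.99 waiting is worth about 29.6, so it prefers the 40. The
# preference flips once gamma drops below 0.5 ** (1 / 30), roughly 0.977.
# An undiscounted maximiser (gamma = 1) only cares about the final count,
# not when the paperclips get made.
```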
Would a paperclip maximiser first work out the absolute best way to maximise paperclips before actually making any?
If this is the case, or if the paperclip maximiser cares more about the number of paperclips in the far future than about the number now, perhaps it would spend a few millennia studying the deepest secrets of reality and then, through some sort of ridiculously advanced means, turn all matter in the universe into paperclips instantaneously. Perhaps this would end up producing more paperclips, faster, than spending those millennia actually making them.
As a side note: would a paperclip maximiser eventually (presumably after using up all other available atoms) self-destruct, since it too is made up of atoms that could be used for paperclips?
By the way, I have very little technical knowledge, so most of what I say is closer to thought experiments or philosophy based on limited knowledge. There may be very basic reasons, which I am unaware of, why parts of my thought process make no sense.
The answer to all of these is a combination of “dunno” and “it depends”, in that implementation details would be critical.
In general, you shouldn’t read too much into the paperclip maximiser, or rather shouldn’t go too deep into its specifics. Mainly because it doesn’t exist. It’s fun to think about, but always remember that each additional detail makes the overall scenario less likely.
I was unclear about why I asked for clarification of “make as many paperclips as possible in the next hour”. My point there was that you should assume whatever is not specified will be interpreted in whatever way is most likely to blow up in your face.