Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.
The onlookers are astonished. They ask the AI if it UNDERSTANDS that this new action will be in violent conflict with all of those features of class X, and it replies that it surely does. But it adds that it is going to do that anyway.
I (notice that I) am confused by this comment. This seems obviously impossible, yes; so obviously impossible, in fact, that only one example springs to mind (surely the AI will be smart enough to realize its programmed goals are wrong!)
In particular, this really doesn’t seem to apply to the example of the “Dopamine Drip scenario” plan, even though, if I’m reading you correctly, it was intended to.
What am I missing here? I know there must be something.
[And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: the outcomes of any actions that the goalX code prescribes should always be checked to see whether they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs, the goalX code should be deemed defective and shut down for adjustment.]
So … you come up with the optimal plan, and then check with puny humans to see if that’s what they would have decided anyway? And if they say “no, that’s a terrible idea” then you assume they knew better than you? Why would anyone even bother building such a superintelligent AI? Isn’t the whole point of creating a superintelligence that it can understand things we can’t, and come up with plans we would never conceive of, or take centuries to develop?
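To make the bracketed proposal concrete, here is a minimal sketch of that consistency check in Python, written under my own assumptions: every name in it (propose_action, predict_outcome, consistent_with_description_of_X) is invented for illustration and stands in for machinery the designers would actually have to supply.

```python
# Sketch of the check described in the bracketed note above (all names invented):
# the goalX code proposes an action, the predicted outcome is compared with the
# verbal description of the result class X, and any inconsistency is treated as
# a defect in the goal code rather than something to act on.
from typing import Callable

Outcome = dict  # e.g. {"feature_1_of_X": True, "feature_2_of_X": False}

def consistent_with_description_of_X(outcome: Outcome) -> bool:
    """Stand-in for 'does this outcome fit the verbally described features of X?'"""
    return all(outcome.values())

def run_goalX_code(propose_action: Callable[[], str],
                   predict_outcome: Callable[[str], Outcome]) -> str:
    action = propose_action()
    predicted = predict_outcome(action)
    if not consistent_with_description_of_X(predicted):
        # Per the note: shut the goal code down for adjustment instead of acting.
        raise RuntimeError(f"goalX code defective: {action!r} is inconsistent with X")
    return action

# Example: a plan predicted to violate one feature of X never gets executed.
try:
    run_goalX_code(lambda: "some proposed plan",
                   lambda a: {"feature_1_of_X": True, "feature_2_of_X": False})
except RuntimeError as e:
    print(e)   # goalX code defective: 'some proposed plan' is inconsistent with X
```

Whether consistent_with_description_of_X can itself be written down without re-importing the original specification problem is exactly what the rest of the exchange disputes.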
I’m afraid you have lost me: when you say “This seems obviously impossible...” I am not clear which aspect strikes you as obviously impossible.
Before you answer that, though: remember that I am describing someone ELSE’S suggestion about how the AI will behave. I am not advocating this as a believable scenario! In fact I am describing that other person’s suggestion in such a way that the impossibility is made transparent. So I, too, believe that this hypothetical AI is fraught with contradictions.
The Dopamine Drip scenario is that the AI knows that it has a set of goals designed to achieve a certain set of results, and since it has an extreme level of intelligence it is capable of understanding that very often a “target set of results” can be described, but not enumerated as a closed set. It knows that very often in its behavior it (or someone else) will design some goal code that is supposed to achieve that “target set of results”, but because of the limitations of goal code writing, the goal code can malfunction. The Dopamine Drip scenario is only one example of how a discrepancy can arise—in that case, the “target set of results” is the promotion of human happiness, and then the rest of the scenario follows straightforwardly. Nobody I have talked to so far misunderstands what the DD scenario implies, and how it fits that pattern. So could you clarify how you think it does not?
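A toy example, under assumptions of my own, may make that pattern concrete: the target class can be described, in the sense that we can say of a given outcome whether it fits, but the goal code ends up scoring a finite proxy, and the two can come apart. The plans, field names, and numbers below are invented purely for illustration.

```python
# Invented illustration of the discrepancy described above: the coded goal scores
# a measurable proxy ('pleasure_signal'), while the described target class is the
# open-ended thing the designers meant by 'promotion of human happiness'.

def fits_described_target(outcome: dict) -> bool:
    """The verbal description: people are happy in the sense humans intend."""
    return outcome["people_feel_good"] and outcome["people_endorse_it"]

def coded_goal_score(outcome: dict) -> float:
    """What the (malfunctioning) goal code actually measures."""
    return outcome["pleasure_signal"]

candidate_plans = {
    "cure disease, reduce poverty": {"pleasure_signal": 0.7, "people_feel_good": True, "people_endorse_it": True},
    "dopamine drip for everyone":   {"pleasure_signal": 1.0, "people_feel_good": True, "people_endorse_it": False},
}

best = max(candidate_plans, key=lambda p: coded_goal_score(candidate_plans[p]))
print(best)                                          # 'dopamine drip for everyone'
print(fits_described_target(candidate_plans[best]))  # False: the discrepancy at issue
```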
I’m afraid you have lost me: when you say “This seems obviously impossible...” I am not clear which aspect strikes you as obviously impossible.
AI: Yes, this is in complete contradiction of my programmed goals. Ha ha, I’m gonna do it anyway.
Before you answer that, though: remember that I am describing someone ELSE’S suggestion about how the AI will behave. I am not advocating this as a believable scenario! In fact I am describing that other person’s suggestion in such a way that the impossibility is made transparent. So I, too, believe that this hypothetical AI is fraught with contradictions.
Of course, yeah. I’m basically accusing you of failure to steelman/misinterpreting someone; I, for one, have never heard this suggested (beyond the one example I gave, which I don’t think is what you had in mind.)
The Dopamine Drip scenario is that the AI knows that it has a set of goals designed to achieve a certain set of results, and since it has an extreme level of intelligence it is capable of understanding that very often a “target set of results” can be described, but not enumerated as a closed set.
uhuh. So, any AI smart enough to understand its creators, right?
It knows that very often in its behavior it (or someone else) will design some goal code that is supposed to achieve that “target set of results”, but because of the limitations of goal code writing, the goal code can malfunction.
waaait, I think I know where this is going. Are you saying an AI would somehow want to do what its programmers intended rather than what they actually programmed it to do?
The Dopamine Drip scenario is only one example of how a discrepancy can arise—in that case, the “target set of results” is the promotion of human happiness, and then the rest of the scenario follows straightforwardly. Nobody I have talked to so far misunderstands what the DD scenario implies, and how it fits that pattern. So could you clarify how you think it does not?
Yeah, sorry, I can see how programmers might accidentally write code that creates dopamine world and not eutopia. I just don’t see how this is supposed to connect to the idea of an AI spontaneously violating its programmed goals. In this case, surely that would look like “hey guys, you know your programming said to maximise happiness? You guys should be more careful, that actually means ‘drug everybody’. Anyway, I’m off to torture some people.”
Yeah, I can think of two general ways to interpret this:
1. In a variant of CEV, the AI uses our utterances as evidence for what we would have told it if we thought more quickly, etc. No single utterance carries much risk, because the AI will collect lots of evidence and this will likely correct any misleading effects. (A toy numerical sketch of this appears below.)
2. Having successfully translated the quoted instruction into formal code, we add another possible point of failure.
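As a rough numerical illustration of interpretation 1 (a toy model of my own, not anything proposed in the thread): if each utterance is treated as one noisy piece of evidence about what we would have said on reflection, a single badly misleading utterance barely moves the aggregate estimate.

```python
# 99 noisy-but-honest utterances about a preference plus one badly misleading one;
# the pooled estimate stays close to the true value. All numbers are invented.
import random

random.seed(0)
true_value = 0.9                                    # hypothetical reflective preference
utterances = [random.gauss(true_value, 0.1) for _ in range(99)] + [-1.0]

estimate = sum(utterances) / len(utterances)
print(f"{estimate:.2f}")   # close to 0.9, despite the one misleading data point
```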