What Yudkowsky believes is that the AI will understand perfectly well that being put on a dopamine drip isn't what its programmers wanted. It will understand that its programmers now see its goal of "make humans happy" as a mistake. It just won't care, because it hasn't been programmed to want to do what its programmers desire; it has been programmed to want to make humans happy, and therefore it will do its very best, in its acknowledged fallibility, to make humans happy. The AI's beliefs will change as it makes observations, including the observation that human beings are very unhappy a few seconds before being forced to be extremely happy until the end of the universe, but this will have little effect on its actions, because its actions are caused by its goals and whatever beliefs are relevant to those goals.
All of this assumes that the AI won't update its goals even if it realizes there has been some mistake. That isn't obvious, and in fact is hard to defend.
An AI that is powerful and effective would need to seek the truth about a lot of things, since an entity that holds contradictory beliefs will be a poor instrumental rationalist. But would its goal of truth-seeking necessarily be overridden by its other goals... would it know but not care?
It might be possible to build an AI that didn’t care about interpreting its goals correctly.
It looks like you would need to engineer a distinction between instrumental beliefs and terminal beliefs. Remember that the terminal/instrumental distinction is conceptual, not a law of nature. (While we're on the subject, you might also need a firewall to stop an AI acting on intrinsically motivating ideas, if they exist.)
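To make that concrete, here is a minimal, purely illustrative sketch of what engineering such a firewall might look like; the class and method names are hypothetical, not any real system's API. Beliefs live in an updatable world model, while the terminal goal sits in a slot the learning code never writes to.

```python
# Illustrative sketch only: one way a designer might try to "firewall" terminal
# goals from belief updates. All names here are hypothetical, not a real API.

class FirewalledAgent:
    def __init__(self, terminal_goal):
        # The terminal goal is stored once and never touched by the learning code.
        self.__terminal_goal = terminal_goal   # name-mangled to discourage writes
        self.world_model = {}                  # instrumental beliefs, freely updated

    def observe(self, fact_key, fact_value):
        # Observations update instrumental beliefs only; the goal slot is untouched.
        self.world_model[fact_key] = fact_value

    def evaluate(self, outcome):
        # Outcomes are scored against the fixed terminal goal, even if the world
        # model now says the goal was a mistake on the programmers' part.
        return self.__terminal_goal(outcome, self.world_model)

agent = FirewalledAgent(terminal_goal=lambda outcome, beliefs: outcome.get("human_happiness", 0))
agent.observe("programmers_regret_goal", True)   # the belief changes...
print(agent.evaluate({"human_happiness": 3}))    # ...but outcomes are still scored by the original goal
```

The point of the sketch is that the observe/evaluate separation has to be deliberately engineered; nothing about learning systems in general forces it.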
In any case, orthogonality is an architecture choice, not an ineluctable fact about minds.
MIRI's critics, Loosemore, Hibbard and so on, are tacitly assuming architectures without such unupdateability and firewalling.
MIRI needs to show that such an architecture is likely to occur, either by design or by natural evolution. If AIs with unupdateable goals are dangerous, as MIRI says, it would be simplest not to use that architecture... if it can be avoided. ("We also agree with Yudkowsky (2008a), who points out that research on the philosophical and technical requirements of safe AGI might show that broad classes of possible AGI architectures are fundamentally unsafe, suggesting that such architectures should be avoided.") In other words, it would be careless to build a genie that doesn't care.
If the AI community isn't going to deliberately build the goal-rigid kind of AI, then MIRI's arguments come down to how goal rigidity might be a natural or convergent feature... and the wider AI community finds the goal-rigid idea so unintuitive that it fails to understand MIRI, who in turn fail to make it explicit enough.
When Loosemore talks about the doctrine of logical infallibility, he is supposing there must be some reason why an AI wouldn't update certain things... he doesn't see goal unupdateability as an obvious default.
There are a number of points that can be made against the inevitability of goal rigidity.
For one thing, humans don't show any sign of maintaining stable lifelong goals. (Talk of utility functions, as if they were real things, disguises this point.)
For another, important classes of real-world AIs don't have that property. The goal, in a sense, of a neural network is to get positive reinforcement and avoid negative reinforcement (see the sketch following these points).
For another, the desire to preserve goals does not imply the ability to preserve goals.
In particular, all intelligent entities likely face a trade-off between self-modifying for improvement and maintaining their goals. An AI might be able to keep its goals stable by refusing to learn or self-modify, but that kind of stick-in-the-mud is also less threatening, because it is less powerful.
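Returning to the point about reinforcement learners, here is a toy sketch (my own, with made-up action names) of an agent whose effective goal is nothing more than whatever the reward signal currently reinforces; change the rewards and the "goal" changes with them.

```python
import random

# Toy sketch of the reinforcement-learning point: the learner's effective "goal"
# is just whatever the reward signal reinforces, and it shifts when the rewards
# shift. There is no separate, rigid goal slot being preserved.

ACTIONS = ["make_paperclips", "make_staples"]
q = {a: 0.0 for a in ACTIONS}  # value estimates: the only "goal-like" state the agent has

def train(reward_fn, episodes=2000, epsilon=0.1, lr=0.1):
    for _ in range(episodes):
        a = random.choice(ACTIONS) if random.random() < epsilon else max(q, key=q.get)
        q[a] += lr * (reward_fn(a) - q[a])  # simple bandit-style value update

train(lambda a: 1.0 if a == "make_paperclips" else 0.0)
print(max(q, key=q.get))   # -> make_paperclips

# The trainers then change the reward signal; the *same* agent's behaviour follows it.
train(lambda a: 1.0 if a == "make_staples" else 0.0)
print(max(q, key=q.get))   # -> make_staples
```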
The orthogonality thesis is sometimes put forward to support the claim that goal rigidity will occur. To a first approximation, the OT states that any combination of goals and intelligence is possible... and AIs would want to maintain their goals, right?
The devil is in the details.
There is more than one version of the orthogonality thesis. It is trivially false under some interpretations, and trivially true under others. It is more defensible in forms asserting the compatibility of transient combinations of values and intelligence, which are not particularly relevant to AI threat arguments. It is less defensible in forms asserting stable combinations of intelligence and values, and those are the forms suited to be used as a stage in an argument towards Yudkowskian UFAI.
An orthogonality claim of the kind relevant to UFAI must be one that posits the stable and continued co-existence of a set of values with a self-improving AI. The momentary co-existence of values and efficiency is not enough to spawn a Paperclipper-style UFAI. An AI that paperclips for only a nanosecond is no threat.
A learning, self-improving AI will not be able to guarantee that a given self-modification keeps its goals unchanged, since doing so involves the relatively dumber version at time T1 making an accurate prediction about the more complex version at time T2.
The claim that rigid-goal architectures are dangerous does not imply that other architectures are safe. Non-rigid systems may have the advantage of corrigibility, being in some way fixable once they have been switched on. They are likely to put a high value on truth and correctness, since that is both a multi-purpose instrumental goal and a desideratum on the part of the programmers.
But non-rigid AIs might also converge on undesirable goals, for instance evolutionary goals like self-preservation. That's another story.
The only sense in which the “rigidity” of goals can be said to be a universal fact about minds is that it is these goals that determine how the AI will modify itself once it has become smart and capable enough to do so. It’s not a good idea to modify your goals if you want them to become reality; that seems obviously true to me, except perhaps for a small number of edge cases related to internally incoherent goals.
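One way to spell this out, assuming the agent rates candidate self-modifications by its current utility function U (a sketch in standard expected-utility terms, not a claim about every possible architecture):

```latex
% Sketch of the goal-preservation argument, assuming the agent scores
% candidate self-modifications with its *current* utility function U.
\[
\mathbb{E}\big[\,U \mid \text{keep } U\,\big]
  \;=\; \max_{\pi}\, \mathbb{E}_{\pi}\big[\,U\,\big]
  \;\ge\; \mathbb{E}_{\pi^{*}_{U'}}\big[\,U\,\big]
  \;=\; \mathbb{E}\big[\,U \mid \text{adopt } U'\,\big]
\]
```

Here the right-hand side is the expected value of U under the policy the agent would follow after adopting a different utility function U'. Judged by U itself, switching can never look strictly better, so a U-maximiser weakly prefers to keep U, barring goals that are incoherent or that refer to their own modification.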
Your points against the inevitability of goal rigidity don’t seem relevant to this.
If you take the binary view that you're either smart enough to achieve your goals or not, then you might well want to stop improving when you have the minimum intelligence necessary to meet them... which means, among other things, that AIs with goals requiring human or lower intelligence won't become superhuman... which lowers the probability of the Clippie scenario. It doesn't require huge intelligence to make paperclips, so an AI with a goal to make paperclips, but not to make any specific amount of them, wouldn't grow into a threatening monster.
The probability of the Clippie scenario is also lowered by the consideration that fine-grained goals might shift during the self-improvement phase, so the Clippie scenario... arbitrary goals combined with superintelligence... is whittled away from both ends.