There is more than one version of the orthogonality thesis. It is trivially false under some interpretations and trivially true under others. The distinction matters because only some versions can serve as a stage in an argument towards Yudkowskian UFAI.
It is admitted from the outset that some versions of the OT are not logically possible, namely those that involve a Gödelian or Löbian contradiction.
It is also admitted that the standard OT does not deal with any dynamic or developmental aspects of agents. However, the UFAI argument is predicated on agents which have stable goals and the ability to self improve, so trajectories in mindspace are crucial.
Goal stability is not a given: it is not possessed by all mental architectures, and may not be possessed by any, since no one knows how to engineer it, and humans appear not to have it. It is plausible that an agent would desire to preserve its goals, but the desire to preserve goals does not imply the ability to preserve goals. As far as we can tell, no goal stable system of any complexity exists on this planet, so goal stability cannot be assumed as a default or a given.
Self improvement is likewise not a given, since the long and disappointing history of AGI research is largely a history of failure to achieve adequate self improvement. Algorithmspace is densely populated with non self improvers.
An orthogonality claim of a kind relevant to UFAI must be one that posits the stable and continued co-existence of an arbitrary set of values in a self improving AI. However, the version of the OT that is obviously true is one that asserts only the momentary co-existence of arbitrary values and an arbitrary level of intelligence.
We have stated that goal stability and self improvement, separately, may well be rare in mindspace. Furthermore, it is not clear that arbitrary values are compatible with long term self improvement as a combination: a learning, self improving AI will not be able to guarantee that a given self modification keeps its goals unchanged, since doing so involves the relatively dumber version at time T1 making an accurate prediction about the more complex version at time T2. This has been formalised as the Löbian obstacle: a formal system cannot establish the soundness of a system as powerful as, or more powerful than, itself.
From Squark's article:
http://lesswrong.com/lw/jw7/overcoming_the_loebian_obstacle_using_evidence/
“Suppose you’re trying to build a self-modifying AGI called “Lucy”. Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy2. How can Lucy decide that becoming Lucy2 is a good idea? Well, a good step in this direction would be proving that Lucy2 would only take actions that are “good”. I.e., we would like Lucy to reason as follows “Lucy2 uses the same formal system as I, so if she decides to take action a, it’s because she has a proof p of the sentence s(a) that ‘a increases expected utility’. Since such a proof exits, a does increase expected utility, which is good news!” Problem: Lucy is using L in there, applied to her own formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn’t make its formal system weaker. As another example where this poses a problem, suppose Lucy observes another agent called “Kurt”. Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem s, but she doesn’t know how. We would like Lucy to be able to conclude s is, in fact, true (at least with the probability that her model of physical reality is correct). ”
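The obstacle the quote gestures at is Löb's theorem. As a minimal sketch in standard provability notation (the framing below is mine, not Squark's; □P abbreviates "P is provable in Lucy's formal system"):

```latex
% Löb's theorem, for any sentence P:
%   if the system proves (□P → P), then it proves P.
\[ \Box(\Box P \rightarrow P) \;\rightarrow\; \Box P \]

% The soundness principle Lucy would like to apply to Lucy2's or Kurt's proofs
% is the schema  □s(a) → s(a)  ("if a proof of s(a) exists, s(a) is true").
% By Löb's theorem, a consistent system that proved this schema for arbitrary
% sentences would thereby prove those sentences outright, so Lucy cannot use
% her own proof machinery to certify a system at least as strong as herself.
```

This is the T1/T2 gap described above: the proof-based route to goal stable self modification is blocked unless something weaker than full proof is accepted.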
Squark thinks that goal stable self improvement can be rescued by probabilistic reasoning. I would rather explore the consequences of goal instability.
An AI that opts for goal stability over self improvement will probably not become smart enough to be dangerous.
An AI that opts for self improvement over goal stability might visit paperclipping, or any of a large number of other goals, on its random walk. However, paperclippers aren't dangerous unless they are fairly stable paperclippers. An AI that paperclips for a short time is no threat: the low hanging fruit is just to buy paperclips, or make them out of steel.
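As a purely illustrative toy model (my own sketch, not anything claimed in the argument above), suppose a goal-unstable agent re-samples its goal uniformly at random from a large pool after each self modification; the point is only that its stretches of consecutive paperclipping are very short:

```python
import random

# Toy model (assumptions mine): a goal-unstable agent re-rolls its goal
# uniformly at random from N_GOALS candidates after every self-modification,
# one modification per time step.
N_GOALS = 10_000
STEPS = 100_000
PAPERCLIP = 0  # arbitrary label for the paperclip-maximising goal

def longest_paperclip_run(steps: int, n_goals: int, seed: int = 0) -> int:
    """Longest consecutive run of steps spent on the paperclip goal."""
    rng = random.Random(seed)
    longest = current = 0
    for _ in range(steps):
        if rng.randrange(n_goals) == PAPERCLIP:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

if __name__ == "__main__":
    # With 10,000 candidate goals and 100,000 modifications, the longest
    # stretch of consecutive paperclipping is almost always a single step.
    print(longest_paperclip_run(STEPS, N_GOALS))
```

The pool size, the re-sampling rule and the one-goal-per-step assumption are all arbitrary; the sketch only illustrates why a transient paperclipper is a different kind of object from a stable one.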
Would an AI evolve into goal stability? Something as arbitrary as paperclipping is a very poor candidate for an attractor. The good candidates are quasi evolutionary goals that promote survival and reproduction. That doesn't strongly imply friendliness, but inasmuch as it implies unfriendliness, it implies a kind we are familiar with: being outcompeted for resources by entities with a drive for survival, not the alien, Lovecraftian horror of the paperclipper scenario.
(To backtrack a little: I am not arguing that goal instability is particularly likely. I can’t quantify the proportion of AIs that will opt for the conservative approach of not self modifying).
Goal stability is a prerequisite for MIRI's favoured method of achieving AI safety, but it is also a prerequisite for MIRI's favourite example of unsafe AI, the paperclipper, so its loss does not appear to make AI more dangerous.
If goal stability is unavailable to AIs, or at least to the potentially dangerous ones (we don't have to worry too much about the non-improvers), then the standard MIRI solution of solving friendliness, and coding it in as unupdateable goals, is unavailable. That is not entirely bad news, as the approach based on rigid goals is quite problematic. It entails having to get something exactly right the first time, which is not a situation you want to be in if you can avoid it, particularly when the stakes are so high.