Seems like X is (or includes) the ability to think about self-modification: awareness of its own internal details and modelling their possible changes.
Note that without this ability the tool could invent a plan which leads to its own accidental destruction (and thus possibly to the plan never being completed), because it does not realize it could be destroyed or damaged.

An agent can also accidentally pursue a plan which leads to its self-destruction. People do it now and then by not modelling the world well enough.
I think of agents as having goals and pursuing them by default. I don’t see how self-reflexive abilities (“think about self-modification: awareness of its own internal details and modelling their possible changes”) add up to goals. It might be intuitive that a self-aware entity would want to preserve its existence, but that intuition could be driven by anthropomorphism (or zoomorphism, or biomorphism).
With self-reflective abilities, the system can also consider paths to its goal that include self-modification. Some of those paths may be highly unintuitive for humans, so we wouldn’t notice some of the possible dangers. Self-modification may also remove some safety mechanisms.
A system that explores many paths can find solutions humans wouldn’t notice. Such “creativity” at the object level is relatively harmless. Google Maps may find you a more efficient path to work than the one you use now, but that’s okay. Maybe the path is wrong for reasons that Google Maps does not understand (e.g. it leads through a neighborhood with high crime), but at least on a general level you understand that this is the risk of following the outputs blindly. However, similar “creativity” at the self-modification level can have unexpectedly serious consequences.
“the system can also”, “some of those paths may be”, “may also remove”. Those are some highly conditional statements. Quantify, please, or else this is no different than “the LHC may destroy us all with a mini black hole!”
I’d need a specific description of the system, what exactly it can do, and how exactly it can modify itself, to give you a specific example of self-modification that contributes to a specific goal in a perverse way.
I can invent an example, but then you can just say “okay, I wouldn’t use that specific system”.
As an example: Imagine that you have a machine with two modules (whatever they are) called Module-A and Module-B. Module-A is only useful for solving Type-A problems. Module-B is only useful for solving Type-B problems. At this moment you have a Type-A problem, and you ask the machine to solve it as cheaply as possible. The machine has no Type-B problem at the moment. So the machine decides to sell its Module-B on eBay, because it is not needed now, and the money gained will reduce the total cost of solving your problem. This is short-sighted, because tomorrow you may need to solve a Type-B problem. But the machine does not predict your future wishes.
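To make the short-sightedness concrete, here is a minimal toy sketch (my own illustration, not anything from the discussion above): a planner that minimizes only the cost of the current task ranks the “sell Module-B” plan higher, because tomorrow’s Type-B problem simply isn’t part of its objective. All names and numbers (Plan, MODULE_B_RESALE_VALUE, the cost figures) are hypothetical.

```python
# Toy sketch of the Module-A / Module-B story above. The planner's objective is
# exactly what the user asked for: minimize the cost of solving the current
# Type-A problem. Tomorrow's Type-B problem is not in the objective, so selling
# the "unused" module looks strictly better.

from dataclasses import dataclass

MODULE_B_RESALE_VALUE = 50   # assumed resale price on eBay

@dataclass
class Plan:
    description: str
    compute_cost: int        # cost of running Module-A on the Type-A problem
    sells_module_b: bool     # whether the plan liquidates Module-B

    def cost(self) -> int:
        # Cost of solving *this* problem only; resale income counts as a discount.
        discount = MODULE_B_RESALE_VALUE if self.sells_module_b else 0
        return self.compute_cost - discount

candidate_plans = [
    Plan("solve the Type-A problem with Module-A", compute_cost=100, sells_module_b=False),
    Plan("solve the Type-A problem with Module-A, sell Module-B", compute_cost=100, sells_module_b=True),
]

best = min(candidate_plans, key=Plan.cost)
print(best.description)  # picks the plan that sells Module-B: cheaper under the
                         # stated objective, short-sighted about tomorrow's needs
```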
I can invent an example, but then you can just say “okay, I wouldn’t use that specific system”.
But can’t you see, that’s entirely the point!
If you design systems whereby the Scary Idea has no more than a vanishing likelihood of occurring, it no longer becomes an active concern. It’s like saying “bridges won’t survive earthquakes! you are crazy and irresponsible to build a bridge in an area with earthquakes!” And then I design a bridge that can survive earthquakes smaller than magnitude X, where magnitude-X earthquakes are expected less than once in 10,000 years, and then on top of that throw on an extra safety margin of 20% because we have the extra steel available. Now how crazy and irresponsible is it?
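As a rough back-of-the-envelope illustration of that “vanishing likelihood” framing (my own sketch, not part of the original exchange; only the 1-in-10,000-years rate comes from the comment, the 75-year service life is an assumed figure):

```python
# Probability of at least one design-exceeding quake during the bridge's
# service life, treating years as independent.

annual_exceedance_probability = 1 / 10_000  # quake above magnitude X in any given year
service_life_years = 75                     # assumed bridge lifetime

p_exceeded = 1 - (1 - annual_exceedance_probability) ** service_life_years
print(f"{p_exceeded:.3%}")  # ~0.747%, before counting the extra 20% safety margin
```

Under these assumptions the lifetime exceedance probability stays well below one percent, which is the kind of quantified claim the earlier “Quantify, please” comment is asking for.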
If you design systems whereby the Scary Idea has no more than a vanishing likelihood of occurring, it no longer becomes an active concern.
Yeah, and the whole problem is how specifically you will do it.

If I (or anyone else) give you examples of what could go wrong, of course you can keep answering with “then I obviously wouldn’t use that design”. But at the end of the day, if you are going to build an AI, you have to settle on some design; just refusing designs given by other people will not do the job.
There are plenty of perfectly good designs out there, e.g. CogPrime + GOLUM. You could be calculating probabilistic risk based on these designs, rather than fear-mongering based on a naïve Bayes net optimizer.