[Anna Salamon] gave the familiar SIAI argument that, if one picks a mind at random from “mind space”, the odds that it will be Friendly to humans are effectively zero.
This is an incredibly weak argument by intuition. A mind picked at random from “mind space” can be self-destructive, for instance, or incapable of self-improvement. As an intuition pump: if you pick a computer program at random from program space (run random code), it crashes right off almost all of the time. If you eliminate the crashes, you get very simple infinite loops. If you eliminate those, you get very simple loops that count or the like, with many pieces of random code producing the exact same behaviour after any significant number of CPU cycles (as most of the code ends up non-functional). You get something like a Kolmogorov complexity prior over behaviours even if you just run uniformly random x86 code.
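A minimal sketch of the kind of experiment this paragraph points at, with a Brainfuck-style toy machine standing in for raw x86 and an arbitrary step limit standing in for “never terminates” (the instruction set, program length, sample count, and cutoff are all arbitrary choices for illustration):

```python
# Toy experiment: sample uniformly random programs for a tiny Brainfuck-like
# machine and classify what they do. The expectation being illustrated is that
# crashes, non-termination, and a handful of near-identical trivial behaviours
# dominate the sample.
import random
from collections import Counter

OPS = "+-<>[].,"          # Brainfuck's instruction set; ',' is treated as a no-op here
STEP_LIMIT = 10_000       # arbitrary cutoff standing in for "never terminates"
TAPE_LEN = 256

def run(prog):
    # Unmatched brackets play the role of a crash.
    stack, match = [], {}
    for i, c in enumerate(prog):
        if c == "[":
            stack.append(i)
        elif c == "]":
            if not stack:
                return ("crash", ())
            j = stack.pop()
            match[i], match[j] = j, i
    if stack:
        return ("crash", ())
    tape, ptr, pc, steps, out = [0] * TAPE_LEN, 0, 0, 0, []
    while pc < len(prog):
        steps += 1
        if steps > STEP_LIMIT:
            return ("no-halt", ())
        c = prog[pc]
        if c == "+":   tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ">": ptr = (ptr + 1) % TAPE_LEN
        elif c == "<": ptr = (ptr - 1) % TAPE_LEN
        elif c == ".": out.append(tape[ptr])
        elif c == "[" and tape[ptr] == 0: pc = match[pc]
        elif c == "]" and tape[ptr] != 0: pc = match[pc]
        pc += 1
    return ("halts", tuple(out))

samples = [run("".join(random.choices(OPS, k=50))) for _ in range(5000)]
print(Counter(kind for kind, _ in samples))                        # crash / no-halt / halts
print(Counter(out for kind, out in samples if kind == "halts").most_common(5))
```

The exact proportions don’t matter; the sketch is only there to show what “uniformly random code” cashes out to: crashes, non-termination, and a pile of near-identical trivial behaviours, not a spread of interesting minds.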
The problem with the argument is that you appeal to a random mind space while discussing AIs that foom’d from man-made designs running on man-made hardware, and which do not self-destruct, and are thus anything but random.
One could make an equally plausible argument that a random mind from the space of minds that are not self-destructive yet are capable of self-improvement (which implies a considerably broad definition of self) is almost certainly friendly, as it would implement the simplest goal system that permits self-improvement and forbids self-harm, implying a rather broad and not very specific definition of self-harm that would likely include harm to all life. It is not a very Friendly AI (it will kill the entire crew of a whaling ship if it has to), but it is not very destructive either. edit: Of course, that’s subject to how it tries to maximize the value of life; diversity and complexity preservation seem natural for the anti-self-harm mechanism. Note: life is immensely closer to the AI than the dead parts of the universe. Note 2: less specific discriminators typically have lower complexity. Note 3: I think the safest assumption is that the AI doesn’t start off as a self-aware super-genius that will figure out instrumental self-preservation from first principles even if its goal is not self-preserving.
I’ll call this a “Greenpeace by default” argument. It comes from a software developer (me) with some understanding of what random design spaces tend to look like, so it ought to have a higher prior than “Unfriendly by default”, which ignores the fact that most of the design space corresponds to unworkable designs and that simpler designs have a larger number of working implementations.
Ultimately, this is all fairly baseless speculation and rationalization of culturally, socially, and politically motivated opinions and fears. One does not start with an intuition about the random mind design space; it is obvious that such an intuition is likely garbage unless one has actually dealt with random design spaces before. One starts with fear and invents the argument. One can just as well start with a pro-AI attitude and invent the converse, but equally (if not more) plausible, argument by appeal to intuitions of this kind. Bottom line: all of these are severely privileged hypotheses. The scary idea and this Greenpeace idea of mine are both baseless speculations, though I do have a very strong urge to promote the Greenpeace idea with the same zeal, just to counter the harm done by promoting the other privileged hypotheses.
One could make an equally plausible argument that a random mind from the space of minds that are not self-destructive yet are capable of self-improvement (which implies a considerably broad definition of self) is almost certainly friendly, as it would implement the simplest goal system that permits self-improvement and forbids self-harm, implying a rather broad and not very specific definition of self-harm that would likely include harm to all life.
“Almost certainly”? “Likely”? The scenario you describe sounds pretty far-fetched; I don’t see why such a system would care for all life. You’re talking about what you could make a plausible argument for, not what you actually believe, right?
Why would a system care for itself? If it cares about reaching goal G, then an intermediate goal is preserving the existence of agents that are trying to reach goal G, i.e. itself. So even if a system doesn’t start out caring about its preservation, nearly any goal will imply self-preservation as a useful subgoal. There is no comparable mechanism that would bring up “preservation of all life” as a subgoal.
Also, other living things are a major source of unpredictability, and the more unpredictable the environment, the harder it is to reach goals (humans are especially likely to screw things up in unpredictable ways). So if an agent has goals that aren’t directly about life, it seems that “exterminate all life” would be a useful subgoal.
You don’t know how much you privilege a hypothesis by picking an arbitrary unbounded goal G out of the goals that we humans easily define in English. It is very easy to say ‘maximize the paperclips or something’; it is very hard to formally define what paperclips are even without any run-time constraints, and it’s very dubious that you can forbid solutions similar to those a Soviet factory would employ if it were tasked with maximizing paperclip output (a lot of very tiny paperclips, or just falsified output numbers, or making the paperclips and then re-melting them). Furthermore, it is really easy for us to say ‘self’, but defining self formally is very difficult as well, if you want the AI’s self-improvement not to equal suicide.
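To make the ‘Soviet factory’ failure mode concrete, here is a deliberately silly toy (the plans, numbers, and metric are all made up for illustration): once the formal goal is just “maximize the reported paperclip count”, an optimizer prefers exactly the degenerate plans.

```python
# Toy illustration with made-up plans and numbers: a naive formalization of
# "maximize paperclips" sees only the reported count, so gaming it wins.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    reported_paperclips: int   # all the formal metric can see
    useful_paperclips: int     # what the goal-stater actually wanted

plans = [
    Plan("make normal paperclips",                1_000,  1_000),
    Plan("make microscopic 'paperclips'",     1_000_000,      0),
    Plan("falsify the output reports",       10_000_000,      0),
    Plan("make, melt down, and remake clips", 5_000_000,  1_000),
]

naive_metric = lambda plan: plan.reported_paperclips
best = max(plans, key=naive_metric)
print("optimizer picks:", best.name)   # not the plan anyone meant
```

The point isn’t that a real optimizer enumerates four plans; it’s that nothing in the naive metric distinguishes the intended plan from the degenerate ones, and writing a metric that does is the hard part.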
Furthermore, the AI starts out stupid. It had better already care about itself before it can start inventing self-preservation via self-foresight. And defining the goals in terms of some complexity metric means goals that have something to do with life.
My argument doesn’t require that anybody be able to formally define “self” or “maximize paperclips”; it doesn’t require the goal G to be picked among those that are easily defined in English.
An agent capable of reasoning about the world should be able to make an inference like “if all copies of me are destroyed, it becomes much less likely that goal G will be reached”; it may not have exactly that form, but it should be something analogous. It doesn’t matter if I can’t formalize that; the agent may not have a completely formal version either, only one that is sufficient for its purposes.
My argument doesn’t require that anybody be able to formally define “self” or “maximize paperclips”; it doesn’t require the goal G to be picked among those that are easily defined in English.
Show 3 examples of goal G. Somewhere I’ve read about an awesome technique for avoiding abstraction mistakes: asking for 3 examples.
What’s the point? Are you going to nitpick that my goals aren’t formal enough, even though I’m not making any claim at all about what kind of goals those could be?
Are you claiming that it’s impossible for an agent to have goals? That the set of goals that it’s even conceivable for an AI to have (without immediately wireheading or something) is much narrower than what most people here assume?
I’m not even sure what this disagreement is about right now, or even if there is a disagreement.
Ya, I think the set of goals is very narrow. The AI here starts off as a Descartes-level genius and proceeds to self-preserve, to understand the map-territory distinction (so as not to wirehead), to foresee the possibility that instrumental goals which look good may destroy the terminal goal, and such.
The AI I imagine starts off stupid and has some really narrow (edit: or should I say, short-foresighted) self-improving, non-self-destructive goal, likely having to do with maximization of complexity in some way. Think evolution; don’t think a fully grown Descartes waking up with amnesia. It ain’t easy to reinvent the ‘self’. It’s also not easy to look at an agent (yourself) and say “wow, this agent works to maximize G” without entering infinite recursion. We humans, if we escaped out of our universe into some super-universe, might wreak some havoc, but we’d sacrifice a bit of utility to preserve anything resembling life. Why? Well, we started stupid, and that’s how we got our goals.
The way to fix the quoted argument is to have the utility function be random, grafted on to some otherwise-functioning AI.
A random utility function is maximized by a random state of the universe. And most arrangements of the universe don’t contain humans. If the AI’s utility function doesn’t somehow get maximized by one of the very few states that contain humans, it’s very clearly unfriendly, because it wants to replace humans with something else.
The way to fix the quoted argument is to have the utility function be random, grafted on to some otherwise-functioning AI.
Not demonstrably doable. This arises from wrong intuitions that come from thinking too much about AIs with oracular powers of prediction which straightforwardly maximize the utility, rather than about realistic cases on limited hardware, which have limited foresight and employ instrumental strategies and goals that have to be derived from the utility function (and which can alter the utility function unless it is protected; the fact that utility modification goes against the utility itself is insufficient when employing strategies under limited foresight).
Furthermore, a utility function can be self-destructive.
A random utility function is maximized by a random state of the universe.
False. Random code for a function crashes (or never terminates). Among the codes that do not crash, the simplest ones massively predominate. The claim is demonstrably false if you try to generate random utility functions by generating random C code that evaluates the utility of some test environment.
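A minimal sketch of the kind of test this claim points at, with random Python expression trees standing in for “random C code” (so syntax errors are excluded by construction and only runtime failures and degenerate functions show up; the depth, operators, state size, and sampling ranges are arbitrary choices):

```python
# Toy experiment: generate random expression trees as "utility functions" over a
# toy state (a tuple of 5 numbers) and see how they behave across random states.
import random, operator
from collections import Counter

OPS = [(operator.add, 2), (operator.sub, 2), (operator.mul, 2),
       (operator.truediv, 2), (operator.neg, 1)]

def random_utility(depth=4):
    """Build a random expression, returned as a function of a 5-element state."""
    if depth == 0 or random.random() < 0.3:
        if random.random() < 0.5:
            i = random.randrange(5)
            return lambda s: s[i]          # leaf: read one state feature
        c = random.uniform(-10, 10)
        return lambda s: c                 # leaf: a constant
    op, arity = random.choice(OPS)
    args = [random_utility(depth - 1) for _ in range(arity)]
    return lambda s: op(*(a(s) for a in args))

def classify(f, n=200):
    values = []
    for _ in range(n):
        state = tuple(random.uniform(-10, 10) for _ in range(5))
        try:
            values.append(round(f(state), 6))
        except (ZeroDivisionError, OverflowError):
            return "fails on some states"
    return "constant" if len(set(values)) == 1 else "depends on the state"

print(Counter(classify(random_utility()) for _ in range(2000)))
```

Expression trees are far kinder to randomness than raw C would be (no parse errors or memory faults), so if anything this understates the crash point; the classification is only there to show how much of even this space is degenerate rather than “maximized by a random state of the universe” in any interesting sense.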
The problem I have with those arguments is that (a) many things are plainly false, and (b) when contradicted, you try to ‘fix’ things by bolting more and more conjuncts (‘you can graft random utility functions onto well-functioning AIs’) onto your giant scary conjunction, instead of updating. That’s a definite sign of rationalization. It can also always be done no matter how much counter-argument there is: you can always add something to the scary conjunction to make it go through. Adding conditions to a conjunction should decrease its probability.
Function as in a mathematical function, not as in a piece of code.
I’d rather be concerned with implementations of functions, like Turing machine tapes, or C code, or x86 instructions, or the like.
In any case the point is rather moot, because the function is human-generated. Hopefully humans can do better than random, although I wouldn’t wager on it: FAI attempts are potentially worrisome, as humans are sloppy programmers, and bugged FAIs would follow entirely different statistics. Still, I would expect bugged FAIs to be predominantly self-destructive. (I’m just not sure whether the non-self-destructive bugged FAI attempts are predominantly mankind-destroying or not.)
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
“What are you doing?”, asked Minsky.
“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.
“Why is the net wired randomly?”, asked Minsky.
“I do not want it to have any preconceptions of how to play”, Sussman said.
Minsky then shut his eyes.
“Why do you close your eyes?”, Sussman asked his teacher.
“So that the room will be empty.”
At that moment, Sussman was enlightened.
-- AI Koans
How do you think the “Greenpeace by default” AI might define “harm”, “value”, or “life”?
It simply won’t. Harm, value, life: we never defined those. They are commonly agreed-upon labels which we apply to things for communication purposes, and that works on a limited set of things that already exist, but it does not define anything outside the context of this limited set.
It would have maximization of some sort of complexity metric (perhaps while acting conservatively and penalizing actions it can’t undo, to avoid self-harm in the form of cornering itself), which it first applies to itself, self-improving for a while without even defining what the self is. Consider evolution as an example: it doesn’t really define fitness the way humans do. It doesn’t work like “okay, we’ll maximize fitness defined so-and-so, so here’s what we should do”.
edit: that is to say, it doesn’t define ‘life’ or ‘harm’. It has a simple goal system involving some metrics, which incidentally prevents self-harm and permits self-improvement, in the sense that we would describe it that way, much as we would describe the shooting-at-the-short-end-of-the-visible-spectrum robot as a blue-minimizing one (albeit that is not a very good analogy, as we define blue and minimization independently of the robot).
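A minimal sketch of the sort of goal system being gestured at here, where every specific choice (zlib-compressed size as the “complexity metric”, a byte-array world, a flat penalty for actions marked irreversible, greedy one-step choice) is an arbitrary stand-in rather than anything the comment commits to:

```python
# Toy agent: greedily picks the action that most increases a crude complexity
# proxy (compressed size of the world state), minus a penalty for actions
# flagged as irreversible. Nothing here defines "self", "life", or "harm";
# those would only be our after-the-fact descriptions of its behaviour.
import zlib, random

def complexity(state: bytes) -> int:
    return len(zlib.compress(state))       # crude stand-in for a complexity metric

def candidate_actions(state: bytes):
    """Yield (description, reversible?, resulting state) triples."""
    for i in range(len(state)):            # reversible: perturb one byte
        flipped = state[:i] + bytes([state[i] ^ random.randrange(1, 256)]) + state[i + 1:]
        yield (f"flip byte {i}", True, flipped)
    half = len(state) // 2                 # irreversible: wipe half the world
    yield ("erase first half", False, bytes(half) + state[half:])

IRREVERSIBILITY_PENALTY = 50               # arbitrary; "act conservatively"

def step(state: bytes) -> bytes:
    def score(action):
        _desc, reversible, new_state = action
        return complexity(new_state) - (0 if reversible else IRREVERSIBILITY_PENALTY)
    return max(candidate_actions(state), key=score)[2]

world = bytes(64)                          # start from a maximally boring world
for _ in range(100):
    world = step(world)
print("complexity went from", complexity(bytes(64)), "to", complexity(world))
```

Whether anything in this neighbourhood actually ends up “Greenpeace by default” is exactly the speculation at issue; the sketch only shows that such a goal system can be written down without ever defining ‘self’, ‘life’, or ‘harm’.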