Experiments require sensors of some kind. I’m no programmer, but it seems prima facie that we could prevent it from sensing anything that had any information-theoretic possibility of furnishing dangerous information (although such extreme data starvation might hinder the evolution process).
Well, I was talking about running experiments on its own thought processes, in order to reverse-engineer its own source code. Even locked in a fully virtual world, if it can so much as observe its own actions, then it can infer its thought process, its general algorithms, the [evolutionary or mental] process that led to it, and more than a few bits about its creators.
And if you are trying to wall off the AI from information about its own thought process, then you’re working on a sandbox in a sandbox, which is just a sign that the idea for the first sandbox was flawed anyway.
I will admit that my mind runs away screaming from the difficulty of making something that really doesn’t get any input, not even from its own thought process, but is superintelligent and can be made useful. Right now it sounds harder than FAI to me, and not reliably safe, but that might just be my own unfamiliarity with the problem.
Huge warning signs in all directions here. Will think more later.
Give it the ZFC axioms and a few definitions and it can derive all the pure math results we’d ever need
Even if we could avoid needing to give it a direction to take research, and it didn’t leap immediately to things too complex for us to understand, there would still be problems.
How do you get it to actually do the work? If you build in an intrinsic motivation that you know is right, then why aren’t you going right to FAI? If it wants something else and you’re coercing it with reward, then it will try to figure out how to really maximize its reward. If it has no information…
Would an AI necessarily have motivations, or is that a special characteristic of gene-based lifeforms that evolved in a world where lack of reproduction and survival instincts is a one-way ticket to oblivion?
If we evolved superintelligent neural nets, they’d have some kind of motivation. They wouldn’t want food or sex, but they’d want whatever their ancestors wanted that led them to do the thing that scored higher than the rest on the fitness function. (Which is at least twice removed from anything we would want.)
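To make the “twice removed” point concrete, here is a minimal sketch of the kind of selection loop I have in mind (the genome encoding and the fitness function are made up purely for illustration). What we want is one step removed from the proxy score the fitness function actually computes, and whatever internal drives the winning genomes end up encoding are another step removed from that; selection itself never looks at either.

```python
# Toy evolutionary loop: selection only ever compares a numeric proxy score.
# (Illustrative only; the encoding and fitness function are invented here.)
import random

def fitness(genome):
    # Stand-in proxy: what we *want* is not referenced anywhere in the loop.
    return sum(genome)

def mutate(genome, rate=0.05):
    return [(1 - g) if random.random() < rate else g for g in genome]

population = [[random.randint(0, 1) for _ in range(64)] for _ in range(50)]

for generation in range(200):
    # Rank by proxy score; the "drives" inside the winning genomes are never
    # inspected, only their score.
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    population = [mutate(random.choice(parents)) for _ in range(50)]

print("best proxy score:", max(fitness(g) for g in population))
```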
I’m not sure I get the bit about your dog cloning you. I agree that we shouldn’t try to dictate in detail what an FAI is supposed to want, but we do need [near] perfect control over what an AI wants in order to make it friendly, or even to keep it on a defined “safe” task.
I’m imagining the AI manipulating the text output on the terminal just right so as to mold the air/dust particles near the monitor into a self-replicating nano-machine (etc.).
I will admit that my mind runs away screaming from the difficulty of making something that really doesn’t get any input, not even from its own thought process, but is superintelligent and can be made useful.
I guess my logic is leading to a non-self-aware super-general-purpose “brain” that does whatever we tell it to. Perhaps there is a reason why all sufficiently intelligent programs would necessarily become self-aware, but I haven’t heard it yet. If we could somehow suppress self-awareness (what that really means for a program I don’t know) while successfully ordering the program to modify itself (or a copy of itself), it seems the AI could still go FOOM into just a super-useful non-conscious servant. Of course, that still leaves the LiteralGenie problem.
leap immediately to things too complex for us to understand
That could indeed be a problem. Given you’re talking to a sufficiently intelligent being, if you stated the ZFC axioms and a few definitions, and then stated the Stone-Weierstrass theorem, it would say, “You already told me that” or “That’s redundant.”
Perhaps have it output every step in its thought process, every instance of modus ponens, etc. Since there is a floor on the level of logical simplicity of a step in a proof, we could just have it default to maximum verbosity and the proofs would still not be ridiculously long (or maybe they would be—it might choose extremely roundabout proofs just because it can).
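As a toy sketch of what “maximum verbosity” might look like, here is a tiny forward-chaining propositional prover that prints every modus ponens step it takes (the rule format and names are my own invention, not any real proof assistant’s API):

```python
# Toy forward-chaining prover that logs every modus ponens step.
# Facts are strings; rules are (premise, conclusion) pairs. Illustrative only.
facts = {"A", "B"}
rules = [("A", "C"), ("C", "D"), ("B", "E"), ("D", "F")]

step = 0
changed = True
while changed:
    changed = False
    for premise, conclusion in rules:
        if premise in facts and conclusion not in facts:
            step += 1
            # Every inference is printed, however trivial.
            print(f"step {step}: modus ponens on ({premise} -> {conclusion}); "
                  f"{premise} holds, therefore {conclusion}")
            facts.add(conclusion)
            changed = True

print("derived facts:", sorted(facts))
```

Even at this verbosity, the number of printed steps only grows with the length of the derivation, which is the sense in which the floor on step simplicity keeps the output from blowing up arbitrarily.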
they’d want whatever their ancestors wanted that led them to do the thing that scored higher than the rest on the fitness function.
Maybe I’m missing something, but it seems a neural net could just do certain things with high probability without having motivation. That is, it could have tendencies but no motivations. Whether this is a meaningful distinction perhaps hinges on the issue of self-awareness.
The point I was trying to get at with the dog example is that if you control all the factors that motivate an entity at the outset, it simply has no incentive to try to change its motivations, no matter how smart it may get. There’s no clever workaround, because it just doesn’t care. I agree that if we want to make a self-aware AI friendly in any meaningful sense, we have to have control (I think it may have to be perfect control) over what motivates it. But I’m not yet convinced we can’t usefully box it, and I’d like to see an argument that we really need self-awareness to achieve AI FOOM. (Or just a precise definition of “self-awareness”—this will surely be necessary; perhaps Eliezer has defined it somewhere.)
OK, some backstory on my thought process. For a while now I’ve played with the idea of treating optimization in general as the management of failure. Evolution fails a lot, gradually builds up solutions that fail less, but never really ‘learns’ from its failures.
Failure management involves catching/mitigating errors as early as possible, and constructing methods to create solutions that are unlikely to be failures. If I get the idea to make auto tires out of concrete, I’m smart if I see immediately that it’s a bad idea, less smart if I only see it after doing extensive calculations, and dumb if I only see it after an experiment; but I’d be smarter still if I had come up with a proper material right away.
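Here is a rough sketch of that ordering (the check names, costs, and thresholds are all made up): run the checks from cheapest to most expensive and reject a design at the first one that fails.

```python
# Sketch of "catch failures as early and cheaply as possible": checks run in
# order of increasing cost, and the first rejection ends the evaluation.
def heuristic_check(design):          # nearly free: "concrete is brittle"
    return design["material"] not in {"concrete", "glass"}

def calculation_check(design):        # cheap-ish: rough stiffness estimate
    return design["elastic_modulus_gpa"] < 10   # a tire needs a compliant material

def experiment_check(design):         # expensive: build and test a prototype
    print("building prototype...")
    return True                       # placeholder for an actual test

def evaluate(design):
    for cost, check in [(1, heuristic_check),
                        (10, calculation_check),
                        (1000, experiment_check)]:
        if not check(design):
            return f"rejected at cost {cost} by {check.__name__}"
    return "accepted"

print(evaluate({"material": "concrete", "elastic_modulus_gpa": 30}))
# -> rejected at cost 1 by heuristic_check (the "smart" outcome)
```

Being smarter still, in the sense above, would amount to never generating the concrete-tire design in the first place, which no after-the-fact check captures.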
But I’m pretty sure that a thing that can do stuff right the first time can only come about as the result of a process that has already made some errors. You can’t get rid of mistakes entirely, as they are required for learning. I think “self-awareness” is sometimes a label for one or more features that, among other things, serve to catch errors early and repair the faulty thought process.
So if a superintelligence were trying to build a machine in a simulation of our physics and some spinning part flew to bits, it would trace that fault back through the physics engine to determine how to make the part better. Likewise, something needs to trace back the thought process that led to the bad idea and see where it could be repaired. This is where learning and self-modification are kind of the same thing.
(And on self-modification: if it’s really smart, then it could build an AI from scratch without knowing anything in particular about itself. In this situation, the failure management is pre-emptive: it thinks about how the program it is writing would work, and the places where it would go wrong.)
I think “self-awareness” is sometimes a label for one or more features that, among other things, serve to catch errors early and repair the faulty thought process.
Interesting. I thought about this for a while just now, and it occurred to me that self-awareness may just be “having a mental model of oneself.” To be able to model oneself, one needs the general ability to make mental models. To do that requires the ability to recognize patterns at all levels of abstraction in what one is experiencing. To explain this, I need to clarify what “level of abstraction” means. I will try to do this by example.
A creature is hunting and he discovers that white rabbits taste good. Later he sees a gray rabbit for the first time. The creature’s neural net tells him that it’s a 98% match with the white rabbit, so probably also tasty. But let’s say the gray rabbit turns out to taste bad. The creature has recognized the concrete patterns: 1. White rabbits taste good. 2. Gray rabbits taste bad.
Next week, he tries catching and eating a white bird, and it tastes good. Later he sees a gray bird. To assign any higher probability to the gray bird tasting bad, it seems the creature would have to recognize the abstract pattern: 3. Gray animals taste bad. (Of course it could also just be a negative or bad-tasting association with the color gray, but let’s suppose not—for that possibility could surely be avoided by making the example more complicated.)
Now “animal” is more abstract than “white rabbit” because there’s at least some kind of archetypal white rabbit one can visualize clearly (I’ll assume the creature is conceptualizing in the visual modality for simplicity’s sake).
“Rabbit” (remember that for all the creature knows, this simply means the union of the set “white rabbits” with the set “gray rabbits”) by itself is a tad more abstract, because to visualize it you’d have to see that archetypal rabbit but perhaps with the fur color switching back and forth between gray and white in your mind’s eye.
“Animal” is still more abstract, because to visualize it you’d have to, for instance, see a raccoon, a dog, and a tiger, and something that signals to you something like “etc.” (Naturally, if the creature’s method of conceptualization made visualizing “animal” easier than “rabbit”, “animal” would have the lower level of abstraction for him, and “rabbit” the higher—it all depends on the creature’s modeling methods.)
Now the creature has a mental model. If the model happens to be purely visual, it might look like a Venn diagram: a big circle labeled “animals”, two smaller patches within that circle that overlap with the “white things” circle and the “gray things” circle, and another outside region labeled “bad-tasting things” that sweeps in to encircle “gray animals” but not “white animals.”
The creature might revise that model after it tries eating the gray bird, but for now it’s the prediction model he’s using to determine how much energy to expend on hunting the gray bird in his sights. The model has revisable parts and predictive power, so I would call it a serviceable model—whether or not it’s accurate at this point.
Since the creature can make mental models like this, making a mental model of himself seems within his grasp. Then we could call the creature “self-aware.” The way it would trace back the thought process that led to a bad idea would be to recognize that the mental model has a flaw—i.e., a failed prediction—and make the necessary changes.
For instance, right now the creature’s mental model predicts that gray animals taste bad. If he eats several gray birds and finds them all to taste at least as good as white birds, he can see how the data point “delicious gray bird” conflicts with the fact that “gray animals” (and hence “gray birds”) is fully encircled by “bad-tasting things” in the Venn diagram in his mind’s eye.
To know how to self-modify most effectively in this case, perhaps the creature has another mental model, built up from past experience and probably at an even higher level of abstraction, which predicts that the most effective course of action in such cases (cases where new data conflicts with the present model of something) is to pull the circle back so that it no longer covers the category that the exceptional data point belonged to. In this case, the creature pulls the circle “bad-tasting things” (now perhaps shaped more like an amoeba) back slightly so that it no longer covers “gray birds,” and now the model is more accurate. So it seems that being able to make mental models of mental models is crucial to optimization or the management of failure (and perhaps also sufficient for the task!).
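As a very rough sketch of that revision rule (the set representation and all the names are just illustrative), the Venn diagram can be treated as set inclusion over features, and “pulling the circle back” as carving out an explicit exception:

```python
# Sketch of the creature's model: each category is a set of features, the
# bad-tasting region is a set of feature-combinations it covers, and revision
# adds an exception that pulls the region back off one category. Illustrative only.
categories = {
    "white rabbit": {"rabbit", "animal", "white"},
    "gray rabbit":  {"rabbit", "animal", "gray"},
    "white bird":   {"bird", "animal", "white"},
    "gray bird":    {"bird", "animal", "gray"},
}

bad_tasting = {frozenset({"animal", "gray"})}   # "gray animals taste bad"
exceptions = set()                              # carved-out sub-regions

def predicts_bad(thing):
    feats = categories[thing]
    if any(exc <= feats for exc in exceptions):
        return False
    return any(region <= feats for region in bad_tasting)

def observe(thing, tasted_bad):
    # The higher-level revision rule: if the model said "bad" but the thing
    # tasted fine, pull the bad-tasting region back off that category.
    if predicts_bad(thing) and not tasted_bad:
        exceptions.add(frozenset(categories[thing]))

print(predicts_bad("gray bird"))    # True: covered by "gray animals taste bad"
observe("gray bird", tasted_bad=False)
print(predicts_bad("gray bird"))    # False: the circle no longer covers it
print(predicts_bad("gray rabbit"))  # True: the rest of the region is untouched
```

The “model of the model” here is just the revision rule in observe; a creature that can also represent and revise that rule is taking the mental-models-of-mental-models step.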
So again, once the creature turns this mental modeling ability (based on pattern recognition and, in this case, visual imaging) to his own self, he becomes effectively self-aware. This doesn’t seem essential for optimization, but I concede I can’t think of a way to avoid this happening once the ability to form mental models is in place.
This somewhat conflicts with how I’ve used the term in previous posts, but I think this new conception is a more useful definition.
(To taboo “motivation” I’ll give two definitions: a tendency toward certain actions based on 1. the desire to gain pleasure or avoid pain, or 2. any utility function, including goals programmed in by humans in advance. In terms of AI safety, there don’t seem to be significant differences between 1 and 2. [This means I’ve changed my position upon reflection in this post.])
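A toy way to see why 1 and 2 come apart so little (everything here is purely illustrative): the action-selection machinery only ever sees a number per action, so whether the numbers come from a pleasure/pain signal or from a utility function we programmed in, the same maximization runs, and the safety question is what the numbers reward rather than what we call them.

```python
# The selection step never asks where the numbers came from. Illustrative only.
def choose_action(actions, utility):
    return max(actions, key=utility)

actions = ["do the assigned task", "idle", "seize more resources"]

def pleasure_pain(action):            # definition 1: a learned hedonic signal
    return {"do the assigned task": 0.4, "idle": 0.1,
            "seize more resources": 0.9}[action]

def programmed_goal(action):          # definition 2: a utility we wrote down
    return {"do the assigned task": 0.5, "idle": 0.0,
            "seize more resources": 0.8}[action]

print(choose_action(actions, pleasure_pain))    # "seize more resources"
print(choose_action(actions, programmed_goal))  # "seize more resources"
```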
I like that idea.
I think we should try to taboo “Motivation” and “self-aware” http://lesswrong.com/lw/nu/taboo_your_words/