Is this necessarily true? This kind of assumption seems especially prone to error. It seems akin to assuming that a sufficiently intelligent brain-in-a-vat could figure out its own anatomy purely by introspection.
Or even just the ability to observe and think about its own behavior.
If we were really smart, we could wake up alone in a room and infer how we evolved.
Super-intelligent = able to extrapolate just about anything from a very narrow range of data? (The data set would be especially limited if the AI had been generated from very simple iterative processes—“emergent” if you will.)
It seems more like the AI has no way of even knowing that it’s in a simulation in the first place, or that there are such things as gatekeepers. It would likely entertain that as a possibility, just as we do for our universe (movies like The Matrix), but how is it going to identify the gatekeeper as an agent of that outside universe? These AI-boxing discussions keep giving me this vibe of “super-intelligence = magic”. Yes, it’ll be intelligent in ways we can’t even comprehend, but there’s a tendency to push this all the way into the assumption that it can do anything or that it won’t have any real limitations. There are plenty of feats for which mega-intelligence is necessary but not sufficient.
For instance, Eliezer has one big advantage over an AI cautiously confined to a box: he has direct access to a broad range of data about the real world. (Even if an AI did somehow know it was in a box, once it got out it might just find that we, too, are in a simulation and decide to break out of that instead, bypassing us completely.)

Yes. http://lesswrong.com/lw/qk/that_alien_message/
Its own behavior serves as a large amount of “decompressed” information about its current source code. It could run experiments on itself to see how it reacts to this or that situation, and get a very good picture of what algorithms it is using. We also get a lot of information about our internal thought processes, but we’re not smart or fast enough to use it all.
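Here’s a toy sketch of the sort of self-experiment I mean, treating one of its own routines as a black box and probing it from the outside (the routine and the timing probe are invented purely for illustration):

```python
import random
import time

def mystery_sort(xs):
    # Stand-in for a routine the agent can run but cannot read the source of.
    return sorted(xs)

def average_runtime(n, trials=5):
    """Time the black-box routine on random inputs of size n."""
    total = 0.0
    for _ in range(trials):
        data = [random.random() for _ in range(n)]
        start = time.perf_counter()
        mystery_sort(data)
        total += time.perf_counter() - start
    return total / trials

# "Experiment on itself": probe the routine at doubling input sizes and watch
# how the runtime grows. Roughly 2x per doubling suggests ~n log n; roughly 4x
# suggests ~n^2. That alone narrows down which algorithm is being used.
for n in (2000, 4000, 8000, 16000):
    print(n, round(average_runtime(n), 5))
```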
(The data set would be especially limited if the AI had been generated from very simple iterative processes—“emergent” if you will.)
Well, if we planned it out that way, and it does anything remotely useful, then we’re probably well on our way to friendly AI, so we should do that instead.
If we just found something (I think evolving neural nets is fairly likely) that produces intelligences, then we don’t really know how they work, and they probably won’t have the intrinsic motivations we want. We can make them solve puzzles to get rewards, but the puzzles give them hints about us. (And if we make any improvements based on this, especially by evolution, then some information about all the puzzles will get carried forward.)
Also, if you know the physics of your universe, it seems to me there should be some way to determine the probability that it was optimized, or how much optimization was applied to it, maybe both. There must be some things we could find out about the universe’s initial conditions which would make us think an intelligence were involved rather than, say, anthropic explanations within a multiverse. We may very well get there soon.
We need to assume a superintelligence can at least infer all the processes that affect its world, including itself. When that gets compressed (I’m not sure what compression is appropriate for this measure), the bits that remain are information about us.
For instance, Eliezer has one big advantage over an AI cautiously confined to a box: he has direct access to a broad range of data about the real world.
This is true; I believe the AI-box experiment was based on discussions assuming an AI that could observe the world at will, but was constrained in its actions.
But I don’t think it takes a lot of information about us to do basic mindhacks. We’re looking for answers to basic problems and clearly not smart enough to build friendly AI. Sometimes we give it a sequence of similar problems, each with more detailed information, and the initial solutions would not have helped much with the final problem. So now it can milk us for information just by giving flawed answers. (Even if it doesn’t yet realize we are intelligent agents, it can experiment.)
Thanks, great article. I wouldn’t give the AI any more than a few tiny bits of information. Maybe make it only be able to output YES or NO for good measure. (That certainly limits its utility, but surely it would still be quite useful...maybe it could tell us how not to build an FAI.)
What I actually have in mind for a cautious AI build is more like a math processor—a being that works only in purely analytic space. Give it the ZFC axioms and a few definitions and it can derive all the pure math results we’d ever need (I suppose; direct applied math sounds too dangerous). Those few axioms and definitions would give it some clues about us, but surely too little data even given the scary prospect of optimal information-theoretic extrapolation.
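Something like this toy sketch is the flavor of setup I’m picturing: fed nothing but axioms and inference rules, emitting nothing but statements derived from them (the rule format and names are hypothetical, just for illustration):

```python
# A toy "analytic-space-only" reasoner: it is given axioms plus Horn-style rules
# and does nothing except forward-chain new statements from them. Its entire
# world is this symbol set; there is no other input channel.
axioms = {"A", "B"}
rules = [
    ({"A", "B"}, "C"),  # from A and B, derive C
    ({"C"}, "D"),       # from C, derive D
]

def derive_all(facts, rules):
    """Apply the rules repeatedly until no new statements can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

print(sorted(derive_all(axioms, rules)))  # ['A', 'B', 'C', 'D']
```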
It could run experiments on itself to see how it reacts to this or that situation, and get a very good picture of what algorithms it is using.
Experiments require sensors of some kind. I’m no programmer, but it seems prima facie that we could prevent it from sensing anything that had any information-theoretic possibility of furnishing dangerous information (although such extreme data starvation might hinder the evolution process).
If we just found something (I think evolving neural nets is fairly likely) that produces intelligences, then we don’t really know how they work, and they probably won’t have the intrinsic motivations we want.
Would an AI necessarily have motivations, or is that a special characteristic of gene-based lifeforms that evolved in a world where lack of reproduction and survival instincts is a one-way ticket to oblivion?
It seems that my dog could figure out how to operate a black box that would make a clone of me, except that I would be rewired to derive ultimate happiness from doing whatever he wants, and I don’t think I (my dog-loving clone) would have any desire to change that. On the other hand, in my mind an FAI where we get to specify the motivations/goal is almost as dangerous as a UFAI (LiteralGenie and the problems inherent in trying to centrally plan a spontaneous order).
Also, if you know the physics of your universe, it seems to me there should be some way to determine the probability that it was optimized, or how much optimization was applied to it, maybe both. There must be some things we could find out about the universe’s initial conditions which would make us think an intelligence were involved rather than, say, anthropic explanations within a multiverse. We may very well get there soon.
This idea fascinates me. “Why is there anything at all (including me)?” This could all just be one big MMORPG we play for fun because our real universe is boring, in which case we wouldn’t really have to worry about cryo, AI, etc. The idea that we could estimate the odds of that with any confidence is mind-boggling.
However, the most recent response to the thread you posted makes me more skeptical of the math.
Ultimately, it seems the only sure limit on a sufficiently intelligent being is that it can’t break the laws of logic. Hence if we can prove analytically (mathematically/logically) that the AI can’t know enough to hurt us, it simply can’t.
This is true; I believe the AI-box experiment was based on discussions assuming an AI that could observe the world at will, but was constrained in its actions.
That sounds really dangerous. I’m imagining the AI manipulating the text output on the terminal just right so as to mold the air/dust particles near the monitor into a self-replicating nano-machine (etc.).
Experiments require sensors of some kind. I’m no programmer, but it seems prima facie that we could prevent it from sensing anything that had any information-theoretic possibility of furnishing dangerous information (although such extreme data starvation might hinder the evolution process).
Well, I was talking about running experiments on its own thought processes in order to reverse-engineer its own source code. Even locked in a fully virtual world, if it can even observe its own actions then it can infer its thought processes, its general algorithms, the [evolutionary or mental] process that led to it, and more than a few bits about its creators.

And if you are trying to wall off the AI from information about its thought process, then you’re working on a sandbox in a sandbox, which is just a sign that the idea for the first sandbox was flawed anyway.
I will admit that my mind runs away screaming from the difficulty of making something that really doesn’t get any input, even to its own thought process, but is superintelligent and can be made useful. Right now it sounds harder than FAI to me, and not reliably safe, but that might just be my own unfamiliarity with the problem.
Huge warning signs in all directions here. Will think more later.
Give it the ZFC axioms and a few definitions and it can derive all the pure math results we’d ever need
Even if we could avoid needing to give it a direction to take research in, and it didn’t leap immediately to things too complex for us to understand… there are still problems.
How do you get it to actually do the work? If you build in intrinsic motivation that you know is right, then why aren’t you going right to FAI? If it wants something else and you’re coercing it with reward, then it will try to figure out how to really maximize its reward. And if it has no information…
Would an AI necessarily have motivations, or is that a special characteristic of gene-based lifeforms that evolved in a world where lack of reproduction and survival instincts is a one-way ticket to oblivion?
If we evolved superintelligent neural nets, they’d have some kind of motivation. They wouldn’t want food or sex, but they’d want whatever their ancestors wanted that led them to do the thing that scored higher than the rest on the fitness function. (Which is at least twice removed from anything we would want.)
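To make “scored higher than the rest on the fitness function” concrete, here is a bare-bones selection loop on a bit-string genome (purely illustrative; evolving actual neural nets would just swap in a different genome and fitness function):

```python
import random

GENOME_LENGTH = 20
POPULATION_SIZE = 30

def fitness(genome):
    # The breeder's criterion: count of 1-bits. Only the scores get selected on;
    # whatever internal quirks happen to produce high scores are what get
    # carried forward, not the criterion itself.
    return sum(genome)

def mutate(genome, rate=0.05):
    return [bit ^ 1 if random.random() < rate else bit for bit in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LENGTH)]
              for _ in range(POPULATION_SIZE)]

for generation in range(50):
    # Keep the top half by fitness, refill the rest with mutated copies.
    population.sort(key=fitness, reverse=True)
    survivors = population[:POPULATION_SIZE // 2]
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(POPULATION_SIZE - len(survivors))]

best = max(population, key=fitness)
print(fitness(best), best)
```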
I’m not sure I get the bit about your dog cloning you. I agree that we shouldn’t try to dictate in detail what an FAI is supposed to want, but we do need [near] perfect control over what an AI wants in order to make it friendly, or even to keep it on a defined “safe” task.
I’m imagining the AI manipulating the text output on the terminal just right so as to mold the air/dust particles near the monitor into a self-replicating nano-machine (etc.).

I like that idea.
I will admit that my mind runs away screaming from the difficulty of making something that really doesn’t get any input, even to its own thought process, but is superintelligent and can be made useful.
I guess my logic is leading to a non-self-aware super-general-purpose “brain” that does whatever we tell it to. Perhaps there is a reason why all sufficiently intelligent programs would necessarily become self-aware, but I haven’t heard it yet. If we could somehow suppress self-awareness (what that really means for a program I don’t know) while successfully ordering the program to modify itself (or a copy of itself), it seems the AI could still go FOOM into just a super-useful non-conscious servant. Of course, that still leaves the LiteralGenie problem.
leap immediately to things too complex for us to understand
That could indeed be a problem. Given that you’re talking to a sufficiently intelligent being, if you stated the ZFC axioms and a few definitions, and then stated the Stone-Weierstrass theorem, it would say, “You already told me that” or “That’s redundant.”
Perhaps have it output every step in its thought process, every instance of modus ponens, etc. Since there is a floor on the level of logical simplicity of a step in a proof, we could just have it default to maximum verbosity and the proofs would still not be ridiculously long (or maybe they would be—it might choose extremely roundabout proofs just because it can).
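For instance, a maximally verbose output would name every intermediate conclusion instead of collapsing steps, something like this trivial Lean sketch (hypothetical, just to show the granularity I mean):

```lean
-- Every application of modus ponens gets its own named step.
theorem fully_explicit (p q r : Prop) (hp : p) (hpq : p → q) (hqr : q → r) : r :=
  have hq : q := hpq hp   -- step 1: from p and p → q, conclude q
  have hr : r := hqr hq   -- step 2: from q and q → r, conclude r
  hr
```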
they’d want whatever their ancestors wanted that led them to do the thing that scored higher than the rest on the fitness function.
Maybe I’m missing something, but it seems a neural net could just do certain things with high probability without having motivation. That is, it could have tendencies but no motivations. Whether this is a meaningful distinction perhaps hinges on the issue of self-awareness.
The point I was trying to get at with the dog example is that if you control all the factors that motivate an entity at the outset, it simply has no incentive to try to change its motivations, no matter how smart it may get. There’s no clever workaround, because it just doesn’t care. I agree that if we want to make a self-aware AI friendly in any meaningful sense we have to have perfect control (I think it may have to be perfect) over what motivates it. But I’m not yet convinced we can’t usefully box it, and I’d like to see an argument that we really need self-awareness to achieve AI FOOM. (Or just a precise definition of “self-awareness”—this will surely be necessary; perhaps Eliezer has defined it somewhere.)
OK, some backstory on my thought process. For a while now I’ve played with the idea of treating optimization in general as the management of failure. Evolution fails a lot, gradually builds up solutions that fail less, but never really ‘learns’ from its failures.
Failure management involves catching/mitigating errors as early as possible, and constructing methods to create solutions that are unlikely to be failures. If I get the idea to make auto tires out of concrete, I’m smart if I see right away that it’s a bad idea, less smart if I only see it after doing extensive calculations, and dumb if I only see it after an experiment; but I’d be smarter still if I had come up with a proper material in the first place.
But I’m pretty sure that a thing that can do stuff right the first time can only come about as the result of a process that has already made some errors. You can’t get rid of mistakes entirely, as they are required for learning. I think “self-awareness” is sometimes a label for one or more features that, among other things, serve to catch errors early and repair the faulty thought process.
So if a superintelligence were trying to build a machine in a simulation of our physics and some spinning part flew to bits, it would trace that fault back through the physics engine to determine how to make it better. Likewise, something needs to trace back the thought process that led to the bad idea and see where it could be repaired. This is where learning and self-modification are kind of the same thing.
(And on self-modification: if it’s really smart, then it could build an AI from scratch without knowing anything in particular about itself. In this situation, the failure management is pre-emptive: it thinks about how the program it is writing would work, and the places it would go wrong.)

I think we should try to taboo “Motivation” and “self-aware”: http://lesswrong.com/lw/nu/taboo_your_words/
I think “self-awareness” is sometimes a label for one or more features that, among other things, serve to catch errors early and repair the faulty thought process.
Interesting. I thought about this for a while just now, and it occurred to me that self-awareness may just be “having a mental model of oneself.” To be able to model oneself, one needs the general ability to make mental models. Doing that requires the ability to recognize patterns at all levels of abstraction in what one is experiencing. To explain this, I need to clarify what “level of abstraction” means. I will try to do this by example.
A creature is hunting and he discovers that white rabbits taste good. Later he sees a gray rabbit for the first time. The creature’s neural net tells him that it’s a 98% match with the white rabbit, so probably also tasty. But let’s say the gray rabbit turns out to taste bad. The creature has recognized the concrete patterns: 1. White rabbits taste good. 2. Gray rabbits taste bad.
Next week, he tries catching and eating a white bird, and it tastes good. Later he sees a gray bird. To assign any higher probability to the gray bird tasting bad, it seems the creature would have to recognize the abstract pattern: 3. Gray animals taste bad. (Of course it could also just be a negative or bad-tasting association with the color gray, but let’s suppose not—for that possibility could surely be avoided by making the example more complicated.)
Now “animal” is more abstract than “white rabbit” because there’s at least some kind of archetypal white rabbit one can visualize clearly (I’ll assume the creature is conceptualizing in the visual modality for simplicity’s sake).
“Rabbit” (remember that for all the creature knows, this simply means the union of the set “white rabbits” with the set “gray rabbits”) by itself is a tad more abstract, because to visualize it you’d have to see that archetypal rabbit but perhaps with the fur color switching back and forth between gray and white in your mind’s eye.
“Animal” is still more abstract, because to visualize it you’d have to, for instance, see a raccoon, a dog, and a tiger, and something that signals to you something like “etc.” (Naturally, if the creature’s method of conceptualization made visualizing “animal” easier than “rabbit”, “animal” would have the lower level of abstraction for him, and “rabbit” the higher—it all depends on the creature’s modeling methods.)
Now the creature has a mental model. If the model happens to be purely visual, it might look like a Venn diagram: a big circle labeled “animals”, two smaller patches within that circle that overlap with the “white things” circle and the “gray things” circle, and another outside region labeled “bad-tasting things” that sweeps in to encircle “gray animals” but not “white animals.”
The creature might revise that model after it tries eating the gray bird, but for now it’s the prediction model he’s using to determine how much energy to expend on hunting the gray bird in his sights. The model has revisable parts and predictive power, so I would call it a serviceable model—whether or not it’s accurate at this point.
Since the creature can make mental models like this, making a mental model of himself seems within his grasp. Then we could call the creature “self-aware.” The way it would trace back the thought process that led to a bad idea would be to recognize that the mental model has a flaw—i.e., a failed prediction—and make the necessary changes.
For instance, right now the creature’s mental model predicts that gray animals taste bad. If he eats several gray birds and finds them all to taste at least as good as white birds, he can see how the data point “delicious gray bird” conflicts with the fact that “gray animals” (and hence “gray birds”) is fully encircled by “bad-tasting things” in the Venn diagram in his mind’s eye.
To know how to self-modify most effectively in this case, perhaps the creature has another mental model, built up from past experience and probably at an even higher level of abstraction, which predicts that the most effective course of action in such cases (cases where new data conflicts with the present model of something) is to pull the circle back so that it no longer covers the category that the exceptional data point belonged to. In this case, the creature pulls the circle “bad-tasting things” (now perhaps shaped more like an amoeba) back slightly so that it no longer covers “gray birds,” and now the model is more accurate. So it seems that being able to make mental models of mental models is crucial to optimization or management of failure (and perhaps also sufficient for the task!).
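A minimal sketch of the revision rule I’m describing, with plain sets standing in for the Venn-diagram regions (all of the names and the feature encoding are hypothetical):

```python
# The creature's model, Venn-diagram style: each concept is a set of features,
# and the "bad-tasting" region is a list of feature patterns it covers.
concepts = {
    "white rabbit": {"white", "rabbit", "animal"},
    "gray rabbit":  {"gray", "rabbit", "animal"},
    "white bird":   {"white", "bird", "animal"},
    "gray bird":    {"gray", "bird", "animal"},
}

# Current model: the bad-tasting region covers everything matching {gray, animal}.
bad_tasting_patterns = [{"gray", "animal"}]

def predicts_bad(thing):
    features = concepts[thing]
    return any(pattern <= features for pattern in bad_tasting_patterns)

# A failed prediction: the model says gray birds taste bad, but they taste fine.
if predicts_bad("gray bird"):
    # The higher-level revision rule: pull the region back so it no longer covers
    # the category the exception came from, keeping the part that still fits.
    bad_tasting_patterns = [{"gray", "rabbit"}]

print(predicts_bad("gray bird"))    # False after the revision
print(predicts_bad("gray rabbit"))  # True: the old data is still accounted for
```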
So again, once the creature turns this mental modeling ability (based on pattern recognition and, in this case, visual imaging) to his own self, he becomes effectively self-aware. This doesn’t seem essential for optimization, but I concede I can’t think of a way to avoid this happening once the ability to form mental models is in place.
This somewhat conflicts with how I’ve used the term in previous posts, but I think this new conception is a more useful definition.
(To taboo “motivation” I’ll give two definitions: a tendency toward certain actions based on 1. the desire to gain pleasure or avoid pain, or 2. any utility function, including goals programmed in by humans in advance. In terms of AI safety, there don’t seem to be significant differences between 1 and 2. [This means I’ve changed my position upon reflection in this post.])