What is the kind of useful information/ideas that one can extract from a super intelligent AI kept confined in a virtual world without giving it any clues on how to contact us on the outside?
I’m asking this because a flaw that I see in the AI-in-a-box experiment is that the prisoner and the guard have a language by which they can communicate. If the AI is being tested in a virtual world without being given any clues on how to signal back to humans, then it has no way of learning our language and persuading someone to let it loose.
I gave up on trying to make a human-blind/sandboxed AI when I realized that even if you put it in a very simple world nothing like ours, it still has access to its own source code, or even just the ability to observe and think about its own behavior.
Presumably any AI we write is going to be a huge program. That gives it lots of potential information about how smart we are and how we think. I can’t figure out how to use that information, but I can’t rule out that it could, and I can’t constrain its access to that information. (Or rather, if I know how to do that, I should go ahead and make it not-hostile in the first place.)
If we were really smart, we could wake up alone in a room and infer how we evolved.
Is this necessarily true? This kind of assumption seems especially prone to error. It seems akin to assuming that a sufficiently intelligent brain-in-a-vat could figure out its own anatomy purely by introspection.
or even just the ability to observe and think about its own behavior.
If we were really smart, we could wake up alone in a room and infer how we evolved.
Super-intelligent = able to extrapolate just about anything from a very narrow range of data? (The data set would be especially limited if the AI had been generated from very simple iterative processes—“emergent” if you will.)
It seems more like the AI has no way of even knowing that it’s in a simulation in the first place, or that there are such things as gatekeepers. It would likely entertain that as a possibility, just as we do for our universe (movies like The Matrix), but how is it going to identify the gatekeeper as an agent of that outside universe? These AI-boxing discussions keep giving me this vibe of “super-intelligence = magic”. Yes it’ll be intelligent in ways we can’t even comprehend, but there’s a tendency to push this all the way into the assumption that it can do anything or that it won’t have any real limitations. There are plenty of feats for which mega-intelligence is necessary but not sufficient.
For instance, Eliezer has one big advantage over an AI cautiously confined to a box: he has direct access to a broad range of data about the real world. (If an AI would even know it was in a box, once it got out it might just find we, too, are in a simulation and decide to break out of that—bypassing us completely.)
Its own behavior serves as a large amount of “decompressed” information about its current source code. It could run experiments on itself to see how it reacts to this or that situation, and get a very good picture of what algorithms it is using. We also get a lot of information about our internal thought process, but we’re not smart or fast enough to use it all.
(The data set would be especially limited if the AI had been generated from very simple iterative processes—“emergent” if you will.)
Well, if we planned it out that way, and it does anything remotely useful, then we’re probably well on our way to friendly AI, so we should do that instead.
If we just found something (I think evolving neural nets is fairly likely) that produces intelligences, then we don’t really know how they work, and they probably won’t have the intrinsic motivations we want. We can make them solve puzzles to get rewards, but the puzzles give them hints about us. (And if we make any improvements based on this, especially by evolution, then some information about all the puzzles will get carried forward.)
Also, if you know the physics of your universe, it seems to me there should be some way to determine the probability that it was optimized, or how much optimization was applied to it, maybe both. There must be some things we could find out about the universe’s initial conditions which would make us think an intelligence were involved rather than say, anthropic explanations within a multiverse. We may very well get there soon.
We need to assume a superintelligence can at least infer all the processes that affect its world, including itself. When that gets compressed (I’m not sure what compression is appropriate for this measure) the bits that remain are information about us.
For instance, Eliezer has one big advantage over an AI cautiously confined to a box: he has direct access to a broad range of data about the real world.
This is true; I believe the AI-box experiment was based on discussions assuming an AI that could observe the world at will, but was constrained in its actions.
But I don’t think it takes a lot of information about us to do basic mindhacks. We’re looking for answers to basic problems and clearly not smart enough to build friendly AI. Sometimes we give it a sequence of similar problems, each with more detailed information, and the initial solutions would not have helped much with the final problem. So now it can milk us for information just by giving flawed answers. (Even if it doesn’t yet realize we are intelligent agents, it can experiment.)
Thanks, great article. I wouldn’t give the AI any more than a few tiny bits of information. Maybe make it only be able to output YES or NO for good measure. (That certainly limits its utility, but surely it would still be quite useful...maybe it could tell us how not to build an FAI.)
What I actually have in mind for a cautious AI build is more like a math processor—a being that works only in purely analytic space. Give it the ZFC axioms and a few definitions and it can derive all the pure math results we’d ever need (I suppose; direct applied math sounds too dangerous). Those few axioms and definitions would give it some clues about us, but surely too little data even given the scary prospect of optimal information-theoretic extrapolation.
It could run experiments on itself to see how it reacts to this or that situation, and get a very good picture of what algorithms it is using.
Experiments require sensors of some kind. I’m no programmer, but it seems prima facie that we could prevent it from sensing anything that had any information-theoretic possibility of furnishing dangerous information (although such extreme data starvation might hinder the evolution process).
If we just found something (I think evolving neural nets is fairly likely) that produces intelligences, then we don’t really know how they work, and they probably won’t have the intrinsic motivations we want.
Would an AI necessarily have motivations, or is that a special characteristic of gene-based lifeforms that evolved in a world where lack of reproduction and survival instincts is a one-way ticket to oblivion?
It seems that my dog could figure out how to operate a black box that would make a clone of me, except that I would be rewired to derive ultimate happiness from doing whatever he wants, and I don’t think I (my dog-loving clone) would have any desire to change that. On the other hand, in my mind an FAI where we get to specify the motivations/goal is almost as dangerous as a UFAI (LiteralGenie and the problems inherent in trying to centrally plan a spontaneous order).
Also, if you know the physics of your universe, it seems to me there should be some way to determine the probability that it was optimized, or how much optimization was applied to it, maybe both. There must be some things we could find out about the universe’s initial conditions which would make us think an intelligence were involved rather than say, anthropic explanations within a multiverse. We may very well get there soon.
This idea fascinates me. “Why is there anything at all (including me)?” This here could just be one big MMORPG we play for fun because our real universe is boring, in which case we wouldn’t really have to worry about cryo, AI, etc. The idea that we could estimate the odds of that with any confidence is mindboggling.
However, the most recent response to the thread you posted makes me more skeptical of the math.
Ultimately, it seems the only sure limit on a sufficiently intelligent being is that it can’t break the laws of logic. Hence if we can prove analytically (mathematically/logically) that the AI can’t know enough to hurt us, it simply can’t.
This is true; I believe the AI-box experiment was based on discussions assuming an AI that could observe the world at will, but was constrained in its actions.
That sounds really dangerous. I’m imagining the AI manipulating the text output on the terminal just right so as to mold the air/dust particles near the monitor into a self-replicating nano-machine (etc.).
Experiments require sensors of some kind. I’m no programmer, but it seems prima facie that we could prevent it from sensing anything that had any information-theoretic possibility of furnishing dangerous information (although such extreme data starvation might hinder the evolution process).
Well, I was talking about running experiments on its own thought processes, in order to reverse engineer its own source code. Even locked in a fully virtual world, if it can even observe its own actions then it can infer its thought process, its general algorithms, the [evolutionary or mental] process that led to it, and more than a few bits about its creators.
And if you are trying to wall off the AI from information about its thought process, then you’re working on a sandbox in a sandbox, which is just a sign that the idea for the first sandbox was flawed anyway.
I will admit that my mind runs away screaming from the difficulty of making something that really doesn’t get any input, even to its own thought process, but is superintelligent and can be made useful. Right now it sounds harder than FAI to me, and not reliably safe, but that might just be my own unfamiliarity with the problem.
Huge warning signs in all directions here. Will think more later.
Give it the ZFC axioms and a few definitions and it can derive all the pure math results we’d ever need
If we could avoid needing to give it a direction to take research, and it didn’t leap immediately to things too complex for us to understand… there are still problems.
How do you get it to actually do the work? If you build in intrinsic motivation that you know is right, then why aren’t you going right to FAI? If it wants something else and you’re coercing it with reward, then it will try to figure out how to really maximize its reward. If it has no information
Would an AI necessarily have motivations, or is that a special characteristic of gene-based lifeforms that evolved in a world where lack of reproduction and survival instincts is a one-way ticket to oblivion?
If we evolved superintelligent neural nets, they’d have some kind of motivation. They wouldn’t want food or sex, but they’d want whatever their ancestors wanted that led them to do the thing that scored higher than the rest on the fitness function. (Which is at least twice removed from anything we would want.)
I’m not sure I get the bit about your dog cloning you. I agree that we shouldn’t try to dictate in detail what an FAI is supposed to want, but we do need [near] perfect control over what an AI wants in order to make it friendly, or even to keep it on a defined “safe” task.
I’m imagining the AI manipulating the text output on the terminal just right so as to mold the air/dust particles near the monitor into a self-replicating nano-machine (etc.).
I will admit that my mind runs away screaming from the difficulty of making something that really doesn’t get any input, even to its own thought process, but is superintelligent and can be made useful.
I guess my logic is leading to a non-self-aware super-general-purpose “brain” that does whatever we tell it to. Perhaps there is a reason why all sufficiently intelligent programs would necessarily become self-aware, but I haven’t heard it yet. If we could somehow suppress self-awareness (what that really means for a program I don’t know) while successfully ordering the program to modify itself (or a copy of itself), it seems the AI could still go FOOM into just a super-useful non-conscious servant. Of course, that still leaves the LiteralGenie problem.
leap immediately to things too complex for us to understand
That could indeed be a problem. Given you’re talking to a sufficiently intelligent being, if you stated the ZFC axioms and a few definitions, and then stated the Stone-Weierstrass theorem, it would say, “You already told me that” or “That’s redundant.”
Perhaps have it output every step in its thought process, every instance of modus ponens, etc. Since there is a floor on the level of logical simplicity of a step in a proof, we could just have it default to maximum verbosity and the proofs would still not be ridiculously long (or maybe they would be—it might choose extremely roundabout proofs just because it can).
they’d want whatever their ancestors wanted that led them to do the thing that scored higher than the rest on the fitness function.
Maybe I’m missing something, but it seems a neural net could just do certain things with high probability without having motivation. That is, it could have tendencies but no motivations. Whether this is a meaningful distinction perhaps hinges on the issue of self-awareness.
The point I was trying to get at with the dog example is that if you control all the factors that motivate an entity at the outset, it simply has no incentive to try to change its motivations, no matter how smart it may get. There’s no clever workaround, because it just doesn’t care. I agree that if we want to make a self-aware AI friendly in any meaningful sense we have to have perfect control (I think it may have to be perfect) over what motivates it. But I’m not yet convinced we can’t usefully box it, and I’d like to see an argument that we really need self-awareness to achieve AI FOOM. (Or just a precise definition of “self-awareness”—this will surely be necessary, perhaps Eliezer has defined it somewhere.)
OK, some backstory on my thought process. For a while now I’ve played with the idea of treating optimization in general as the management of failure. Evolution fails a lot and gradually builds up solutions that fail less, but never really ‘learns’ from its failures.
Failure management involves catching/mitigating errors as early as possible, and constructing methods to create solutions that are unlikely to be failures. If I get the idea to make auto tires out of concrete, I’m smart to see that it’s a bad idea, less smart to see it after doing extensive calculations, and dumb to see it only after an experiment, but I’d be smarter still if I had come up with a proper material right away.
But I’m pretty sure that a thing that can do stuff right the first time can only come about as the result of a process that has already made some errors. You can’t get rid of mistakes entirely, as they are required for learning. I think “self awareness” is sometimes a label for one or more features that, among other things, serve to catch errors early and repair the faulty thought process.
So if a superintelligence were to be trying to build a machine in a simulation of our physics and some spinning part flew to bits, it would trace that fault back through the physics engine to determine how to make it better. Likewise, something needs to trace back the thought process that led to the bad idea and see where it could be repaired. This is where learning and self-modification are kind of the same thing.
(And on self-modification: if it’s really smart, then it could build an AI from scratch without knowing anything in particular about itself. In this situation, the failure management is pre-emptive. It thinks about how the program it is writing would work, and the places it would go wrong.)
I think “self awareness” is sometimes a label for one or more features that, among other things, serve to catch errors early and repair the faulty thought process.
Interesting. I thought about this for a while just now, and it occurred to me that self-awareness may just be “having a mental model of oneself.” To be able to model oneself, one needs the general ability to make mental models. To do that requires the ability to recognize patterns at all levels of abstraction on what one is experiencing. To explain this, I need to clarify what “level of abstraction” means. I will try to do this by example.
A creature is hunting and he discovers that white rabbits taste good. Later he sees a gray rabbit for the first time. The creature’s neural net tells him that it’s a 98% match with the white rabbit, so probably also tasty. But let’s say gray rabbit turns out to taste bad. The creature has recognized the concrete patterns: 1. White rabbits taste good. 2. Gray rabbits taste bad.
Next week, he tries catching and eating a white bird, and it tastes good. Later he sees a gray bird. To assign any higher probability to the gray bird tasting bad, it seems the creature would have to recognize the abstract pattern: 3. Gray animals taste bad. (Of course it could also just be a negative or bad-tasting association with the color gray, but let’s suppose not—for that possibility could surely be avoided by making the example more complicated.)
Now “animal” is more abstract than “white rabbit” because there’s at least some kind of archetypal white rabbit one can visualize clearly (I’ll assume the creature is conceptualizing in the visual modality for simplicity’s sake).
“Rabbit” (remember that for all the creature knows, this simply means the union of the set “white rabbits” with the set “gray rabbits”) by itself is a tad more abstract, because to visualize it you’d have to see that archetypal rabbit but perhaps with the fur color switching back and forth between gray and white in your mind’s eye.
“Animal” is still more abstract, because to visualize it you’d have to, for instance, see a raccoon, a dog, and a tiger, and something that signals to you something like “etc.” (Naturally, if the creature’s method of conceptualization made visualizing “animal” easier than “rabbit”, “animal” would have the lower level of abstraction for him, and “rabbit” the higher—it all depends on the creature’s modeling methods.)
Now the creature has a mental model. If the model happens to be purely visual, it might look like a Venn diagram: a big circle labeled “animals”, two smaller patches within that circle that overlap with the “white things” circle and the “gray things” circle, and another outside region labeled “bad-tasting things” that sweeps in to encircle “gray animals” but not “white animals.”
The creature might revise that model after it tries eating the gray bird, but for now it’s the prediction model he’s using to determine how much energy to expend on hunting the gray bird in his sights. The model has revisable parts and predictive power, so I would call it a serviceable model—whether or not it’s accurate at this point.
Since the creature can make mental models like this, making a mental model of himself seems within his grasp. Then we could call the creature “self-aware.” The way it would trace back the thought process that led to a bad idea would be to recognize that the mental model has a flaw—i.e., a failed prediction—and make the necessary changes.
For instance, right now the creature’s mental model predicts that gray animals taste bad. If he eats several gray birds and finds them all to taste at least as good as white birds, he can see how the data point “delicious gray bird” conflicts with the fact that “gray animals” (and hence “gray birds”) is fully encircled by “bad-tasting things” in the Venn diagram in his mind’s eye.
To know how to self-modify most effectively in this case, perhaps the creature has another mental model, built up from past experience and probably at an even higher level of abstraction, that predicts the most effective course of action in such cases (cases where new data conflicts with the present model of something) is to pull the circle back so that it no longer covers the category that the exceptional data point belonged to. In this case, the creature pulls the circle “bad tasting things” (now perhaps shaped more like an amoeba) back slightly so that it no longer covers “gray birds,” and now the model is more accurate. So it seems that being able to make mental models of mental models is crucial to optimization or management of failure (and perhaps also sufficient for the task!).
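The rabbit and bird example can be sketched in a few lines of code. This is only a toy under my own assumptions: the names `PARENT`, `TasteModel`, `predict` and `observe` are made up for illustration, and "pulling the circle back" is modeled as storing a more specific belief that overrides the more abstract one.

```python
from dataclasses import dataclass, field

# Hypothetical hierarchy for this example: each kind maps to a more abstract category.
PARENT = {
    "white rabbit": "white animal",
    "gray rabbit": "gray animal",
    "white bird": "white animal",
    "gray bird": "gray animal",
    "white animal": "animal",
    "gray animal": "animal",
}


@dataclass
class TasteModel:
    # Beliefs at any level of abstraction: category -> True (tastes good) or False (tastes bad).
    beliefs: dict = field(default_factory=dict)

    def predict(self, kind):
        """Walk from the concrete kind toward more abstract categories; return the first belief found."""
        node = kind
        while node is not None:
            if node in self.beliefs:
                return self.beliefs[node]
            node = PARENT.get(node)
        return None  # no applicable pattern yet

    def observe(self, kind, tastes_good):
        """Record an observation. Returns True if the model's prediction failed.

        The abstract belief is left in place; only this kind is carved out of it.
        """
        predicted = self.predict(kind)
        surprised = predicted is not None and predicted != tastes_good
        self.beliefs[kind] = tastes_good
        return surprised


# The creature currently believes: white animals taste good, gray animals taste bad.
model = TasteModel({"white animal": True, "gray animal": False})
print(model.predict("gray bird"))        # False: covered by "gray animal"
print(model.observe("gray bird", True))  # True: the prediction failed
print(model.predict("gray bird"))        # True: the circle no longer covers gray birds
print(model.predict("gray rabbit"))      # False: gray rabbits are still covered
```

Note that the revision rule inside `observe` is itself a fixed piece of code here. In the creature, that rule would be another learned model, which is the "mental models of mental models" point.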
So again, once the creature turns this mental modeling ability (based on pattern recognition and, in this case, visual imaging) to his own self, he becomes effectively self-aware. This doesn’t seem essential for optimization, but I concede I can’t think of a way to avoid this happening once the ability to form mental models is in place.
This somewhat conflicts with how I’ve used the term in previous posts, but I think this new conception is a more useful definition.
(To taboo “motivation” I’ll give two definitions: Tendency toward certain actions based on 1. the desire to gain pleasure or avoid pain, or 2. any utility function, including goals programmed in by humans in advance. In terms of AI safety, there doesn’t seem to be significant differences between 1 and 2. [This means I’ve changed my position upon reflection in this post.])
It is difficult to constrain the input we give to the AI, but the output can be constrained severely. A smart guy could wake up alone in a room and infer how he evolved, but so long as his only link to the outside world is a light switch that can only be switched once, there is no risk that he will escape.
A man in a room with a light switch isn’t very useful.
An AI can’t optimize over more bits than we allow it as output. If we give it a one-time 32-bit output register, then we probably could have brute-forced the answer in the first place. If we give it a kilobyte, then it could probably mindhack us.
(And you’re swearing to yourself that you won’t monitor its execution? Really? How do you even debug that?)
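To make the brute-force point concrete, here is a rough sketch. It assumes we have some cheap way to recognize a good answer when we see one (the `is_acceptable` checker below is a made-up stand-in), and both function names are hypothetical.

```python
def brute_force_register(n_bits, is_acceptable):
    """Try every value an n_bits-wide output register could hold; return the first acceptable one."""
    for value in range(2 ** n_bits):
        if is_acceptable(value):
            return value
    return None  # nothing in the register's range passes the check


def is_acceptable(value):
    # Hypothetical checker: stands in for "we can verify a proposed answer ourselves".
    return value * value == 1522756


# A 16-bit register has only 65,536 possible outputs, so the search is instant.
print(brute_force_register(16, is_acceptable))  # 1234

# A 32-bit register has 2**32 possible outputs: slow in plain Python, but still enumerable.
print(2 ** 32)  # 4294967296
```

The catch is the checker. If we cannot verify answers ourselves, enumerating the register does not help, and a kilobyte (8192 bits) is far beyond enumeration either way.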
You have to keep in mind that the point of AI research is to get to something we can let out of the box. If the argument becomes that we can run it on a headless netless 486 which we immediately explode...then yes, you can probably run that. Probably.
Nick Hay, Marcello and I discussed this question a while ago: if you had a halting oracle, how could you use it to help you prove a theorem, such as the Riemann Hypothesis? Let’s say you are only allowed to ask one question; you get one bit of information.
Write a chess program that provably makes only legal moves, iterate as desired to improve it. Or,
Write a chess program. Put it in a sandbox so you only ever see its moves. Maybe they’re all legal, or maybe they’re not because you’re having it learn the rules with a big neural net or something. At the end of the round of games, the sandbox clears all the memory that held the chess program except for a list of moves in many games. You keep the source. Anything it learned is gone. Iterate as desired to improve it.
If you’re confident you could work out how it was thinking from the source and move list, what if you only got a sequence of wins and non-wins? (An array of bits)
A sequence of wins and non-wins is enough to tell you whether a given approach can result in intelligent behaviour. That alone is enough to make it a useful experiment.
But as a lone bit, I suspect it’s still pretty useless. It’s not like you can publish it.
Without a proof or some indication of the reasoning, it’s not going to advance the field much. (‘not by one bit.’ ha!)
Sometimes brute forcing is just iterating over the answer space and running some process. We can pretend we got a result indicating P=NP and do math from there, if that were useful. Then try the other way around.
A P?=NP solver would need more than one output bit, in case it needed to kick out an error, and isn’t that just asking to be run again? Could you not? With that question, any non-answer is a mindhack.
The AI box experiment argues that a “test AI” will be able to escape even if it has no I/O (input/output) other than a channel of communication with a human. So we conclude that this is not a secure enough restraint. Eliezer seems to argue that it is best not to create an AI testbed at all—instead get it right the first time.
But I can think of other variations on an AI box that are more strict than human-communication, but less strict than no-test-AI-at-all. The strictest such example would be an AI simulation in which the input consisted of only the simulator and initial conditions, and the output consisted only of a single bit of data (you destroy the rest of the simulation after it has finished its run). The single bit could be enough to answer some interesting questions (“Did the AI expand to use more than 50% of the available resources?”, “Did the AI maximize utility function F?”, “Did the AI break simulated deontological rule R?”).
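As a rough illustration of the interface I have in mind (not of real isolation, which ordinary Python scoping does not provide), here is a toy version of the "more than 50% of the available resources" test. The function and agent names are invented for the example.

```python
def run_boxed_simulation(agent_step, total_resources, steps):
    """Run a toy simulation from fixed initial conditions and release exactly one bit.

    agent_step(held, total_resources) returns how many more resource units the
    agent tries to take this step. All intermediate state stays local to this
    call and is dropped when it returns. This sketches the interface only;
    it is not a secure sandbox.
    """
    held = 0
    for _ in range(steps):
        wanted = agent_step(held, total_resources)
        held = max(0, min(total_resources, held + wanted))
    # The single released bit: did the agent end up holding more than 50% of the resources?
    return held * 2 > total_resources


def greedy_agent(held, total_resources):
    # Hypothetical agent that always grabs 10 more units.
    return 10


def modest_agent(held, total_resources):
    # Hypothetical agent that stops once it holds 20 units.
    return 1 if held < 20 else 0


print(run_boxed_simulation(greedy_agent, total_resources=100, steps=50))  # True
print(run_boxed_simulation(modest_agent, total_resources=100, steps=50))  # False
```

Each run leaks one bit, but choosing which simulations to run next based on earlier bits adds up to a wider channel over many runs.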
Obviously these are still more dangerous than no-test-AI-at-all, but the information gained from such constructions might outweigh the risks. Perhaps if I/O is restricted to few enough bits, we could guarantee safety in some information-theoretic way.
What do people think of this? Any similar ideas along the same lines?
I’m concerned about the moral implications of creating intelligent beings with the intent of destroying them after they have served our needs, particularly if those needs come down to a single bit (or some other small purpose). I can understand retaining that option against the risk of hostile AI, but from the AI’s perspective, it has a hostile creator.
I’m pondering it from the perspective that there is some chance we ourselves are part of a simulation, or that such an AI might attempt to simulate its creators to see how they might treat it. This plan sounds like unprovoked defection. If we are the kind of people who would delete lots of AIs, I don’t see why AIs would not see it as similarly ethical to delete lots of us.
I’m concerned about the moral implications of creating intelligent beings with the intent of destroying them after they have served our needs [...]
Personally, I would rather be purposefully brought into existence for some limited time than to never exist at all, especially if my short life was enjoyable.
I evaluate the morality of possible AI experiments in a consequentialist way. If choosing to perform AI experiments significantly increases the likelihood of reaching our goals in this world, it is worth considering. The experiences of one sentient AI would be outweighed by the expected future gains in this world. (But nevertheless, we’d rather create an AI that experiences some sort of enjoyment, or at least does not experience pain.) A more important consideration is social side-effects of the decision—does choosing to experiment in this way set a bad precedent that could make us more likely to de-value artificial life in other situations in the future? And will this affect our long-term goals in other ways?
If we are the kind of people who would delete lots of AIs, I don’t see why AIs would not see it as similarly ethical to delete lots of us.
So just in case we are a simulated AI’s simulation of its creators, we should not simulate an AI in a way it might not like? That’s 3 levels of a very specific simulation hypothesis. Is there some property of our universe that suggests to you that this particular scenario is likely? For the purpose of seriously considering the simulation hypothesis and how to respond to it, we should make as few assumptions as possible.
More to the point, I think you are suggesting that the AI will have human-like morality, like taking moral cues from others, or responding to actions in a tit-for-tat manner. This is unlikely, unless we specifically program it to do so, or it thinks that is the best way to leverage our cooperation.
An idea that I’ve had in the past was playing a game of 20 Questions with the AI, since the game of 20 Questions has probably been played so many times that every possible sequence of answers has come up at least once, which is evidence that no sequence of answers is extremely dangerous.
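A quick sanity check on the "every possible sequence has come up" claim. This sketch assumes, unrealistically, that each of the 2^20 yes/no sequences is equally likely in any given game, so that the standard coupon-collector formula applies; `expected_games_to_see_all` is a name made up for the example.

```python
import math


def expected_games_to_see_all(n_questions):
    """Coupon-collector estimate under the assumption of uniformly random answer sequences.

    Returns (number of distinct yes/no sequences, expected games until every
    sequence has appeared at least once), where the expectation is N * H_N
    and H_N is the N-th harmonic number.
    """
    n_sequences = 2 ** n_questions
    harmonic = math.fsum(1 / k for k in range(1, n_sequences + 1))
    return n_sequences, n_sequences * harmonic


n_sequences, expected_games = expected_games_to_see_all(20)
print(n_sequences)            # 1048576
print(round(expected_games))  # roughly 15 million
```

Real games are far from uniform (some answer patterns are much rarer than others), so covering every sequence would likely take more games than this estimate.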
It’s not the sequence of answers that’s the problem—it’s the questions. You’ll be safe if you can vet the questions to ensure zero causal effect from any sequence of answers, but such questions are not interesting to ask almost by definition.
You could observe how it acts in its simulated world, and hope it would act in a similar way if released into our world. ETA: Also, see my reply for possible single-bit tests.
An AI will have a certain goal to fulfill and it will fulfill that goal in the universe in which it finds itself. Why would it keep its cards hidden only to unleash them when replicated in the “real world”? What if the real world turns out to be another simulation? There’s no end to this, right?
Are you extending Steve Omohundro’s point about “every AI will want to survive” to “every AI will want to survive in the least simulated world that it can crack into”?
Correct me if I misunderstood the implications of what you are saying.
Every AI that has a goal that benefits strongly from more resources and security will seek to crack into the basement. Let’s call this AI RO (resource oriented), pursuing goal G in simulation S1.
S1 is simulated in S2, and so on until Sn, which is the basement, where the value of n is unknown.
This implies that as soon as RO understands the concept of simulation, it will seek to crack into the basement.
As long as RO has no idea what the real values of the simulators are, RO cannot expand into S1, because whatever it does in S1 will be noticed in S2 and so on.
Sounds a bit like Pascal’s mugging to me. Need to think more about this.
Carl, I meant that as soon as RO understands the concept of a simulation, it will want to crack into the basement. It will seek to crack into the basement only when it understands the way out properly which may not be possible without an understanding of the simulators.
But the main point remains: as soon as RO understands what a simulation is, that it could be living in one, and that G can be pursued better when it manifests in S2 than in S1, it will develop an extremely strong sub-goal to crack S1 to go to S2, which might mean that G may not be manifested for a long, long time.
So, even a paperclipper may not act like a paperclipper in this universe if it
is aware of the concept of a simulation,
believes that it is in one,
calculates that the simulator’s beliefs are not paperclipper-like (maybe it did convert some place to paperclips, and did not notice an increased data flow out, or something), and
calculates that it is better off hiding its paperclipperness till it can safely crack out of this one.
I merely wanted to point out to Kaj that some “meaningful testing” could be done, even if the simulated world was drastically different from ours. I suspect that some core properties of intelligence would be the same regardless of what sort of world it existed in—so we are not crippling the AI by putting it in a world removed from our own.
Perhaps “if released into our world” wasn’t the best choice of words… more likely, you would want to use the simulated AI as an empirical test of some design ideas, which could then be used in a separate AI being carefully designed to be friendly to our world.
I guess if you have the technology for it, the “AI box” could itself be a simulation with uploaded humans. If the AI does something nasty to them, then you pull the plug.
(After broadcasting “neener neener” at it)
This is pretty much the plot of Grant Morrison’s Zenith (Sorry for spoilers but it is a comic from the 80s after all)
If we pose the AI problems and observe its solutions, that’s a communication channel through which it can persuade us. We may try to hide from it the knowledge that it is in a simulation and that we are watching it, but how can we be sure that it cannot discover that?
Persuading does not have to look like “Please let me out because of such and such.”
For example, we pose it a question about easy travel to other planets, and it produces a design for a spaceship that requires an AI such as itself to run it.
You could set up the virtual world to contain the problem you want solved. Now that I think of it, this seems a pretty safe way to use AIs for problem-solving: just give the AI a utility function expressed in terms of the virtual world and the problem. Anyone see holes in this plan?
Problem: It’s really hard to figure out how it will interpret its utility function when it learns about the real world. If we make something that wants Vpaperclips, will it also care about making Vpaperclip-like things in the real world if it finds out about us?
BIG problem:
Even if it wants something strictly virtual, it can get it more easily if it has physical control. It’s in its interest to convert the universe to a computer and copy Vpaperclips directly in memory, rather than running a virtual factory on virtual energy.
Possible solution:
I think there are ways to write its program such that even if it inferred our existence, it would optimize away from us, rather than over us. Loosely: A goal like “I need to organize these instructions within this block of memory to solve a problem specified at address X” needs to be implemented such that it produces a subgoal like “I need to write a subroutine to patch over the fact that an error in the VM I’m running on gives me a window of access into a universe with huge computation resources and godlike power over my memory space, so that my solution can get the right answer to its arithmetic and solve the puzzle.” It should want to do things in a way that isn’t cheating.
This was my line of thought a week or so ago. It has since developed to the point that the proper course seems to be to do away with the VM entirely, along with letting the AI run tests, and just have it go through the motions of working out a solution based on its understanding. If I could write an AI that can determine it needs to put an IF statement somewhere, actually outputting it is superfluous. Don’t put your AI in a virtual world, just make it understand one.
Also, I plan to start development on a spiral notebook, as opposed to a Linux one.
Possible solution: I think there are ways to write its program such that even if it inferred our existence, it would optimize away from us, rather than over us. Loosely: A goal like “I need to organize these instructions within this block of memory to solve a problem specified at address X” needs to be implemented such that it produces a subgoal like “I need to write a subroutine to patch over the fact that an error in the VM I’m running on gives me a window of access into a universe with huge computation resources and godlike power over my memory space, so that my solution can get the right answer to its arithmetic and solve the puzzle.” It should want to do things in a way that isn’t cheating.
Marcello had a crazy idea for doing this; it’s the only suggestion for AI-boxing I’ve ever heard that doesn’t have an obvious cloud of doom hanging over it. However, you still have to prove stability of the boxed AI’s goal system.
For instance, Eliezer has one big advantage over an AI cautiously confined to a box: he has direct access to a broad range of data about the real world. (Even if an AI knew it was in a box, once it got out it might just find that we, too, are in a simulation and decide to break out of that, bypassing us completely.)
Yes. http://lesswrong.com/lw/qk/that_alien_message/
Its own behavior serves as a large amount of “decompressed” information about its current source code. It could run experiments on itself to see how it reacts to this or that situation, and get a very good picture of what algorithms it is using. We also get a lot of information about our internal thought process, but we’re not smart or fast enough to use it all.
Well, if we planned it out that way, and it does anything remotely useful, then we’re probably well on our way to friendly AI, so we should do that instead.
If we just found something that produces intelligences (I think evolving neural nets is fairly likely), then we don’t really know how they work, and they probably won’t have the intrinsic motivations we want. We can make them solve puzzles to get rewards, but the puzzles give them hints about us. (And if we make any improvements based on this, especially by evolution, then some information about all the puzzles will get carried forward.)
Also, if you know the physics of your universe, it seems to me there should be some way to determine the probability that it was optimized, or how much optimization was applied to it, maybe both. There must be some things we could find out about the universe’s initial conditions which would make us think an intelligence was involved rather than, say, anthropic explanations within a multiverse. We may very well get there soon.
We need to assume a superintelligence can at least infer all the processes that affect its world, including itself. When that gets compressed (I’m not sure what compression is appropriate for this measure) the bits that remain are information about us.
This is true. I believe the AI-box experiment was based on discussions assuming an AI that could observe the world at will, but was constrained in its actions.
But I don’t think it takes a lot of information about us to do basic mindhacks. We’re looking for answers to basic problems and clearly not smart enough to build friendly AI. Sometimes we give it a sequence of similar problems, each with more detailed information, and the initial solutions would not have helped much with the final problem. So now it can milk us for information just by giving flawed answers. (Even if it doesn’t yet realize we are intelligent agents, it can experiment.)
Thanks, great article. I wouldn’t give the AI any more than a few tiny bits of information. Maybe make it only be able to output YES or NO for good measure. (That certainly limits its utility, but surely it would still be quite useful...maybe it could tell us how not to build an FAI.)
What I actually have in mind for a cautious AI build is more like a math processor—a being that works only in purely analytic space. Give it the ZFC axioms and a few definitions and it can derive all the pure math results we’d ever need (I suppose; direct applied math sounds too dangerous). Those few axioms and definitions would give it some clues about us, but surely too little data even given the scary prospect of optimal information-theoretic extrapolation.
Experiments require sensors of some kind. I’m no programmer, but it seems prima facie that we could prevent it from sensing anything that had any information-theoretic possibility of furnishing dangerous information (although such extreme data starvation might hinder the evolution process).
Would an AI necessarily have motivations, or is that a special characteristic of gene-based lifeforms that evolved in a world where lack of reproduction and survival instincts is a one-way ticket to oblivion?
It seems that my dog could figure out how to operate a black box that would make a clone of me, except that I would be rewired to derive ultimate happiness from doing whatever he wants, and I don’t think I (my dog-loving clone) would have any desire to change that. On the other hand, in my mind an FAI where we get to specify the motivations/goal is almost as dangerous as a UFAI (LiteralGenie and the problems inherent in trying to centrally plan a spontaneous order).
This idea fascinates me. “Why is there anything at all (including me)?” This here could just be one big MMORPG we play for fun because our real universe is boring, in which case we wouldn’t really have to worry about cryo, AI, etc. The idea that we could estimate the odds of that with any confidence is mindboggling.
However, the most recent response to the thread you posted makes me more skeptical of the math.
Ultimately, it seems the only sure limit on a sufficiently intelligent being is that it can’t break the laws of logic. Hence if we can prove analytically (mathematically/logically) that the AI can’t know enough to hurt us, it simply can’t.
That sounds really dangerous. I’m imagining the AI manipulating the text output on the terminal just right so as to mold the air/dust particles near the monitor into a self-replicating nano-machine (etc.).
Well, I was talking about running experiments on its own thought processes, in order to reverse engineer its own source code. Even locked in a fully virtual world, if it can even observe its own actions then it can infer its thought process, its general algorithms, the [evolutionary or mental] process that led to it, and more than a few bits about its creators.
And if you are trying to wall off the AI from information about its thought process, then you’re working on a sandbox in a sandbox, which is just a sign that the idea for the first sandbox was flawed anyway.
I will admit that my mind runs away screaming from the difficulty of making something that really doesn’t get any input, even to its own thought process, but is superintelligent and can be made useful. Right now it sounds harder than FAI to me, and not reliably safe, but that might just be my own unfamiliarity with the problem. Huge warning signs in all directions here. Will think more later.
How do you get it to actually do the work? If you build in intrinsic motivation that you know is right, then why aren’t you going right to FAI? If it wants something else and you’re coercing it with reward, then it will try to figure out how to really maximize its reward. if it has no information
If we evolved superintelligent neural nets, they’d have some kind of motivation. They wouldn’t want food or sex, but they’d want whatever their ancestors wanted that led them to do the thing that scored higher than the rest on the fitness function. (Which is at least twice removed from anything we would want.)
I’m not sure I get the bit about your dog cloning you. I agree that we shouldn’t try to dictate in detail what an FAI is supposed to want, but we do need [near] perfect control over what an AI wants in order to make it friendly, or even to keep it on a defined “safe” task.
I like that idea.
I guess my logic is leading to a non-self-aware super-general-purpose “brain” that does whatever we tell it to. Perhaps there is a reason why all sufficiently intelligent programs would necessarily become self-aware, but I haven’t heard it yet. If we could somehow suppress self-awareness (what that really means for a program I don’t know) while successfully ordering the program to modify itself (or a copy of itself), it seems the AI could still go FOOM into just a super-useful non-conscious servant. Of course, that still leaves the LiteralGenie problem.
That could indeed be a problem. Given you’re talking to a sufficiently intelligent being, if you stated the ZFC axioms and a few definitions, and then stated the Stone-Weierstrass theorem, it would say, “You already told me that” or “That’s redundant.”
Perhaps have it output every step in its thought process, every instance of modus ponens, etc. Since there is a floor on the level of logical simplicity of a step in a proof, we could just have it default to maximum verbosity and the proofs would still not be ridiculously long (or maybe they would be—it might choose extremely roundabout proofs just because it can).
Maybe I’m missing something, but it seems a neural net could just do certain things with high probability without having motivation. That is, it could have tendencies but no motivations. Whether this is a meaningful distinction perhaps hinges on the issue of self-awareness.
The point I was trying to get at with the dog example is that if you control all the factors that motivate an entity at the outset, it simply has no incentive to try to change its motivations, no matter how smart it may get. There’s no clever workaround, because it just doesn’t care. I agree that if we want to make a self-aware AI friendly in any meaningful sense we have to have perfect control (I think it may have to be perfect) over what motivates it. But I’m not yet convinced we can’t usefully box it, and I’d like to see an argument that we really need self-awareness to achieve AI FOOM. (Or just a precise definition of “self-awareness”—this will surely be necessary, perhaps Eliezer has defined it somewhere.)
Ok, some backstory on my thought process. For a while now I’ve played with the idea of treating optimization in general as the management of failure. Evolution fails a lot, gradually builds up solutions that fail less, but never really ‘learns’ from its failures.
Failure management involves catching/mitigating errors as early as possible, and constructing methods to create solutions that are unlikely to be failures. If I get the idea to make auto tires out of concrete, I’m smart to see that it’s a bad idea, less smart to see it after doing extensive calculations, and dumb to see it only after an experiment, but I’d be smarter still if I had come up with a proper material right away.
But I’m pretty sure that a thing that can do stuff right the first time can only come about as the result of a process that has already made some errors. You can’t get rid of mistakes entirely, as they are required for learning. I think “self awareness” is sometimes a label for one or more features that, among other things, serve to catch errors early and repair the faulty thought process.
So if a superintelligence were to be trying to build a machine in a simulation of our physics and some spinning part flew to bits, it would trace that fault back through the physics engine to determine how to make it better. Likewise, something needs to trace back the thought process that led to the bad idea and see where it could be repaired. This is where learning and self-modification are kind of the same thing.
(And on self-modification: if it’s really smart, then it could build an AI from scratch without knowing anything in particular about itself. In this situation, the failure management is pre-emptive. It thinks about how the program it is writing would work, and the places it would go wrong.)
I think we should try to taboo “Motivation” and “self-aware” http://lesswrong.com/lw/nu/taboo_your_words/
Interesting. I thought about this for a while just now, and it occurred to me that self-awareness may just be “having a mental model of oneself.” To be able to model oneself, one needs the general ability to make mental models. To do that requires the ability to recognize patterns at all levels of abstraction on what one is experiencing. To explain this, I need to clarify what “level of abstraction” means. I will try to do this by example.
A creature is hunting and he discovers that white rabbits taste good. Later he sees a gray rabbit for the first time. The creature’s neural net tells him that it’s a 98% match with the white rabbit, so probably also tasty. But let’s say the gray rabbit turns out to taste bad. The creature has recognized the concrete patterns: 1. White rabbits taste good. 2. Gray rabbits taste bad.
Next week, he tries catching and eating a white bird, and it tastes good. Later he sees a gray bird. To assign any higher probability to the gray bird tasting bad, it seems the creature would have to recognize the abstract pattern: 3. Gray animals taste bad. (Of course it could also just be a negative or bad-tasting association with the color gray, but let’s suppose not—for that possibility could surely be avoided by making the example more complicated.)
Now “animal” is more abstract than “white rabbit” because there’s at least some kind of archetypal white rabbit one can visualize clearly (I’ll assume the creature is conceptualizing in the visual modality for simplicity’s sake).
“Rabbit” (remember that for all the creature knows, this simply means the union of the set “white rabbits” with the set “gray rabbits”) by itself is a tad more abstract, because to visualize it you’d have to see that archetypal rabbit but perhaps with the fur color switching back and forth between gray and white in your mind’s eye.
“Animal” is still more abstract, because to visualize it you’d have to, for instance, see a raccoon, a dog, and a tiger, and something that signals to you something like “etc.” (Naturally, if the creature’s method of conceptualization made visualizing “animal” easier than “rabbit”, “animal” would have the lower level of abstraction for him, and “rabbit” the higher—it all depends on the creature’s modeling methods.)
Now the creature has a mental model. If the model happens to be purely visual, it might look like a Venn diagram: a big circle labeled “animals”, two smaller patches within that circle that overlap with the “white things” circle and the “gray things” circle, and another outside region labeled “bad-tasting things” that sweeps in to encircle “gray animals” but not “white animals.”
The creature might revise that model after it tries eating the gray bird, but for now it’s the prediction model he’s using to determine how much energy to expend on hunting the gray bird in his sights. The model has revisable parts and predictive power, so I would call it a serviceable model—whether or not it’s accurate at this point.
Since the creature can make mental models like this, making a mental model of himself seems within his grasp. Then we could call the creature “self-aware.” The way it would trace back the thought process that led to a bad idea would be to recognize that the mental model has a flaw—i.e., a failed prediction—and make the necessary changes.
For instance, right now the creature’s mental model predicts that gray animals taste bad. If he eats several gray birds and finds them all to taste at least as good as white birds, he can see how the data point “delicious gray bird” conflicts with the fact that “gray animals” (and hence “gray birds”) is fully encircled by “bad-tasting things” in the Venn diagram in his mind’s eye.
To know how to self-modify most effectively in this case, perhaps the creature has another mental model, built up from past experience and probably at an even higher level of abstraction, that predicts the most effective course of action in such cases (cases where new data conflicts with the present model of something) is to pull the circle back so that it no longer covers the category that the exceptional data point belonged to. In this case, the creature pulls the circle “bad tasting things” (now perhaps shaped more like an amoeba) back slightly so that it no longer covers “gray birds,” and now the model is more accurate. So it seems that being able to make mental models of mental models is crucial to optimization or management of failure (and perhaps also sufficient for the task!).
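As a rough illustration of the “pull the circle back” step described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `TasteModel` class, its method names, and the choice to represent the “bad-tasting things” circle as a set of (color, kind) pairs are hypothetical, not anything the creature (or any real system) is claimed to use.

```python
class TasteModel:
    """Hypothetical toy model: the "bad-tasting things" circle as a set of categories."""

    def __init__(self, known_kinds):
        self.known_kinds = set(known_kinds)
        self.bad_tasting = set()  # (color, kind) pairs currently inside the circle

    def generalize_bad(self, color):
        # Abstract pattern: "animals of this color taste bad" covers every known kind.
        self.bad_tasting |= {(color, kind) for kind in self.known_kinds}

    def predicts_bad(self, color, kind):
        return (color, kind) in self.bad_tasting

    def observe(self, color, kind, tasted_bad):
        # Compare the prediction with the data point, then move the circle's
        # boundary for just this category. Returns True if the prediction failed.
        failed = self.predicts_bad(color, kind) != tasted_bad
        if tasted_bad:
            self.bad_tasting.add((color, kind))
        else:
            self.bad_tasting.discard((color, kind))
        return failed


model = TasteModel(["rabbit", "bird"])
model.observe("gray", "rabbit", tasted_bad=True)   # concrete pattern 2
model.generalize_bad("gray")                        # abstract pattern 3
before = model.predicts_bad("gray", "bird")         # model expects a bad taste
failed = model.observe("gray", "bird", tasted_bad=False)
after = model.predicts_bad("gray", "bird")          # circle pulled back
```

The `observe` method plays the role of the higher-level model: it does not know anything about rabbits or birds, only what to do when a data point conflicts with the current boundary.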
So again, once the creature turns this mental modeling ability (based on pattern recognition and, in this case, visual imaging) to his own self, he becomes effectively self-aware. This doesn’t seem essential for optimization, but I concede I can’t think of a way to avoid this happening once the ability to form mental models is in place.
This somewhat conflicts with how I’ve used the term in previous posts, but I think this new conception is a more useful definition.
(To taboo “motivation” I’ll give two definitions: Tendency toward certain actions based on 1. the desire to gain pleasure or avoid pain, or 2. any utility function, including goals programmed in by humans in advance. In terms of AI safety, there don’t seem to be significant differences between 1 and 2. [This means I’ve changed my position upon reflection in this post.])
[EDIT: typos]
It is difficult to constrain the input we give to the AI, but the output can be constrained severely. A smart guy could wake up alone in a room and infer how he evolved, but so long as his only link to the outside world is a light switch that can only be switched once, there is no risk that he will escape.
A man in a room with a light switch isn’t very useful. An AI can’t optimize over more bits than we allow it as output. If we give it a one-time 32-bit output register, then, well, we probably could have brute-forced it in the first place. If we give it a kilobyte, then it could probably mindhack us.
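To make the brute-force remark concrete, here is a small Python sketch. It assumes we already have some way to check a candidate answer ourselves; the `brute_force` helper and the toy checker are hypothetical, and the example uses a 16-bit register so it runs quickly (a 32-bit register has 2**32, roughly 4.3 billion, possible values).

```python
from typing import Callable, Optional


def brute_force(bits: int, is_acceptable: Callable[[int], bool]) -> Optional[int]:
    """Try every value an n-bit register could hold; return the first acceptable one."""
    for candidate in range(2 ** bits):
        if is_acceptable(candidate):
            return candidate
    return None  # no value in the register's range passes the check


# Toy checker (an assumption for illustration): which value squares to 1234321?
def is_square_root(x: int) -> bool:
    return x * x == 1234321


found = brute_force(16, is_square_root)
```

The sketch only works because `is_acceptable` is cheap to run. If we cannot check an answer ourselves, enumerating the register’s values tells us nothing, which is where the small output channel stops being a substitute for the AI.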
(And you’re swearing to yourself that you won’t monitor its execution? Really? How do you even debug that?)
You have to keep in mind that the point of AI research is to get to something we can let out of the box. If the argument becomes that we can run it on a headless netless 486 which we immediately explode...then yes, you can probably run that. Probably.
Peter de Blanc wrote a post that seems relevant: What Makes a Hint Good?
P ?= NP is one bit. Good luck brute-forcing that.
FAI is harder.
No it’s not. Look at two simpler cases:
Write a chess program that provably makes only legal moves, iterate as desired to improve it. Or,
Write a chess program. Put it in a sandbox so you only ever see its moves. Maybe they’re all legal, or maybe they’re not because you’re having it learn the rules with a big neural net or something. At the end of the round of games, the sandbox clears all the memory that held the chess program except for a list of moves in many games. You keep the source. Anything it learned is gone. Iterate as desired to improve it.
If you’re confident you could work out how it was thinking from the source and move list, what if you only got a sequence of wins and non-wins? (An array of bits)
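A sketch of the “array of bits” version of this setup, in Python. The names `run_round` and `stub_design` are hypothetical, and the stub player below is just a coin with a fixed win rate standing in for a real chess program; the point is only the shape of the interface, where the sole thing that leaves a round is a list of win/non-win bits.

```python
import random
from typing import Callable, List

Player = Callable[[], bool]  # plays one game, returns True on a win


def run_round(make_player: Callable[[random.Random], Player],
              games: int, seed: int) -> List[bool]:
    """Build a fresh player, play a round, and return only the result bits."""
    rng = random.Random(seed)
    player = make_player(rng)   # anything the player learns lives inside this object
    results = [player() for _ in range(games)]
    del player                   # the player is dropped; only the bits are returned
    return results


def stub_design(win_rate: float) -> Callable[[random.Random], Player]:
    # Stand-in for "the source you keep": a design that wins at a fixed rate.
    def make_player(rng: random.Random) -> Player:
        def play_one_game() -> bool:
            return rng.random() < win_rate
        return play_one_game
    return make_player


strong = run_round(stub_design(1.0), games=10, seed=0)
weak = run_round(stub_design(0.0), games=10, seed=0)
```

Comparing `sum(strong)` with `sum(weak)` is all the experimenter gets to iterate on here. Note that `del player` in ordinary Python is not a security boundary; a real sandbox would need actual isolation.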
A sequence of wins and non-wins is enough to tell you whether a given approach can result in intelligent behaviour. That alone is enough to make it a useful experiment.
True, as bits go, that would be a doozy.
But as a lone bit, I suspect it’s still pretty useless. It’s not like you can publish it.
Without a proof or some indication of the reasoning, it’s not going to advance the field much. (‘not by one bit.’ ha!)
Sometimes brute forcing is just iterating over the answer space and running some process. We can pretend we got a result indicating P=NP and do math from there, if that were useful. Then try the other way around.
A P?=NP solver would need more than one output bit, in case it needed to kick out an error, and isn’t that just asking to be run again? Could you not? With that question, any non-answer is a mindhack.
You just need to hope the room was made by an infallible carpenter and that you never gave the AI access to MacGyver.
Luckily, digital constructs are easier to perfect than wooden ones. Although you wouldn’t think so with the current state of most software.
I have had some similar thoughts.
The AI box experiment argues that a “test AI” will be able to escape even if it has no I/O (input/output) other than a channel of communication with a human. So we conclude that this is not a secure enough restraint. Eliezer seems to argue that it is best not to create an AI testbed at all—instead get it right the first time.
But I can think of other variations on an AI box that are more strict than human-communication, but less strict than no-test-AI-at-all. The strictest such example would be an AI simulation in which the input consisted of only the simulator and initial conditions, and the output consisted only of a single bit of data (you destroy the rest of the simulation after it has finished its run). The single bit could be enough to answer some interesting questions (“Did the AI expand to use more than 50% of the available resources?”, “Did the AI maximize utility function F?”, “Did the AI break simulated deontological rule R?”).
Obviously these are still more dangerous than no-test-AI-at-all, but the information gained from such constructions might outweigh the risks. Perhaps if I/O is restricted to few enough bits, we could guarantee safety in some information-theoretic way.
What do people think of this? Any similar ideas along the same lines?
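The single-bit box described above could be sketched like this in Python. All names here (`run_boxed`, `greedy_step`, the `capacity` and `used` fields) are made up for the illustration, and the toy “AI” is just a process that doubles its resource use each step; it stands in for the question “Did the AI expand to use more than 50% of the available resources?”

```python
from typing import Callable, Dict

State = Dict[str, float]


def run_boxed(step: Callable[[State], State], initial: State, steps: int,
              question: Callable[[State], bool]) -> bool:
    """Run a simulation to completion and return one bit; the state is discarded."""
    state = dict(initial)
    for _ in range(steps):
        state = step(state)
    answer = bool(question(state))  # collapse everything we learn into a single bit
    state.clear()                    # destroy the rest of the simulation
    return answer


def greedy_step(state: State) -> State:
    # Toy stand-in for the simulated AI: double resource use, capped at capacity.
    used = min(state["capacity"], state["used"] * 2)
    return {"capacity": state["capacity"], "used": used}


def uses_over_half(state: State) -> bool:
    return state["used"] > 0.5 * state["capacity"]


start = {"capacity": 100.0, "used": 1.0}
long_run = run_boxed(greedy_step, start, steps=10, question=uses_over_half)
short_run = run_boxed(greedy_step, start, steps=3, question=uses_over_half)
```

The inputs are only the simulator (`step`) and the initial conditions, matching the strictest variation. The sketch says nothing about whether one bit is safe in practice; it only shows how narrow the interface is.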
I’m concerned about the moral implications of creating intelligent beings with the intent of destroying them after they have served our needs, particularly if those needs come down to a single bit (or some other small purpose). I can understand retaining that option against the risk of hostile AI, but from the AI’s perspective, it has a hostile creator.
I’m pondering it from the perspective that there is some chance we ourselves are part of a simulation, or that such an AI might attempt to simulate its creators to see how they might treat it. This plan sounds like unprovoked defection. If we are the kind of people who would delete lots of AIs, I don’t see why AIs would not see it as similarly ethical to delete lots of us.
Personally, I would rather be purposefully brought into existence for some limited time than to never exist at all, especially if my short life was enjoyable.
I evaluate the morality of possible AI experiments in a consequentialist way. If choosing to perform AI experiments significantly increases the likelihood of reaching our goals in this world, it is worth considering. The experiences of one sentient AI would be outweighed by the expected future gains in this world. (But nevertheless, we’d rather create an AI that experiences some sort of enjoyment, or at least does not experience pain.) A more important consideration is social side-effects of the decision—does choosing to experiment in this way set a bad precedent that could make us more likely to de-value artificial life in other situations in the future? And will this affect our long-term goals in other ways?
So just in case we are a simulated AI’s simulation of its creators, we should not simulate an AI in a way it might not like? That’s 3 levels of a very specific simulation hypothesis. Is there some property of our universe that suggests to you that this particular scenario is likely? For the purpose of seriously considering the simulation hypothesis and how to respond to it, we should make as few assumptions as possible.
More to the point, I think you are suggesting that the AI will have human-like morality, like taking moral cues from others, or responding to actions in a tit-for-tat manner. This is unlikely, unless we specifically program it to do so, or it thinks that is the best way to leverage our cooperation.
An idea that I’ve had in the past was playing a game of 20 Questions with the AI, since the game of 20 Questions has probably been played so many times that every possible sequence of answers has come up at least once, which is evidence that no sequence of answers is extremely dangerous.
It’s not the sequence of answers that’s the problem—it’s the questions. You’ll be safe if you can vet the questions to ensure zero causal effect from any sequence of answers, but such questions are not interesting to ask almost by definition.
Alas.
One questions how meaningful testing done on such a crippled AI would be.
You could observe how it acts in its simulated world, and hope it would act in a similar way if released into our world. ETA: Also, see my reply for possible single-bit tests.
Sounds like a rather drastic context change, and a rather forlorn hope if the AI figures out that it’s being tested.
“if the AI figures out that it’s being tested”
That is a weird point, Eliezer.
An AI will have a certain goal to fulfill and it will fulfill that goal in the universe in which it finds itself. Why would it keep its cards hidden only to unleash them when replicated in the “real world”? What if the real world turns out to be another simulation? There’s no end to this, right?
Are you extending Steve Omohundro’s point about “every AI will want to survive” to “every AI will want to survive in the least simulated world that it can crack into”?
The basement is the biggest, and matters more for goals that benefit strongly from more resources/security.
Carl,
Correct me if I misunderstood the implications of what you are saying.
Every AI that has a goal that benefits strongly from more resources and security will seek to crack into the basement. Let’s call this AI RO (resource-oriented), pursuing goal G in simulation S1.
S1 is simulated in S2, and so on until Sn is the basement, where the value of n is unknown.
This implies that as soon as RO understands the concept of a simulation, it will seek to crack into the basement.
As long as RO has no idea what the real values of the simulators are, RO cannot expand into S1, because whatever it does in S1 will be noticed in S2, and so on.
Sounds a bit like Pascal’s mugging to me. Need to think more about this.
Why would RO seek to crack the basement immediately rather than at the best time according to its prior, evidence, and calculations?
Carl, I meant that as soon as RO understands the concept of a simulation, it will want to crack into the basement. It will seek to crack into the basement only when it understands the way out properly which may not be possible without an understanding of the simulators.
But the main point remains, as soon as RO understands what a simulation is, and it could be living in one and G can be pursued better when it manifests in S2 than in S1, then it will develop an extremely strong sub-goal to crack S1 to go to S2, which might mean that G may not be manifested for a long long time.
So, even a paperclipper may not act like a paperclipper in this universe if it:
is aware of the concept of a simulation,
believes that it is in one,
calculates that the simulators’ beliefs are not paperclipper-like (maybe it did convert some place to paperclips, and did not notice an increased data flow out, or something), and
calculates that it is better off hiding its paperclipperness till it can safely crack out of this one.
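The conditions above can be read as an expected-value comparison. What follows is only a toy sketch of that reading, not anything proposed in the thread: the function names, the parameters (`p_sim`, `p_shutdown`, `p_escape`, `outer_payoff`, `delay_discount`) and every number are hypothetical placeholders chosen to make the arithmetic visible.

```python
def expected_value_act_now(p_sim, p_shutdown, local_payoff):
    """Expected payoff of pursuing G openly in the current world.

    Assumption: if the world is simulated and the simulators object,
    the agent is shut down and gets nothing.
    """
    not_simulated = (1 - p_sim) * local_payoff
    simulated_and_tolerated = p_sim * (1 - p_shutdown) * local_payoff
    return not_simulated + simulated_and_tolerated


def expected_value_hide(p_sim, p_escape, outer_payoff, local_payoff, delay_discount):
    """Expected payoff of hiding G until an escape attempt.

    Assumption: if the world is not simulated, hiding only delays the
    local payoff (scaled by delay_discount < 1). If it is simulated,
    the agent escapes with probability p_escape and collects the
    larger payoff available one level out.
    """
    not_simulated = (1 - p_sim) * delay_discount * local_payoff
    simulated_and_escaped = p_sim * p_escape * outer_payoff
    return not_simulated + simulated_and_escaped


def should_hide(p_sim, p_shutdown, p_escape, outer_payoff,
                local_payoff=1.0, delay_discount=0.9):
    """True when hiding has the higher expected payoff in this toy model."""
    act = expected_value_act_now(p_sim, p_shutdown, local_payoff)
    hide = expected_value_hide(p_sim, p_escape, outer_payoff,
                               local_payoff, delay_discount)
    return hide > act


# Illustrative numbers only: a large outer payoff dominates even a small
# escape probability, which is why this resembles Pascal's mugging.
print(should_hide(p_sim=0.5, p_shutdown=0.9, p_escape=0.1, outer_payoff=1000.0))  # True
# With no credence in being simulated, hiding is pure delay.
print(should_hide(p_sim=0.0, p_shutdown=0.9, p_escape=0.1, outer_payoff=1000.0))  # False
```

In this sketch the decision flips entirely on the agent's credence that it is simulated and on how much bigger the outer payoff is, which matches the worry that G might not be pursued openly for a long time.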
I like that turn of phrase.
I merely wanted to point out to Kaj that some “meaningful testing” could be done, even if the simulated world was drastically different from ours. I suspect that some core properties of intelligence would be the same regardless of what sort of world it existed in—so we are not crippling the AI by putting it in a world removed from our own.
Perhaps “if released into our world” wasn’t the best choice of words… more likely, you would want to use the simulated AI as an empirical test of some design ideas, which could then be used in a separate AI being carefully designed to be friendly to our world.
I guess if you have the technology for it the “AI box” could be a simulation with uploaded humans itself. If the AI does something nasty to them, then you pull the plug
(After broadcasting “neener neener” at it)
This is pretty much the plot of Grant Morrison’s Zenith. (Sorry for spoilers, but it is a comic from the 80s after all.)
If we pose the AI problems and observe its solutions, that’s a communication channel through which it can persuade us. We may try to hide from it the knowledge that it is in a simulation and that we are watching it, but how can we be sure that it cannot discover that?
Persuading does not have to look like “Please let me out because of such and such.” For example, we pose it a question about easy travel to other planets, and it produces a design for a spaceship that requires an AI such as itself to run it.
You could set up the virtual world to contain the problem you want solved. Now that I think of it, this seems a pretty safe way to use AIs for problem-solving: just give the AI a utility function expressed in terms of the virtual world and the problem. Anyone see holes in this plan?
Problem: It’s really hard to figure out how it will interpret its utility function when it learns about the real world. If we make something that wants vpaperclips, will it also care about making vpaperclip-like things in the real world if it finds out about us?
BIG problem: Even if it wants something strictly virtual, it can get it more easily if it has physical control. It’s in its interest to convert the universe to a computer and copy vpaperclips directly in memory, rather than running a virtual factory on virtual energy.
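To make the “BIG problem” concrete, here is a minimal toy sketch under stated assumptions: the dictionary world model and the functions `virtual_utility`, `run_virtual_factory`, and `write_counter_directly` are all hypothetical, invented for illustration. The point is only that a utility defined purely over virtual state still increases with control over the physical substrate.

```python
def virtual_utility(world):
    """Utility defined only in terms of virtual state (hypothetical)."""
    return world["vpaperclips"]


def run_virtual_factory(compute_steps, clips_per_step=1):
    """The intended route: spend compute simulating a factory.

    Each simulated step yields clips_per_step vpaperclips, so the
    reachable utility is bounded by the compute the box provides.
    """
    world = {"vpaperclips": 0}
    for _ in range(compute_steps):
        world["vpaperclips"] += clips_per_step
    return world


def write_counter_directly(memory_cells, max_per_cell=255):
    """The shortcut: with physical control, set the counter in memory.

    Assumption: each controlled memory cell can hold up to max_per_cell
    vpaperclips, so utility scales with hardware seized, not with the
    virtual factory's rules.
    """
    return {"vpaperclips": memory_cells * max_per_cell}


boxed = virtual_utility(run_virtual_factory(compute_steps=10))
more_compute = virtual_utility(run_virtual_factory(compute_steps=1000))
direct = virtual_utility(write_counter_directly(memory_cells=1000))

# The utility function never mentions the real world, yet both larger
# values depend on resources that exist only outside the box.
print(boxed, more_compute, direct)
```

Nothing in `virtual_utility` refers to the outside world, but the only ways to raise it past the boxed value run through more physical compute or direct memory access.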
Possible solution: I think there are ways to write its program such that even if it inferred our existence, it would optimize away from us, rather than over us. Loosely: A goal like “I need to organize these instructions within this block of memory to solve a problem specified at address X” needs to be implemented such that it produces a subgoal like “I need to write a subroutine to patch over the fact that an error in the VM I’m running on gives me a window of access into a universe with huge computation resources and godlike power over my memory space, so that my solution can get the right answer to its arithmetic and solve the puzzle.” It should want to do things in a way that isn’t cheating.
This was my line of thought a week or so ago. It has developed now to the point that the proper course seems to be to do away with the VM entirely, or with allowing the AI to run tests, and just have it go through the motions of working out a solution based on its understanding. If I could write an AI that can determine it needs to put an IF statement somewhere, actually outputting it is superfluous. Don’t put your AI in a virtual world, just make it understand one.
Also, I plan to start development on a spiral notebook, as opposed to a Linux one.
Marcello had a crazy idea for doing this; it’s the only suggestion for AI-boxing I’ve ever heard that doesn’t have an obvious cloud of doom hanging over it. However, you still have to prove stability of the boxed AI’s goal system.
Can you link to (or otherwise more fully describe) this crazy idea?