I think you misunderstand what I mean by proposing that the AI wants to stay inside the box.
I agree that your and my understanding of the phrase “stay inside the box” differ. What I’m trying to do is point out that I don’t think your understanding carves reality at the joints. In order for the AI to stay inside the box, the box needs to be defined in machine-understandable terms, not human-inferrable terms.
I mean that the AI wouldn’t want to do anything at all to increase its power base, that it would only be willing to talk to the gatekeepers.
Each half of this sentence has a deep problem. Wouldn’t correctly answering the questions of or otherwise improving the lives of the gatekeepers increase the AI’s power base, since the AI has the ability to communicate with the gatekeepers?
The problem with restrictions like “only be willing to talk” is that they restrict the medium but not the content. So, the AI has a text-only channel that goes just to the gatekeepers- but that doesn’t restrict the content of the messages the AI can send to the gatekeepers. The fictional Cthaeh only wants to talk to its gatekeepers- and yet it still manages to get done what it wants to get done. Words have impacts, and it should be anticipated that the AI picks words because of their impacts.
Sure, the AI can manipulate gatekeepers. But this is a major improvement. You miss my point.
The Cthaeh is very limited by being trapped in its tree and only able to talk to passersby. The UFAI would be limited by being trapped in its text-only communication channel. It wouldn’t be able to do things like tell the gatekeepers to plug it into the Internet or to directly control an autonomous army of robots; it would be forced instead to use the gatekeepers as its appendages, and the gatekeepers have severe limitations on brain capacity and physical strength. I think that if we did this and kept careful watch on the gatekeepers and used some other safety measures, boxing an AI would become feasible.
Now, I’m still not advocating using boxed AI willy-nilly, but I think using it as a cautionary measure once we think we’ve got FAI figured out would be a very, very, very good idea, because FAI seems ridiculously hard and there’s no way to know what we don’t know, so there might be unseen problems with whatever final FAI we build. There’s at least a chance that boxing the supposed FAI would help us, and with existential threats you should take advantage of whatever possibilities you can to tilt towards favorable outcomes, even if the probability seems minuscule.
That’s not to say that I think the probability that this would stop a disaster is minuscule, of course (I think it’s rather high, in the event that the FAI was actually UFAI), but just to say that I think it would be a good idea even if my probability assessments are mostly wrong.
Over the AI being embodied in a giant death robot, sure. But limiting the AI to a text-only channel is an old idea, and what Eliezer used for the “let me out of the box” demonstrations.
One of the things to note is that the limitations on the AI don’t limit anything the AI creates. If the AI gives you a blueprint for a nanomachine that purports to cure cancer, but when you make it it turns out to actually be grey goo, then humans lose, even though the AI never left the box. (Note that a worst-case AI is very subtle- suppose the nanomachine actually does cure cancer when built correctly, but a one-in-a-trillion manufacturing error will make it grey goo. Since you’re making trillions of them, you then get grey goo, and humans lose.)
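(A quick sanity check on that parenthetical, added here as a sketch; the only inputs are the numbers stated above.)

```python
# Chance that at least one unit out of ~a trillion hits the one-in-a-trillion
# manufacturing error described above.
p_error = 1e-12            # per-unit probability of the grey-goo variant
n_built = 1e12             # "you're making trillions of them"
p_at_least_one = 1 - (1 - p_error) ** n_built
print(p_at_least_one)      # ~0.63 -- better-than-even odds of at least one bad unit
```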
Trying to formally specify “don’t give us any plans that will build an AI” or “don’t give us any plans that will cause humans to lose” without silencing the AI completely is a genuinely difficult affair.
Basically, if you assume that the AI is tricky enough to circumvent any medium restrictions you place on it, then the only way to avoid “humans lose” is to have its goal be “humans win,” which is actually a pretty complicated goal. Expressing that goal in a machine-understandable way is pretty much the FAI problem.
The entire point of Eliezer’s demonstration was that if an AI wants to, it can increase its power base even starting from a text-only communication system. The entire point of my idea is that we can just build the AI such that it doesn’t want to leave the box or increase its power base. It dodges that entire problem; that’s the whole point.
You’ve gotten so used to being scared of boxed AI that you’re reflexively rejecting my idea, I think, because your above objection makes no sense at all and is obviously wrong upon a moment’s reflection. All of my bias-alarms have been going off since your second comment reply, so please evaluate yourself and try to distance yourself from your previous beliefs, for the sake of humanity. Also, here is a kitten; unless you want it to die, please reevaluate: http://static.tumblr.com/6t3upxl/Aawm08w0l/khout-kitten-458882.jpeg
Limitations on the AI restrict the range of things that the AI can create. Yes, if we just built whatever the AI told us to and the AI was unfriendly, then we would lose. Obviously. Yes, if we assume that the UFAI is tricky enough to “circumvent any medium restrictions [we] place on it” then we would lose, practically by definition. But that assumption isn’t warranted. (These super weak strawmen were other indications to me that you might be being biased on this issue.)
I think a key component of our disagreement here might be that I’m assuming that the AI has a very limited range of inputs, that it could only directly perceive the text messages that it would be sent. You’re either assuming that the AI could deduce the inner workings of our facility and the world and the universe from those text messages, or that the AI had access to a bunch of information about the world already. I disagree with both assumptions: the AI’s direct perception could be severely limited and should be, and it isn’t magic, so it couldn’t deduce the inner workings of our economy or the nature of nuclear fusion through deduction alone (because knowledge comes from experience and induction). (You might not be making either of those assumptions; this is a guess in an attempt to help resolve our disagreement more quickly, sorry if it’s wrong.)
Also, I’m envisioning a system where people that the AI doesn’t know and that the Gatekeepers don’t know about observe their communications. That omitted detail might be another reason for your disagreement, I just assumed it would be apparent for some stupid reason, my apologies.
I think we would have to be careful about what questions we asked the AI. But I see no reason why it could manipulate us automatically and inevitably, no matter what questions we asked it. I think extracting useful information from it would be possible, perhaps even easy. An AI in a box would not be God in a box, and I think that you and other people sometimes accidentally forget that. Just because it’s dozens or hundreds of times smarter than us doesn’t mean that we can’t win, perhaps win easily, provided that we make adequate preparations for it.
Also, the other suggestions in my comment were really meant to supplement this. If the AI is boxed, and can be paused, then we can read all its thoughts (slowly, but reading through its thought processes would be much quicker than arriving at its thoughts independently) and scan for the intention to do certain things that would be bad for us. If it’s probably an FAI anyway, then it doesn’t matter if the box happens to be broken. If we’re building multiple AIs and using them to predict what other AIs will do under certain conditions, then we can know whether or not AIs can be trusted (use a random number generator at certain stages of the process to prevent it from reading our minds, and hide the knowledge of the random number generator). These protections are meant to work with each other, not independently.
And I don’t think it’s perfect or even good, not by a long shot, but I think it’s better than building an unboxed FAI because it adds a few more layers of protection, and that’s definitely worth pursuing because we’re dealing with freaking existential risk here.
The entire point of my idea is that we can just build the AI such that it doesn’t want to leave the box or increase its power base.
Let’s return to my comment four comments up. How will you formalize “power base” in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I think, because your above objection makes no sense at all and is obviously wrong upon a moment’s reflection.
If you would like to point out a part of the argument that does not follow, I would be happy to try and clarify it for you.
I think a key component of our disagreement here might be that I’m assuming that the AI has a very limited range of inputs, that it could only directly perceive the text messages that it would be sent.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it’s not going to make humans lose- but it’s also not going to cure cancer for us.
If you have an AI that’s useful, it must be because it’s clever and it has data. If you type in “how do I cure cancer without reducing the longevity of the patient?” and expect to get a response like “1000 ccs of Vitamin C” instead of “what do you mean?”, then the AI should already know about cancer and humans and medicine and so on.
If the AI doesn’t have this background knowledge- if it can’t read Wikipedia and science textbooks and so on- then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn’t seem very useful as a security measure.
If the AI is boxed, and can be paused, then we can read all its thoughts (slowly, but reading through its thought processes would be much quicker than arriving at its thoughts independently) and scan for the intention to do certain things that would be bad for us.
It’s already difficult to understand how, say, face recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can’t tell if the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don’t end with a neat “humans_lose=1” that we can look at and say “hm, maybe we shouldn’t implement this plan.”
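(For readers who haven’t met eigenfaces: they are just the principal components of a set of face images, which is why no single component comes with an explanation attached. A minimal sketch, with made-up data:)

```python
import numpy as np

# Toy eigenface computation on fake data: 200 random "faces" of 64x64 pixels.
rng = np.random.default_rng(0)
faces = rng.random((200, 64 * 64))
centered = faces - faces.mean(axis=0)

# The principal components of the face set are the "eigenfaces".
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt.reshape(-1, 64, 64)

# Each training face is now a weight vector over these components; nothing in
# eigenfaces[14] records *why* it ended up emphasizing lips rather than
# cheekbones -- that information is smeared across the whole training set.
weights = centered @ vt.T
```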
In practice, debugging is much more effective at finding the source of problems after they’ve manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we’ll have access to all of the 0s and 1s.
And I don’t think it’s perfect or even good, not by a long shot, but I think it’s better than building an unboxed FAI because it adds a few more layers of protection, and that’s definitely worth pursuing because we’re dealing with freaking existential risk here.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don’t think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
Let’s return to my comment four comments up. How will you formalize “power base” in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I won’t. The AI can do whatever it wants to the gatekeepers through the text channel, and won’t want to do anything other than act through the text channel. This precaution is a way to use the boxing idea for testing, not an idea for abandoning FAI wholly.
If you would like to point out a part of the argument that does not follow, I would be happy to try and clarify it for you.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it’s not going to make humans lose- but it’s also not going to cure cancer for us. If you have an AI that’s useful, it must be because it’s clever and it has data. If you type in “how do I cure cancer without reducing the longevity of the patient?” and expect to get a response like “1000 ccs of Vitamin C” instead of “what do you mean?”, then the AI should already know about cancer and humans and medicine and so on. If the AI doesn’t have this background knowledge- if it can’t read Wikipedia and science textbooks and so on- then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn’t seem very useful as a security measure.
I agree, the way that I’m proposing to do AI is very limited. I myself can’t think of what questions might be safe. But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn’t come up with answers. I’m sort of shelving this as a subproject of this project, but one that seems feasible to me based on what I know.
Also, perhaps we could just ask it hundreds of hypothetical questions based on conditions that don’t really exist, and then ask it a real question based on conditions that do exist, and trick it, or something.
It’s already difficult to understand how, say, face recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can’t tell if the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don’t end with a neat “humans_lose=1” that we can look at and say “hm, maybe we shouldn’t implement this plan.”
In practice, debugging is much more effective at finding the source of problems after they’ve manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we’ll have access to all of the 0s and 1s.
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we’d built the AI then we’d have enough knowledge to read its mind. It wouldn’t just magically appear; it would only do things in the way we’d told it to. It would probably be hard, but I think also probably doable if we were very committed.
I could be wrong here because I’ve got no coding experience, just ideas from what I’ve read on this site.
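(For what it’s worth, here is roughly the kind of legible, tagged goal structure the proposal above seems to assume the AI would have. This is a sketch under that assumption; the field names and the forbidden list are invented.)

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    description: str
    kind: str                               # "absolute" or "instrumental" (hypothetical tags)
    children: list["Goal"] = field(default_factory=list)

# Hypothetical red-flag list the overseers would scan for.
FORBIDDEN = ("acquire hardware", "copy self outside the box", "deceive overseers")

def scan(goal: Goal) -> list[Goal]:
    """Walk the (assumed) goal tree and collect anything matching the red-flag list."""
    hits = [goal] if any(f in goal.description.lower() for f in FORBIDDEN) else []
    for child in goal.children:
        hits.extend(scan(child))
    return hits

root = Goal("answer gatekeeper questions", "absolute",
            children=[Goal("model cancer biology", "instrumental"),
                      Goal("acquire hardware for faster search", "instrumental")])
print([g.description for g in scan(root)])   # ['acquire hardware for faster search']
```

Whether a real optimizer’s goals would ever sit still in a structure this tidy is exactly what gets disputed below.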
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don’t think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
The risk of distraction is outweighed by the risk that this idea disappears forever, I think, since I’ve never seen it proposed elsewhere on this site.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Well, he demonstrated that it can sometimes get out. But my claim was that “getting out” isn’t the scary part- the scary part is “reshaping the world.” My brain can reshape the world just fine while remaining in my skull and only communicating with my body through slow chemical wires, and so giving me the goal of “keep your brain in your skull” doesn’t materially reduce my ability or desire to reshape the world.
And so if you say “well, we’ll make the AI not want to reshape the world,” then the AI will be silent. If you say “we’ll make the AI not want to reshape the world without the consent of the gatekeepers,” then the gatekeepers might be tricked or make mistakes. If you say “we’ll make the AI not want to reshape the world without the informed consent of the gatekeepers / in ways which disagree with the values of the gatekeepers,” then you’re just saying we should build a Friendly AI, which I agree with!
But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn’t come up with answers.
It’s easy to write a safe AI that can only answer one question. How do you get from point A to point B using the road system? Ask Google Maps, and besides some joke answers, you’ll get what you want.
When people talk about AGI, though, they mean an AI that can write those safe AIs. If you ask it how to get from point A to point B using the road system, and it doesn’t know that Google Maps exists, it’ll invent a new Google Maps and then use it to answer that question. And so when we ask it to cure cancer, it’ll invent medicine-related AIs until it gets back a satisfactory answer.
The trouble is that the combination of individually safe AIs is not a safe AI. If we have a driverless car that works fine with human-checked directions, and direction-generating software that works fine for human drivers, plugging them together might result in a car trying to swim across the Atlantic Ocean. (Google has disabled the swimming answers, so Google Maps no longer provides them.) The more general point is that software is very bad at doing sanity checks that humans don’t realize are hard, and if you write software that can do those sanity checks, it has to be a full AGI.
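(A toy version of that composition failure, with every name invented for illustration:)

```python
# Two components that are each "safe" with a human in the loop.

def plan_route(start: str, end: str) -> list[str]:
    # Stand-in for a direction generator that can emit a joke or degenerate leg,
    # the way early map services suggested swimming an ocean.
    return ["drive to the coast", "swim across the Atlantic Ocean", "drive to destination"]

def execute(step: str) -> None:
    # Stand-in for a controller that faithfully does whatever step it is handed.
    print(f"executing: {step}")

# Chained together, the sanity check the human used to perform silently disappears.
for step in plan_route("New York", "London"):
    execute(step)
```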
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we’d built the AI then we’d have enough knowledge to read its mind.
A truism in software is that code is harder to read than write, and often the interesting AIs are the nth generation AIs- where you build an AI that builds an AI that builds an AI (and so on), and it turns out that an AI thought all of the human-readability constraints were cruft (because the AI really does run faster and better without those restrictions).
A truism in software is that code is harder to read than write
Another truism is that truisms are untrue things that people say anyway.
Examples of code that is easier to read than write include those where the code represents a deep insight that must be discovered in order to implement it. This does not apply to most examples of software that we use to automate minutiae, but it could potentially apply to the core elements of a GAI’s search procedure.
The above said, I of course agree that the thought of being able to read the AI’s mind is ridiculous.
Examples of code that is easier to read than write include those where the code represents a deep insight that must be discovered in order to implement it.
Unless you also explain that insight in a human-understandable way through comments, it doesn’t follow that such code is easier to read than write, because the reader would then have to have the same insight to figure out what the hell is going on in the code.
Unless you also explain that insight in a human-understandable way through comments, it doesn’t follow that such code is easier to read than write, because the reader would then have to have the same insight to figure out what the hell is going on in the code.
For example, being given code that simulates relativity before Einstein et al. discovered it would have made discovering relativity a lot easier.
Well, yeah, code fully simulating SR and written in a decent way would, but code approximately simulating collisions of ultrarelativistic particles with hand-coded optimizations… not sure.
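(A tiny concrete instance of the point being argued here, added as an example from neither commenter: the code below is easy to run and hard to learn from, because the insight that justifies the middle line is not in the code unless a comment puts it there.)

```python
def max_subarray_sum(xs: list[int]) -> int:
    # Kadane's algorithm. The entire insight is this one fact: the best sum
    # ending at position i either extends the best sum ending at i-1 or starts fresh.
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

print(max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))   # 6  (the slice [4, -1, 2, 1])
```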
I of course agree that the thought of being able to read the AI’s mind is ridiculous.
It’s not transparently obvious to me why this would be “ridiculous”; care to enlighten me? Building an AI at all seems ridiculous to many people, but that’s because they don’t actually think about the issue, because they’ve never encountered it before. It really seems far more ridiculous to me that we shouldn’t even try to read the AI’s mind, when there’s so much at stake.
AIs aren’t Gods; with time and care and lots of preparation, reading their thoughts should be doable. If you disagree with that statement, please explain why. Rushing things here seems like the most awful idea possible, and I really think it would be worth the resource investment.
Sure, possible. Just a lot harder than creating an FAI to do it for you—especially when the AI has an incentive to obfuscate.
Why are you so confident that the first version of FAI we make will be safe? Doing both is safest and seems like it would be worth the investment.
I’m not. I expect it to kill us all with high probability (which is nevertheless lower than the probability of obliteration if no FAI is actively attempted.)
Humans reading computer code aren’t gods either. How long until an uFAI would get caught if it did stuff like this?
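(To make the “humans reading code aren’t gods” point concrete, here is a generic stand-in of my own, not whatever the comment above linked to: a few innocent-looking lines whose flaw does not announce itself on inspection.)

```python
def check_pin(entered: str, actual: str) -> bool:
    # Reads like a perfectly careful comparison, but the early exits leak
    # timing information: wrong guesses that share a longer prefix with the
    # real PIN take measurably longer to reject. Nothing here looks malicious.
    if len(entered) != len(actual):
        return False
    for a, b in zip(entered, actual):
        if a != b:
            return False
    return True
```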
It would be very hard, yes. I never tried to deny that. But I don’t think it’s hard enough to justify not trying to catch it.
Also, you’re only viewing the “output” of the AI, essentially, with that example. If you could model the cognitive processes of the authors of secretly malicious code, then it would be much more obvious that some of their (instrumental) goals didn’t correspond to the ones that you wanted them to be achieving. The only way an AI could deceive us would be to deceive itself, and I’m not confident that an AI could do that.
That’s not the same as “I’m confident that an AI couldn’t do that”, is it?
At the time, it wasn’t the same.
Since then, I’ve thought more, and gained a lot of confidence on this issue. Firstly, any decision made by the AI to deceive us about its thought processes would logically precede anything that would actually deceive us, so we don’t have to deal with the AI hiding its previous decision to be devious. Secondly, if the AI is divvying its own brain up into certain sections, some of which are filled with false beliefs and some of which are filled with true ones, it seems like the AI would render itself impotent on a level proportionate to the extent that it filled itself with false beliefs. Thirdly, I don’t think a mechanism which allowed for total self-deception would even be compatible with rationality.
Even if the AI can modify its code, it can’t really do anything that wasn’t entailed by its original programming.
(Ok, it could have a security vulnerability that allowed the execution of externally-injected malicious code, but that is a general issue of all computer systems with an external digital connection)
The hard part is predicting everything that was entailed by its initial programming and making sure it’s all safe.
That’s right; the history of engineering tells us that “provably safe” and “provably secure” systems fail in unanticipated ways.
If it’s a self-modifying AI, the main problem is that it keeps changing. You might find the memory position that corresponds to, say, expected number of paperclips. When you look at it next week wondering how many paperclips there are, it’s changed to staples, and you have no good way of knowing.
If it’s not a self-modifying AI, then I suspect it would be pretty easy. If it used Solomonoff induction, it would be trivial. If not, you are likely to run into problems with structures that only approximate Bayesian reasoning. For example, if you let it develop its own hanging nodes, you’d have a hard time figuring out what they correspond to. They might not even correspond to something you could feasibly understand. If there’s a big enough structure of them, it might even change.
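(A sketch of the monitoring problem in the self-modifying case described above; all internals here are hypothetical.)

```python
# Hypothetical internals, for illustration only.

class AgentV1:
    def __init__(self):
        self.expected_paperclips = 1000      # overseers learn to watch this field

class AgentV2:                               # the "same" agent after rewriting itself
    def __init__(self):
        # Same quantity, new representation: packed into an opaque value store.
        self.values = {"obj_17": 1000}       # paperclips? staples? the label is gone

def monitor(agent) -> int | None:
    # The overseers' probe, written against last week's memory layout.
    return getattr(agent, "expected_paperclips", None)

print(monitor(AgentV1()))   # 1000
print(monitor(AgentV2()))   # None -- the handle we were relying on no longer exists
```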
This is a reason it would be extremely difficult. Yet I feel the remaining existential risk should outweigh that.
It seems to me reasonably likely that our first version of FAI would go wrong. Human values are extremely difficult to understand because they’re spaghetti mush, and they often contradict each other and interact in bizarre ways. Reconciling that in a self-consistent and logical fashion would be very difficult to do. Coding a program to do that would be even harder. We don’t really seem to have made any real progress on FAI thus far, so I think this level of skepticism is warranted.
I’m proposing multiple alternative tracks to safer AI, which should probably be used in conjunction with the best FAI we can manage. Some of these tracks are expensive and difficult, but others seem simpler. The interactions between the different tracks produce a sort of safety net where the successes of one check the failures of others, as I’ve had to show throughout this conversation again and again.
I’m willing to spend much more to keep the planet safe against a much lower level of existential risk than anyone else here, I think. That’s the only reason I can think of to explain why everyone keeps responding with objections that essentially boil down to “this would be difficult and expensive”. But the entire idea of AI is expensive, as is FAI, yet the costs are accepted easily in those cases. I don’t know why we shouldn’t just add another difficult project to our long list of difficult projects to tackle, given the stakes that we’re dealing with.
Most people on this site seem only to consider AI as a project to be completed in the next fifty or so years. I see it more as the most difficult task that’s ever been attempted in all humankind. I think it will take at least 200 years, even factoring in the idea that new technologies I can’t even imagine will be developed over that time. I think the most common perspective on the way we should approach AI is thus flawed, and rushed, compared to the stakes, which are millions of generations of human descendants. We’re approaching a problem that affects millions of future generations, and trying to fix it in half a generation with as cheap a budget as we think we can justify, and that seems like a really bad idea (possibly the worst idea ever) to me.
EY’s experiment is wholly irrelevant to this claim. Either you’re introducing irrelevant facts or morphing your position. I think you’re doing this without realizing it, and I think it’s probably due to motivated cognition (because morphing claims without noticing it correlates highly with motivated cognition in my experience). I really feel like we might have imposed a box-taboo on this site that is far too strong.
You keep misunderstanding what I’m saying over and over and over again and it’s really frustrating and a big time sink. I’m going to need to end this conversation if it keeps happening because the utility of it is going down dramatically with each repetition.
I’m not proposing a system where the AI doesn’t interact with the outside world. I’m proposing a system where the AI is only ever willing to use a few appendages to affect the outside world, as opposed to potentially dozens. This minimizes the degree of control that the AI has dramatically, which is a good thing.
This is not FAI either; it is an additional constraint that we should use when putting early FAIs into action. I’m not saying that we merge the AI’s values to the values of the gatekeeper; I have no idea where you keep pulling that idea from.
It’s possible that I’m misunderstanding you, but I don’t know how that would be true specifically, because many of your objections just seem totally irrelevant to me and I can’t understand what you’re getting at. It seems more likely that you’re just not used to the idea of this version of boxing so you just regurgitate generic arguments against boxing, or something. You’re also coming up with more obscure arguments as we go farther into this conversation. I don’t really know what’s going on at your end, but I’m just annoyed at this point.
I don’t even understand how this clashes with my position. I understand that smashing simple AIs together is a dumb idea, but I never proposed that ever. I’m proposing using this special system for early FAIs, and asking them very carefully some very specific questions, along with other questions, so that we can be safe. I don’t want this AI to have any direct power, or even super accurate input information.
Yes, obviously, this type of AI is a more limited AI. That’s the goal. Limiting our first attempt at FAI is a fantastic idea because existential risk is scary. We’ll get fewer benefits from the FAI, and it will take longer to get those benefits. But it will be a good idea, because it seems really likely to me that we could mess up FAI without even knowing it.
Sure, it will be hard to read the AI’s mind. I see no reason why we shouldn’t even try, though.
You say that the AI will build an AI that will build an AI. But then you immediately jump to assuming that this means the final AI would leap beyond human comprehension. AIs are not Gods, and we shouldn’t treat them like ones. If we could pause the AI and read its code, while slowing down its thought processes, and devoting lots of resources to the project (as we should do, no matter what), then reading its mind seems doable. We could also use earlier AIs to help us interpret the thoughts of later AIs, if necessary.
Reading its mind literally would guarantee that it couldn’t trick us. Why would we not choose to pursue this, even if it sorta seems like it might be expensive?
The problem is that the AI could use its appendages to create and use tools that are more powerful than the appendages themselves.
I’ve already addressed this: the AI would still be entirely dependent on its appendages, and that’s a major advantage. So long as we watch the appendages and act to check any actions by them that seem suspicious, the AI would remain weak. The AI isn’t magic, and it’s not even beyond the scope of human cunning if we limit its input data. Again, keep in mind that we’d watch the communications between the appendages and the AI as well, so we’d know immediately if it was trying to get them to make it any tools. The Gatekeepers wouldn’t exist in a vacuum; they would be watched over and countered by us.
I think this conversation has run its course as well, though I intend to pursue a side issue in PMs.