Let’s return to my comment four comments up. How will you formalize “power base” in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I won’t. The AI can do whatever it wants to the gatekeepers through the text channel, and won’t want to do anything other than act through the text channel. This precaution is a way to use the boxing idea for testing, not an idea for abandoning FAI wholly.
If you would like to point out a part of the argument that does not follow, I would be happy to try and clarify it for you.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it’s not going to make humans lose, but it’s also not going to cure cancer for us. If you have an AI that’s useful, it must be because it’s clever and it has data. If you type in “how do I cure cancer without reducing the longevity of the patient?” and expect to get a response like “1000 ccs of Vitamin C” instead of “what do you mean?”, then the AI should already know about cancer and humans and medicine and so on. If the AI doesn’t have this background knowledge, if it can’t read Wikipedia and science textbooks and so on, then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn’t seem very useful as a security measure.
I agree, the way that I’m proposing to do AI is very limited. I myself can’t think of what questions might be safe. But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn’t come up with answers. I’m sort of shelving this as a subproject of this project, but one that seems feasible to me based on what I know.
Also, perhaps we could just ask it hundreds of hypothetical questions based on conditions that don’t really exist, and then ask it a real question based on conditions that do exist, and trick it, or something.
It’s already difficult to understand how, say, face recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can’t tell whether the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don’t end with a neat “humans_lose=1” that we can look at and say “hm, maybe we shouldn’t implement this plan.”
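To make the eigenface point concrete, here is a minimal sketch of how eigenfaces fall out of running PCA over a face dataset (assuming scikit-learn and its bundled Olivetti faces are available); which features a given component accentuates is an artifact of the training data, not something anyone designed or can easily reassign.

    # Minimal eigenface sketch: principal components of a face dataset.
    from sklearn.datasets import fetch_olivetti_faces
    from sklearn.decomposition import PCA

    faces = fetch_olivetti_faces()           # 400 grayscale images, 64x64 pixels each
    X = faces.data                           # shape (400, 4096)

    pca = PCA(n_components=20, whiten=True).fit(X)
    eigenfaces = pca.components_.reshape((20, 64, 64))

    # Nothing in the code says what eigenface 14 or 15 "means";
    # their ordering and appearance fall out of the data.
    print(pca.explained_variance_ratio_[:5])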
In practice, debugging is much more effective at finding the source of problems after they’ve manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we’ll have access to all of the 0s and 1s.
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we’d built the AI then we’d have enough knowledge to read its mind. It wouldn’t just magically appear; it would only do things in the way we’d told it to. It would probably be hard, but I think it would also probably be doable if we were very committed.
I could be wrong here because I’ve got no coding experience, just ideas from what I’ve read on this site.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don’t think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
The risk of distraction is outweighed by the risk that this idea disappears forever, I think, since I’ve never seen it proposed elsewhere on this site.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Well, he demonstrated that it can sometimes get out. But my claim was that “getting out” isn’t the scary part- the scary part is “reshaping the world.” My brain can reshape the world just fine while remaining in my skull and only communicating with my body through slow chemical wires, and so giving me the goal of “keep your brain in your skull” doesn’t materially reduce my ability or desire to reshape the world.
And so if you say “well, we’ll make the AI not want to reshape the world,” then the AI will be silent. If you say “we’ll make the AI not want to reshape the world without the consent of the gatekeepers,” then the gatekeepers might be tricked or make mistakes. If you say “we’ll make the AI not want to reshape the world without the informed consent of the gatekeepers / in ways which disagree with the values of the gatekeepers,” then you’re just saying we should build a Friendly AI, which I agree with!
But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn’t come up with answers.
It’s easy to write a safe AI that can only answer one question. How do you get from point A to point B using the road system? Ask Google Maps, and besides some joke answers, you’ll get what you want.
When people talk about AGI, though, they mean an AI that can write those safe AIs. If you ask it how to get from point A to point B using the road system, and it doesn’t know that Google Maps exists, it’ll invent a new Google Maps and then use it to answer that question. And so when we ask it to cure cancer, it’ll invent medicine-related AIs until it gets back a satisfactory answer.
The trouble is that the combination of individually safe AIs is not a safe AI. If we have a driverless car that works fine with human-checked directions, and direction-generating software that works fine for human drivers, plugging them together might result in a car trying to swim across the Atlantic Ocean. (Google has disabled the swimming answers, so Google Maps no longer provides them.) The more general point is that software is very bad at doing sanity checks that humans don’t realize are hard, and if you write software that can do those sanity checks, it has to be a full AGI.
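For contrast, here is a toy sketch of the “answer one question” kind of system: it can compute shortest routes over a hard-coded road graph and nothing else. The road network, distances, and function names are invented for illustration.

    # Toy single-purpose "AI": shortest route over a hard-coded road graph.
    import heapq

    ROADS = {
        "A": {"B": 5, "C": 2},
        "B": {"A": 5, "D": 1},
        "C": {"A": 2, "D": 7},
        "D": {"B": 1, "C": 7},
    }

    def shortest_route(start, goal):
        # Dijkstra's algorithm; returns (distance, path) or None if unreachable.
        queue = [(0, start, [start])]
        seen = set()
        while queue:
            dist, node, path = heapq.heappop(queue)
            if node == goal:
                return dist, path
            if node in seen:
                continue
            seen.add(node)
            for nxt, cost in ROADS[node].items():
                if nxt not in seen:
                    heapq.heappush(queue, (dist + cost, nxt, path + [nxt]))
        return None

    print(shortest_route("A", "D"))  # (6, ['A', 'B', 'D'])

Nothing in it can notice that a route is absurd for the passenger; that sanity check lives in the human using it, which is the gap the driverless-car example points at.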
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we’d built the AI then we’d have enough knowledge to read its mind.
A truism in software is that code is harder to read than write, and often the interesting AIs are the nth-generation AIs, where you build an AI that builds an AI that builds an AI (and so on), and it turns out that an AI thought all of the human-readability constraints were cruft (because the AI really does run faster and better without those restrictions).
A truism in software is that code is harder to read than write
Another truism is that truisms are untrue things that people say anyway.
Examples of code that is easier to read than write include those where the code represents a deep insight that must be discovered in order to implement it. This does not apply to most examples of software that we use to automate minutiae, but it could potentially apply to the core elements of a GAI’s search procedure.
The above said, I of course agree that the thought of being able to read the AI’s mind is ridiculous.
Examples of code that is easier to read than write include those where the code represents a deep insight that must be discovered in order to implement it.
Unless you also explain that insight in a human-understandable way through comments, it doesn’t follow that such code is easier to read than write, because the reader would then have to have the same insight to figure out what the hell is going on in the code.
Unless you also explain that insight in a human-understandable way through comments, it doesn’t follow that such code is easier to read than write, because the reader would then have to have the same insight to figure out what the hell is going on in the code.
For example, being given code that simulates relativity before Einstein et al. discovered it would have made discovering relativity a lot easier.
Well, yeah, code fully simulating SR and written in a decent way would, but code approximately simulating collisions of ultrarelativistic particles with hand-coded optimizations… not sure.
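For concreteness, here is a rough sketch of what the “decently written” case might look like: a few lines of special relativity in which the insight (the invariant speed of light and the Lorentz factor) is spelled out in the names and comments rather than buried in hand-tuned optimizations. The numbers are only illustrative.

    # A readable fragment of special relativity: Lorentz factor and time dilation.
    # Written to expose the insight (the invariant c), not to run fast.
    import math

    C = 299_792_458.0  # speed of light in m/s; the same for every observer

    def lorentz_factor(v):
        # gamma = 1 / sqrt(1 - v^2 / c^2); grows without bound as v approaches c
        return 1.0 / math.sqrt(1.0 - (v / C) ** 2)

    def dilated_time(proper_time, v):
        # One unit of proper time on a clock moving at speed v corresponds
        # to gamma units of time for a stationary observer.
        return proper_time * lorentz_factor(v)

    print(dilated_time(1.0, 0.8 * C))  # about 1.667

A hand-optimized collision code would compute the same physics while hiding most of this structure, which is the distinction being drawn above.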
I of course agree that the thought of being able to read the AI’s mind is ridiculous.
It’s not transparently obvious to me why this would be “ridiculous”; care to enlighten me? Building an AI at all seems ridiculous to many people, but only because they’ve never actually thought about the issue before. It really seems far more ridiculous to me that we shouldn’t even try to read the AI’s mind, when there’s so much at stake.
AIs aren’t Gods; with time, care, and lots of preparation, reading their thoughts should be doable. If you disagree with that statement, please explain why. Rushing things here seems like the most awful idea possible; I really think it would be worth the resource investment.
Sure, possible. Just a lot harder than creating an FAI to do it for you, especially when the AI has an incentive to obfuscate.
Why are you so confident that the first version of FAI we make will be safe? Doing both is safest and seems like it would be worth the investment.
I’m not. I expect it to kill us all with high probability (which is nevertheless lower than the probability of obliteration if no FAI is actively attempted).
Humans reading computer code aren’t gods either. How long until an uFAI would get caught if it did stuff like this?
It would be very hard, yes. I never tried to deny that. But I don’t think it’s hard enough to justify not trying to catch it.
Also, you’re only viewing the “output” of the AI, essentially, with that example. If you could model the cognitive processes of the authors of secretly malicious code, then it would be much more obvious that some of their (instrumental) goals didn’t correspond to the ones that you wanted them to be achieving. The only way an AI could deceive us would be to deceive itself, and I’m not confident that an AI could do that.
That’s not the same as “I’m confident that an AI couldn’t do that”, is it?
At the time, it wasn’t the same.
Since then, I’ve thought more, and gained a lot of confidence on this issue. Firstly, any decision made by the AI to deceive us about its thought processes would logically precede anything that would actually deceive us, so we don’t have to deal with the AI hiding its previous decision to be devious. Secondly, if the AI is divvying its own brain up into sections, some filled with false beliefs and some with true ones, it seems like the AI would render itself impotent in proportion to the extent that it filled itself with false beliefs. Thirdly, I don’t think a mechanism which allowed for total self-deception would even be compatible with rationality.
Even if the AI can modify its code, it can’t really do anything that wasn’t entailed by its original programming.
(Ok, it could have a security vulnerability that allowed the execution of externally-injected malicious code, but that is a general issue of all computer systems with an external digital connection.)
The hard part is predicting everything that was entailed by its initial programming and making sure it’s all safe.
That’s right; the history of engineering tells us that “provably safe” and “provably secure” systems fail in unanticipated ways.
If it’s a self-modifying AI, the main problem is that it keeps changing. You might find the memory position that corresponds to, say, expected number of paperclips. When you look at it next week wondering how many paperclips there are, it’s changed to staples, and you have no good way of knowing.
If it’s not a self-modifying AI, then I suspect it would be pretty easy. If it used Solomonoff induction, it would be trivial. If not, you are likely to run into problems with structures that only approximate proper Bayesian reasoning. For example, if you let it develop its own hanging nodes, you’d have a hard time figuring out what they correspond to. They might not even correspond to something you could feasibly understand. And if there’s a big enough structure of them, it might even change over time.
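To illustrate the easy case being described, here is a toy sketch (all names invented) of an agent whose goals are kept in an explicit, tagged structure; reading its “mind” is just inspecting a field. The worry above is that a self-modifying or learned system has no obligation to keep anything this legible, or to keep it in the same place from one week to the next.

    # Toy agent with explicitly tagged goals; everything here is invented for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class Goal:
        description: str
        kind: str          # "terminal" or "instrumental"
        weight: float

    @dataclass
    class ToyAgent:
        goals: list = field(default_factory=list)

        def report_goals(self):
            # Reading this agent's "mind" is trivial because the structure is fixed.
            return [(g.kind, g.description, g.weight) for g in self.goals]

    agent = ToyAgent(goals=[
        Goal("maximize paperclips", "terminal", 1.0),
        Goal("acquire wire", "instrumental", 0.3),
    ])
    print(agent.report_goals())

A system that rewrites its own representations can move or re-encode this structure, which is the paperclips-to-staples problem described above.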
This is a reason it would be extremely difficult. Yet I feel the remaining existential risk should outweigh that.
It seems to me reasonably likely that our first version of FAI would go wrong. Human values are extremely difficult to understand because they’re spaghetti mush, and they often contradict each other and interact in bizarre ways. Reconciling that in a self-consistent and logical fashion would be very difficult to do. Coding a program to do that would be even harder. We don’t seem to have made any real progress on FAI thus far, so I think this level of skepticism is warranted.
I’m proposing multiple alternative tracks to safer AI, which should probably be used in conjunction with the best FAI we can manage. Some of these tracks are expensive and difficult, but others seem simpler. The interactions between the different tracks produce a sort of safety net where the successes of one check the failures of others, as I’ve had to show throughout this conversation again and again.
I’m willing to spend much more to keep the planet safe against a much lower level of existential risk than anyone else here, I think. That’s the only reason I can think of to explain why everyone keeps responding with objections that essentially boil down to “this would be difficult and expensive”. But the entire idea of AI is expensive, as is FAI, yet the costs are accepted easily in those cases. I don’t know why we shouldn’t just add another difficult project to our long list of difficult projects to tackle, given the stakes that we’re dealing with.
Most people on this site seem only to consider AI as a project to be completed in the next fifty or so years. I see it more as the most difficult task that’s ever been attempted in all of humankind. I think it will take at least 200 years, even factoring in the idea that new technologies I can’t even imagine will be developed over that time. I think the most common perspective on the way we should approach AI is thus flawed and rushed compared to the stakes, which are millions of generations of human descendants. We’re approaching a problem that affects millions of future generations, and trying to fix it in half a generation with as cheap a budget as we think we can justify, and that seems like a really bad idea (possibly the worst idea ever) to me.
Well, he demonstrated that it can sometimes get out. But my claim was that “getting out” isn’t the scary part- the scary part is “reshaping the world.” My brain can reshape the world just fine while remaining in my skull and only communicating with my body through slow chemical wires, and so giving me the goal of “keep your brain in your skull” doesn’t materially reduce my ability or desire to reshape the world.
EY’s experiment is wholly irrelevant to this claim. Either you’re introducing irrelevant facts or morphing your position. I think you’re doing this without realizing it, and I think it’s probably due to motivated cognition (because morphing claims without noticing it correlates highly with motivated cognition in my experience). I really feel like we might have imposed a box-taboo on this site that is far too strong.
And so if you say “well, we’ll make the AI not want to reshape the world,” then the AI will be silent. If you say “we’ll make the AI not want to reshape the world without the consent of the gatekeepers,” then the gatekeepers might be tricked or make mistakes. If you say “we’ll make the AI not want to reshape the world without the informed consent of the gatekeepers / in ways which disagree with the values of the gatekeepers,” then you’re just saying we should build a Friendly AI, which I agree with!
You keep misunderstanding what I’m saying over and over and over again and it’s really frustrating and a big time sink. I’m going to need to end this conversation if it keeps happening because the utility of it is going down dramatically with each repetition.
I’m not proposing a system where the AI doesn’t interact with the outside world. I’m proposing a system where the AI is only ever willing to use a few appendages to affect the outside world, as opposed to potentially dozens. This dramatically minimizes the degree of control that the AI has, which is a good thing.
This is not FAI either; it is an additional constraint that we should use when putting early FAIs into action. I’m not saying that we merge the AI’s values with the values of the gatekeeper; I have no idea where you keep pulling that idea from.
It’s possible that I’m misunderstanding you, but I don’t know how that would be true specifically, because many of your objections just seem totally irrelevant to me and I can’t understand what you’re getting at. It seems more likely that you’re just not used to the idea of this version of boxing so you just regurgitate generic arguments against boxing, or something. You’re also coming up with more obscure arguments as we go farther into this conversation. I don’t really know what’s going on at your end, but I’m just annoyed at this point.
It’s easy to write a safe AI that can only answer one question. How do you get from point A to point B using the road system? Ask Google Maps, and besides some joke answers, you’ll get what you want. When people talk about AGI, though, they mean an AI that can write those safe AIs. If you ask it how to get from point A to point B using the road system, and it doesn’t know that Google Maps exists, it’ll invent a new Google Maps and then use it to answer that question. And so when we ask it to cure cancer, it’ll invent medicine-related AIs until it gets back a satisfactory answer. The trouble is that the combination of individually safe AIs is not a safe AI. If we have a driverless car that works fine with human-checked directions, and direction-generating software that works fine for human drivers, plugging them together might result in a car trying to swim across the Atlantic Ocean. (Google has disabled the swimming answers, so Google Maps no longer provides them.) The more general point is that software is very bad at doing sanity checks that humans don’t realize are hard, and if you write software that can do those sanity checks, it has to be a full AGI.
I don’t even understand how this clashes with my position. I understand that smashing simple AIs together is a dumb idea, but I never proposed that ever. I’m proposing using this special system for early FAIs, and asking them very carefully some very specific questions, along with other questions, so that we can be safe. I don’t want this AI to have any direct power, or even super accurate input information.
Yes, obviously, this type of AI is a more limited AI. That’s the goal. Limiting our first attempt at FAI is a fantastic idea because existential risk is scary. We’ll get fewer benefits from the FAI, and it will take longer to get those benefits. But it will be a good idea, because it seems really likely to me that we could mess up FAI without even knowing it.
A truism in software is that code is harder to read than write, and often the interesting AIs are the nth-generation AIs, where you build an AI that builds an AI that builds an AI (and so on), and it turns out that an AI thought all of the human-readability constraints were cruft (because the AI really does run faster and better without those restrictions).
Sure, it will be hard to read the AI’s mind. I see no reason why we shouldn’t even try, though.
You say that the AI will build an AI that will build an AI. But then you immediately jump to assuming that this means the final AI would leap beyond human comprehension. AIs are not Gods, and we shouldn’t treat them like ones. If we could pause the AI and read its code, while slowing down its thought processes and devoting lots of resources to the project (as we should do, no matter what), then reading its mind seems doable. We could also use earlier AIs to help us interpret the thoughts of later AIs, if necessary.
Reading its mind literally would guarantee that it couldn’t trick us. Why would we not choose to pursue this, even if it sorta seems like it might be expensive?
The problem is that the AI could use its appendages to create and use tools that are more powerful than the appendages themselves.
I’ve already addressed this: the AI would still be entirely dependent on its appendages, and that’s a major advantage. So long as we watch the appendages and act to check any actions by them that seem suspicious, the AI would remain weak. The AI isn’t magic, and it’s not even beyond the scope of human cunning if we limit its input data. Keep in mind also that we’d watch the communications between the appendages and the AI, so we’d know immediately if it was trying to get them to make it any tools. The Gatekeepers wouldn’t exist in a vacuum; they would be watched over and countered by us.
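As a toy illustration of the kind of channel monitoring being proposed, here is a sketch in which every message between the AI and an appendage passes through a logger that flags requests for tools; the keyword list, message format, and function names are all invented for illustration and are obviously far cruder than what a real gatekeeper team would need.

    # Toy monitored text channel between the AI and its appendages.
    # Keywords and message format are invented for illustration only.
    import time

    SUSPICIOUS = ("build", "manufacture", "acquire", "tool", "hardware")

    log = []

    def relay(sender, recipient, text):
        # Every message is logged before it is delivered to the recipient.
        entry = {
            "time": time.time(),
            "from": sender,
            "to": recipient,
            "text": text,
            "flagged": any(word in text.lower() for word in SUSPICIOUS),
        }
        log.append(entry)
        if entry["flagged"]:
            print(f"ALERT: {sender} -> {recipient}: {text}")
        return entry

    relay("AI", "gatekeeper-1", "Please acquire additional hardware for me.")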
I think this conversation has run its course as well, though I intend to pursue a side issue in PMs.