So, here’s my pet theory for AI that I’d love to put out of its misery: “Don’t do anything your designer wouldn’t approve of”. It’s loosely based on the “Gandhi wouldn’t take a pill that would turn him into a murderer” principle.
A possible implementation: Make an emulation of the designer and use it as an isolated component of the AI. Any plan of action has to be submitted to this component for approval before being implemented. This is nicely recursive and rejects plans such as “make a plan of action so deceptively complex that my designer will mistakenly approve it” and “modify my designer so that they approve what I want them to approve”.
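To make the recursion concrete, here is a minimal Python sketch of the approval gate, assuming hypothetical DesignerEmulation and plan objects; none of these names are a real API, and the emulation itself is left as a stub:

```python
class DesignerEmulation:
    """Isolated stand-in for the designer; the real proposal would emulate them."""
    def approves(self, plan) -> bool:
        raise NotImplementedError  # stub: the emulation's judgement goes here


class ApprovalGatedAI:
    def __init__(self, designer: DesignerEmulation):
        self._designer = designer  # isolated component the rest of the AI cannot alter

    def execute(self, plan) -> None:
        # Every plan goes through the same gate, including plans to obfuscate
        # future plans or to modify the designer emulation itself.
        if not self._designer.approves(plan):
            raise PermissionError("plan rejected by the designer emulation")
        plan.run()  # hypothetical: a plan object that knows how to carry itself out
```

The load-bearing assumption is the “isolated” part: nothing in the sketch itself prevents the rest of the AI from routing around or rewriting the emulation, which is what several of the comments below poke at.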
There could be an argument about how the designer’s emulation would feel in this situation, but… torture vs. dust specks! Also, is this a corrupted version of ?
You flick the switch, and find out that you are a component of the AI, now doomed to an unhappy eternity of answering stupid questions from the rest of the AI.
This is a problem. But if this is the only problem, then it is significantly better than a paperclip universe.
I’m sure the designer would approve of being modified to enjoy answering stupid questions. The designer might also approve of being cloned for the purpose of answering one question, and then being destroyed.
Unfortunately, it turns out that you’re Stalin. Sounds like 1-person CEV.
That is, or requires, a pretty fundamental change. How can you be sure it’s value-preserving?
I had assumed that a new copy of the designer would be spawned for each decision, and shut down afterwards.
Although thinking about it, that might just doom you to a subjective eternity of listening to the AI explain what it’s done so far, in the anticipation that it’s going to ask you a question at some point.
You’d need a good theory of ems, consciousness and subjective probability to have any idea what you’d subjectively experience.
The AI wishes to make ten thousand tiny changes to the world, individually innocuous, but some combination of which add up to catastrophe. To submit its plan to a human, it would need to distill the list of predicted consequences down to its human-comprehensible essentials. The AI that understands which details are morally salient is one that doesn’t need the oversight.
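A tiny sketch of the circularity being pointed at here, with a hypothetical is_morally_salient predicate standing in for the judgement in question:

```python
def summarize_for_approval(predicted_consequences, is_morally_salient):
    # To shrink ten thousand predicted consequences into a human-readable digest,
    # the AI must already judge which of them matter morally, and that is the
    # very judgement the human approver was supposed to supply.
    return [c for c in predicted_consequences if is_morally_salient(c)]
```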
That’s quite non-obvious to me; it seems like a rather arbitrary claim.
You’re basically saying that if an intelligent mind (A, for Alice) knows that a person (B, for Bob) will care about a certain consequence C, then A will definitely know how much B will care about it.
This isn’t the case for real human minds. If Alice is a human mechanic and tells Bob, “I can fix your car, but it’ll cost $200”, then Alice knows that Bob will care about the cost, but doesn’t know how much Bob will care, or whether Bob prefers to have a fixed car or to keep the $200.
So if your claim doesn’t even hold for human minds, why do you think it applies to non-human minds?
And even if it does hold, what about the case where Alice doesn’t know whether a detail is morally salient, but errs on the side of caution? E.g. Alice the waitress asks Bob the customer, “The chocolate ice cream you asked for also has some crushed peanuts in it. Is that okay?”, and Bob can respond “Of course, why should I care about that?” or alternatively “It’s not okay, I’m allergic to peanuts!”
In this case Alice the waitress doesn’t know if the detail is salient to Bob, but asks just to make sure.
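A minimal sketch of that “err on the side of asking” rule, with made-up salience and uncertainty estimates standing in for whatever Alice actually knows; illustrative only:

```python
def should_ask(estimated_salience: float, uncertainty: float,
               threshold: float = 0.1) -> bool:
    # Ask whenever the detail might matter, or whenever we are not confident
    # that it does not matter: the waitress errs on the side of asking.
    return estimated_salience > threshold or uncertainty > threshold
```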
This is good, and I have no valid response at this time. Will try to think more about it later.
If the AI is designed to follow the principle by the letter, it has to request approval from the designer even for the action of requesting approval, leaving the AI incapable of action. If the AI is designed to be able to make certain exemptions, it will figure out a way to modify the designer without needing approval for this modification.
How about making ‘ask for approval’ the only pre-approved action?
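A minimal sketch of that exemption, with hypothetical names: the single pre-approved action is the act of requesting approval itself, which is what terminates the regress.

```python
ASK_FOR_APPROVAL = "ask_for_approval"  # the single pre-approved action

def may_execute(action, designer_approves) -> bool:
    if action == ASK_FOR_APPROVAL:
        return True                      # exempt, so the regress terminates here
    return designer_approves(action)     # everything else still needs the gate
```

This removes the regress for the asking step, though, as the next comment notes, it leaves the content of what is asked unconstrained.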
The AI may stumble upon a plan which contains a sequence of words that hacks the approver’s mind, making him approve pretty much anything. Such plans may even be easier for the AI to generate than plans for saving the world, seeing as Eliezer has won some AI-box experiments but hasn’t yet solved world hunger.
You mean accidentally stumble upon such a sequence of words? Because purposefully building one would certainly not be approved.
Um, does the approver also have to approve each step of the computation that builds the plan to be submitted for approval? Isn’t this infinite regress?
Consider “Ask for approval” as an auto-approved action. Not sure if that solves it, will give this a little more thought.
The weak link is “plan of action.” What counts as a plan of action? How will you structure the AI so that it knows what a plan is and when to submit it for approval?
The AI accidentally does something dangerous because the plan is confusing to the designer.
Yeah, this is the plan’s weakness. But what stops such an issue occurring today?
I think the main difference is that, ideally, people would confirm the rules by which plans are made, rather than the specific details of the plan.
Hopefully the rules would be more understandable.
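A rough sketch of what rule-level approval might look like, assuming hypothetical rule objects and an up-front designer_approves_rule check; this is only an illustration of the idea, not a worked-out design:

```python
def build_planner(candidate_rules, designer_approves_rule):
    # The designer reviews the planning rules once, up front,
    # rather than the specific details of every plan.
    approved_rules = [r for r in candidate_rules if designer_approves_rule(r)]

    def plan(goal):
        # Plans are then generated only from the approved rules.
        return [rule(goal) for rule in approved_rules]

    return plan
```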
The AI doesn’t do anything.