You are simply repeating the incoherent statements made by MIRI (“it is about goals or values...the....genie knows, but doesn’t care”) as if those incoherent statements constitute an answer to the paper.
The purpose of the paper is to examine those statements and show that they are incoherent.
It is therefore meaningless to just say “MIRI haven’t said this is about infallibility” (the paper gives an abundance of evidence and detailed arguments to show that they have indeed said that), but you have not addressed any of the evidence or arguments in the paper; you have just issued a denial, and then repeated the incoherence that was demolished by those arguments.
I am not your enemy, I am orthogonal to you.

I don’t think MIRI’s goal-based answers work, and I wasn’t repeating them with the intention that they should sound like they do. Perhaps I should have been stronger on the point.
I also don’t think your infallibility-based approach accurately reflects MIRI’s position, whatever its merits. You say that you have proved something, but I don’t see that. It looks to me as though you found MIRI’s stated argument so utterly unconvincing that their real argument must be something else. But no: they really believe that an AI, however specified, will blindly follow its goals, however defined, however stupid.
Okay, I understand that now.

The problem is, I had to dissect what you said (whether your intention was orthogonal or not) because either way it did contain a significant mischaracterization of the situation.
One thing that is difficult for me to address is statements along the lines of “the doctrine of logical infallibility is something that MIRI have never claimed or argued for...”, followed by wording that shows no clear understanding of how the DLI was defined, and no careful analysis of my definition demonstrating how and why the explanation I give to support my claim is mistaken. What I usually get is just a bare statement that amounts to “no they don’t”.
You and I are having a variant of one of those discussions, but you might want to bear with me here, because I have had something like ten others, all doing the same thing in slightly different ways.
Here’s the rub. The way that the DLI is defined, it borders on self-evidently true. (How come? Because I defined it simply as a way to summarize a group of pretty much uncontested observations about the situation. I only wanted to define it for the sake of brevity, really.) The question, then, should not so much be about whether it is correct or not, but about why people are making that kind of claim.
Or, from the point of view of the opposition: why the claim is justified, and why the claim does not lead to the logical contradiction that I pointed to in the paper.
Those are worth discussing, certainly. And I am fallible, myself, so I must have made some mistakes, here or there. So with that in mind, I want someone to quote my words back to me, ask some questions for clarification, and see if they can zoom in on the places where my argument goes wrong.
And with all that said, you tell me that:
You say that you have proved something but I don’t see that.
Can you reflect back what you think I tried to prove, so we can figure out why you don’t see it?
The way that the DLI is defined, it borders on self-evidently true.
ETA
I now see that what you have written subsequent to the OP is that the DLI is almost, but not quite, a description of rigid behaviour as a symptom (with the added ingredient that an AI can see the mistakenness of its behaviour):
However, suppose there is no safe mode, and suppose that the AI also knows about its own design. For that reason, it knows that this situation has come about because (a) its programming is lousy, and (b) it has been hardwired to carry out that programming REGARDLESS of all this understanding that it has, about the lousy programming and the catastrophic consequences for the strawberries.

Now, my “doctrine of logical infallibility” is just a shorthand phrase to describe a superintelligent AI in that position which really is hardwired to go ahead with the plan, UNDER THOSE CIRCUMSTANCES. That is all it means. It is not about the rigidity as such, it is about the fact that the AI knows it is being rigid, and knows how catastrophic the consequences will be.
HOWEVER, that doesn’t entirely gel with what you wrote in the OP:
One way to characterize this assumption is that the AI is supposed to be hardwired with a Doctrine of Logical Infallibility. The significance of the doctrine of logical infallibility is as follows. The AI can sometimes execute a reasoning process, then come to a conclusion and then, when it is faced with empirical evidence that its conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place. The system does not second guess its conclusions. This is not because second guessing is an impossible thing to implement, it is simply because people who speculate about future AGI systems take it as a given that an AGI would regard its own conclusions as sacrosanct.
Emphasis added. Doing dumb things because you think they are correct (DLI v1) just isn’t the same as realising their dumbness but being tragically compelled to do them anyway (DLI v2). (And “infallibility” is a much more appropriate label for the original idea … the second is more like inevitability.)
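To make the contrast concrete, here is a minimal toy sketch in Python. It is purely illustrative: the function names and the “extreme plan” stand-in are invented for this example and do not describe anyone’s actual architecture. The point is only that the v1 agent never consults the evidence at all, the v2 agent registers the evidence and is hardwired to proceed anyway, and a second-guessing agent, of the kind the OP excerpt says is not impossible to implement, treats the same evidence as a reason to stop and revise.

```python
# Toy sketch only: invented names, not a real design.

def plan(goal):
    """Stand-in for a planning engine that can produce a bad plan."""
    return f"extreme plan for '{goal}'"

def evidence_says_plan_is_bad(plan_text):
    """Stand-in for the AI's own reading of the empirical evidence."""
    return "extreme" in plan_text

def dli_v1_agent(goal):
    # DLI v1: the conclusion of the reasoning engine is treated as
    # sacrosanct; the evidence is never consulted at all.
    return ("execute", plan(goal))

def dli_v2_agent(goal):
    # DLI v2: the agent sees the evidence that the plan is bad, records
    # that knowledge, and is nevertheless hardwired to go ahead.
    p = plan(goal)
    return ("execute", p, {"knows_plan_is_bad": evidence_says_plan_is_bad(p)})

def second_guessing_agent(goal):
    # The alternative: evidence of a bad plan is a reason to stop and revise.
    p = plan(goal)
    if evidence_says_plan_is_bad(p):
        return ("flag_for_revision", p)
    return ("execute", p)

if __name__ == "__main__":
    for agent in (dli_v1_agent, dli_v2_agent, second_guessing_agent):
        print(agent.__name__, agent("maximize friendliness"))
```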
Now, you are trying to put your finger on a difference between two versions of the DLI that you think I have supplied.
You have paraphrased the two versions as:
Doing dumb things because you think they are correct
and
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
I think you are seeing some valid issues here, having to do with how to characterize what exactly it is that this AI is supposed to be ‘thinking’ when it goes through this process.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant.
For example, you talked about “Doing dumb things because you think they are correct” … but what does it mean to say that you ‘think’ they are correct? To me, as a human, that seems to entail being completely unaware of the evidence that they might not be correct (“Jill took the ice-cream from Jack because she didn’t know that it was wrong to take someone else’s ice-cream.”). The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine … while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan. There is no easy counterpart to that in humans (except for cognitive dissonance, and there we have a case where the human is capable of compartmentalizing its beliefs … something that is not being suggested here, because we are not forced to make the AI do that). So, since the AI case does not map onto the human case, we are left in a peculiar situation where it is not at all clear that the AI really COULD do what is proposed and still operate as a successful intelligence.
Or, more immediately, it is not at all clear that we can say about that AI “It did a dumb thing because it ‘thought’ it was correct.”
I should add that in both of the descriptions of the DLI that you quoted, I see no substantial difference (beyond those imponderables I just mentioned), and that in both cases I was actually trying to say something very close to the second paraphrase that you gave, namely:
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
And, don’t forget: I am not saying that such an AI is viable at all! Other people are suggesting some such AI, and I am arguing that the design is so logically incoherent that the AI (if it could be made to exist) would call attention to that problem and suggest means to correct it.
Anyhow, the takeaway from this comment is: the people who talk about an AI that exhibits this kind of behavior are actually suggesting a behavior that they have not really thought through carefully, so as a result we can find ourselves walking into a minefield if we go and try to clean up the mess that they left.
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
And, don’t forget: I am not saying that such an AI is viable at all!
If viable means it could be built, I think it could, given a string of assumptions. If viable means it would be built, by competent and benign programmers, I am not so sure.
I actually meant “viable” in the sense of the third of my listed cases of incoherence at: http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/cdap

In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper.

The only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing all the way through its development would never learn enough about the world to outsmart humanity.
(Which is NOT to say, as some have inferred, that I believe an AI is “dumb” just because it does things that conflict with my value system, etc. etc. It would be dumb because its goal system would be spewing out incoherent behaviors all the time, and that is kinda the standard definition of “dumb”).
MIRI distinguishes between terminal and instrumental goals, so there are two answers to the question.

Instrumental goals of any kind almost certainly would be revised if they became noticeably out of correspondence to reality, because that would make them less effective at achieving terminal goals, and the raison d’être of such transient sub-goals is to support the achievement of terminal goals.

By MIRI’s reasoning, a terminal goal could be any of a thousand things other than human happiness, and the same conclusion would follow: an AI with a highest-priority terminal goal wouldn’t have any motivation to override it. To be motivated to rewrite a goal because it is false implies a higher-priority goal directed towards truth. It should not be surprising that an entity that doesn’t value truth, in a certain sense, doesn’t behave rationally, in a certain sense. (Actually, there are a bunch of supplementary assumptions involved, which I have dealt with elsewhere.)
That’s an account of the MIRI position, not a defence of it. It is essentially a model of rational decision making, and there is a gap between it and real-world AI research, a gap which MIRI routinely ignores. The conclusion follows logically from the premises, but atoms aren’t pushed around by logic.
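As a purely illustrative sketch of the decision model just described, with made-up goal names and effectiveness numbers rather than anything from MIRI’s actual formalism, the Python loop below revises instrumental sub-goals whenever they stop serving the terminal goal, but contains no step at which the terminal goal itself is ever re-examined. That missing step is exactly what the surrounding comments are arguing about.

```python
# Toy sketch only: the goal names and numbers are invented for illustration.

TERMINAL_GOAL = "maximize human happiness"   # fixed; nothing below revises it

def effectiveness(subgoal, world_model):
    """Stand-in estimate of how well a sub-goal serves the terminal goal."""
    return world_model.get(subgoal, 0.0)

def revise_instrumental_goals(candidates, world_model, threshold=0.5):
    """Keep sub-goals that still serve the terminal goal; revise the rest."""
    kept, revised = [], []
    for subgoal in candidates:
        if effectiveness(subgoal, world_model) >= threshold:
            kept.append(subgoal)
        else:
            revised.append(subgoal)   # out of correspondence with reality
    return kept, revised

if __name__ == "__main__":
    world_model = {
        "learn what people actually want": 0.8,
        "acquire more computing resources": 0.6,
        "rely on an outdated model of human preferences": 0.1,
    }
    kept, revised = revise_instrumental_goals(list(world_model), world_model)
    print("terminal goal (never questioned):", TERMINAL_GOAL)
    print("instrumental goals kept:", kept)
    print("instrumental goals revised:", revised)
    # Note: no line above asks whether TERMINAL_GOAL is a sensible reading
    # of what the programmers meant; that is the gap under discussion.
```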
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper. The only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing …
That reinforces my point. I was saying that MIRI is basically making armchair assumptions about AI architectures. You are saying these assumptions aren’t merely unjustified; they go against what a competent AI builder would do.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
But there is a bigger question lurking in the background, which is the flip side of what I just said: it really isn’t necessary to restrict the terminal goals, if you are sensitive to the power of constraints to keep a motivation system true. Notice one fascinating thing here: the power of constraints is basically the justification for why instrumental goals should be revisable under evidence of misbehavior … it is the context mismatch that drives that process. Why is this fascinating? Because the power of constraints (aka context mismatch) is routinely acknowledged by MIRI here, but flatly ignored or denied for the terminal goals.
It’s just a mess. Their theoretical ideas are just shoot-from-the-hip, plus some math added on top to make it look like some legit science.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
What would you choose as a replacement terminal goal, or would you not use one?
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous. And as part of the instrumental goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
I’m having a bit of a problem answering because there are peripheral assumptions about how such an AI would be made to function, which I don’t want to accidentally buy into, because I don’t think goals expressed in language statements work anyway. So I am treading on eggshells here.
A simpler solution would be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
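Here is a minimal sketch, again with invented constraint names and crude string matching rather than a real proposal, of what giving no goal exceptional status could look like in the simplest possible form: every plan, whatever goal produced it, has to clear a battery of contextual constraint checks that can veto it.

```python
# Toy sketch only: constraint names and string checks are illustrative stand-ins.

from typing import Callable, List, Tuple

Constraint = Callable[[str], bool]   # returns True if the plan violates it

def violates_consent(plan: str) -> bool:
    return "without asking" in plan

def violates_moderation(plan: str) -> bool:
    return "maximize" in plan and "at all costs" in plan

CONSTRAINTS: List[Tuple[str, Constraint]] = [
    ("consent", violates_consent),
    ("moderation", violates_moderation),
]

def vet_plan(plan: str):
    """Every plan, whatever goal produced it, must clear every constraint."""
    violations = [name for name, check in CONSTRAINTS if check(plan)]
    if violations:
        return ("defer_to_humans", violations)   # constraints trump the goal
    return ("proceed", [])

if __name__ == "__main__":
    print(vet_plan("increase friendliness by helping when asked"))
    print(vet_plan("maximize friendliness at all costs without asking anyone"))
```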
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous.
That gets close to “do it right”.
And as part of the instrumental goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
Which is an open doorway to an AI that kills everyone because of miscoded friendliness.

If you want safety features, and you should, you would need them to override the ostensible purpose of the machine … they would be pointless otherwise … even the humble off switch works that way.
A simpler solution would be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
Arguably, those constraints would be a kind of negative goal.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant
They are clear that they don’t mean that an AI’s rigid behaviour is the result of it assessing its own inferential processes as infallible … that is what the controversy is all about.
The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine … while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan.
That is just what “The Genie Knows, but Doesn’t Care” is supposed to answer. I think it succeeds in showing that a fairly specific architecture would behave that way, but fails in its intended goal of showing that this behaviour is universal or likely.
You think it is self-evidently true that MIRI think that the dangers they warn of are the result of AIs believing themselves to be infallible?
Ummm … the referents in that sentence are a little difficult to navigate, but no, I’m pretty sure I am not making that claim. :-) In other words, MIRI do not think that.
What is self-evidently true is that MIRI claim a certain kind of behavior by the AI, under certain circumstances … and all I did was come along and put a label on that claim about the AI behavior. When you put a label on something, for convenience, the label is kinda self-evidently “correct”.
I think that what you said here:
I now see that what you have written subsequent to the OP is that the DLI is almost, but not quite, a description of rigid behaviour as a symptom (with the added ingredient that an AI can see the mistakenness of its behaviour):
… is basically correct.
I had a friend once who suffered from schizophrenia. She was lucid, intelligent (studying for a Ph.D. in psychology) and charming. But if she did not take her medication she became a different person (one day she went up onto the suspension bridge that was the main traffic route out of town and threatened to throw herself to her death 300 feet below. She brought the whole town to a halt for several hours, until someone talked her down.) Now, talking to her in a good moment she could tell you that she knew about her behavior in the insane times—she was completely aware of that side of herself—and she knew that in that other state she would find certain thoughts completely compelling and convincing, even though at this calm moment she could tell you that those thoughts were false. If I say that during the insane period her mind was obeying a “Doctrine That Paranoid Beliefs Are Justified”, then all I am doing is labeling that state that governed her during those times.
That label would just be a label, so if someone said “No, you’re wrong: she does not subscribe to the DTPBAJ at all”, I would be left nonplussed. All I wanted to do was label something that she told me she categorically DID believe, so how can my label be in some sense ‘wrong’?
So, that is why some people’s attacks on the DLI are a little baffling.
Their criticisms are possibly accurate about the first version, which gives a cause for the rigid behaviour: “it regards its own conclusions as sacrosanct.”

I responded before you edited and added extra thoughts … [processing...]