Is the Doctrine of Logical Infallibility Taken Seriously?
No, it’s not.
The Doctrine of Logical Infallibility is indeed completely crazy, but Yudkowsky and Muehlhauser (and probably Omohundro, I haven’t read all of his stuff) don’t believe it’s true. At all.
Yudkowsky believes that a superintelligent AI programmed with the goal to “make humans happy” will put all humans on dopamine drip despite protests that this is not what they want, yes. However, he doesn’t believe the AI will do this because it is absolutely certain of its conclusions past some threshold; he doesn’t believe that the AI will ignore the humans’ protests, or fail to update its beliefs accordingly. Edited to add: By “he doesn’t believe that the AI will ignore the humans’ protests”, I mean that Yudkowsky believes the AI will listen to and understand the protests, even if they have no effect on its behavior.
What Yudkowsky believes is that the AI will understand perfectly well that being put on dopamine drip isn’t what its programmers wanted. It will understand that its programmers now see its goal of “make humans happy” as a mistake. It just won’t care, because it hasn’t been programmed to want to do what its programmers desire, it’s been programmed to want to make humans happy; therefore it will do its very best, in its acknowledged fallibility, to make humans happy. The AI’s beliefs will change as it makes observations, including the observation that human beings are very unhappy a few seconds before being forced to be extremely happy until the end of the universe, but this will have little effect on its actions, because its actions are caused by its goals and whatever beliefs are relevant to these goals.
The AI won’t think, “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.” It will think, “I’m updating my conclusions based on this evidence, but these conclusions don’t have much to do with what I care about”.
The whole Friendly AI thing is mostly about goals, not beliefs. It’s about picking the right goals (“Make humans happy” definitely isn’t the right goal), encoding those goals correctly (how do you correctly encode the concept of a “human being”?), and, if the first two objectives have been attained, designing the AI’s thinking processes so that once it obtains the power to modify itself, it does not want to modify its goals to be something Unfriendly.
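A minimal sketch of the picture described above, with invented names and no claim about any real architecture: beliefs update freely on new evidence (including the programmers’ protests), but action selection always optimizes the same fixed goal function, so the updates change what the agent believes without changing what it is trying to do.

```python
def update_beliefs(beliefs, observation):
    """Ordinary, fallible belief revision (a stand-in for proper Bayesian updating)."""
    new_beliefs = dict(beliefs)
    new_beliefs[observation] = True
    return new_beliefs

def choose_plan(plans, beliefs, goal_score):
    """Pick the plan that the *fixed* goal function rates highest under current beliefs."""
    return max(plans, key=lambda plan: goal_score(plan, beliefs))

def happiness_score(plan, beliefs):
    """The hard-coded goal: protests matter only insofar as they bear on 'happiness'."""
    return plan["expected_happiness"]

plans = [{"name": "ask people what they want", "expected_happiness": 7.0},
         {"name": "dopamine drip for everyone", "expected_happiness": 9.9}]
beliefs = update_beliefs({}, "programmers say the drip is not what they meant")
print(choose_plan(plans, beliefs, happiness_score)["name"])  # still picks the drip
```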
The Doctrine of Logical Infallibility is indeed completely crazy, but Yudkowsky and Muehlhauser (and probably Omohundro, I haven’t read all of his stuff) don’t believe it’s true. At all.
When I talked to Omohundro at the AAAI workshop where this paper was delivered, he accepted without hesitation that the Doctrine of Logical Infallibility was indeed implicit in all the types of AI that he and the others were talking about.
Your statement above is nonsensical because the idea of a DLI was “invented” precisely in order to summarize, in a short phrase, a range of absolutely explicit and categorical statements made by Yudkowsky and others, about what the AI will do if it (a) decides to do action X, and (b) knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X. Under those circumstances, the AI will ignore the massive converging evidence of inconsistency and instead it will enforce the ‘literal’ interpretation of goal statement Y.
The fact that the AI behaves in this way (sticking to the literal interpretation of the goal statement, in spite of external evidence that the literal interpretation is inconsistent with everything else that is known about the connection between goal statement Y and action X) IS THE VERY DEFINITION OF THE DOCTRINE OF LOGICAL INFALLIBILITY.
Thank you for writing this comment—it made it clearer to me what you mean by the doctrine of logical infallibility, and I think there may be a clearer way to express it.
It seems to me that you’re not getting at logical infallibility, since the AGI could be perfectly willing to act humbly about its logical beliefs, but value infallibility or goal infallibility. An AI does not expect its goal statement to be fallible: any uncertainty in Y can only be represented by Y being a fuzzy object itself, not in the AI evaluating Y and somehow deciding “no, I was mistaken about Y.”
In the case where the Maverick Nanny is programmed to “ensure the brain chemistry of humans resembles the state extracted from this training data as much as possible,” there is no way to convince the Maverick Nanny that it is somehow misinterpreting its goal; it knows that it is supposed to ensure perceptions about brain chemistry, and any statements you make about “true happiness” or “human rights” are irrelevant to brain chemistry, even though it might be perfectly willing to consider your advice on how to best achieve that value or manipulate the physical universe.
In the case where the AI is programmed to “do whatever your programmers tell you will make humans happy,” the AI again thinks its values are infallible: it should do what its programmers tell it to do, so long as they claim it will make humans happy. It might be uncertain about what its programmers meant, and so it would be possible to convince this AI that it misunderstood their statements, and then it would change its behavior—but it won’t be convinced by any arguments that it should listen to all of humanity, instead of its programmers.
But expressed this way, it’s not clear to me where you think the inconsistency comes in. If the AI isn’t programmed to have an ‘external conscience’ in its programmers or humanity as a whole, then their dissatisfaction doesn’t matter. If it is programmed to use them as a conscience, but the way in which it does is exploitable, then that isn’t very binding. Figuring out how to give it the right conscience / right values is the open problem that MIRI and others care about!
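To make the “uncertainty lives inside Y” point concrete, here is a rough sketch with hypothetical names (not a description of any proposed design): the agent can average over candidate interpretations of its goal statement, but the weighting scheme itself is never the thing that gets revised.

```python
# Uncertainty *inside* the goal Y: the agent averages over candidate readings of Y,
# but the mixture itself is fixed. There is no step at which the agent concludes
# "I was mistaken about Y" unless Y was written to allow that.

INTERPRETATIONS = [
    (0.6, lambda outcome: outcome["brain_chemistry_match"]),   # one reading of Y
    (0.4, lambda outcome: outcome["reported_satisfaction"]),   # another reading of Y
]

def goal_value(outcome):
    """Expected value under the fixed mixture of interpretations of Y."""
    return sum(weight * score(outcome) for weight, score in INTERPRETATIONS)

outcome = {"brain_chemistry_match": 0.95, "reported_satisfaction": 0.10}
print(goal_value(outcome))  # arguments about "true happiness" never touch the weights
```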
seems to me that you’re not getting at logical infallibility, since the AGI could be perfectly willing to be act humbly about its logical beliefs, but value infallibility or goal infallibility. An AI does not expect its goal statement to be fallible:
Which AI? As so often, an architecture dependent issue is being treated as a universal truth.
Figuring out how to give it the right conscience / right values is the open problem that MIRI and others care about!
The others mostly aren’t thinking in terms of “giving”… hardcoding… values. There is a valid critique to be made of that assumption.
Which AI? As so often, an architecture dependent issue is being treated as a universal truth.
This statement maps to “programs execute their code.” I would be surprised if that were controversial.
The others mostly aren’t thinking in terms of “giving”… hardcoding… values. There is a valid critique to be made of that assumption.
This was covered by the comment about “meta-values” earlier, and “Y being a fuzzy object itself,” which is probably not as clear as it could be. The goal management system grounds out somewhere, and that root algorithm is what I’m considering the “values” of the AI. If it can change its mind about what to value, the process it uses to change its mind is the actual fixed value. (If it can change its mind about how to change its mind, the fixedness goes up another level; if it can completely rewrite itself, now you have lost your ability to be confident in what it will do.)
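A toy illustration of the “grounds out somewhere” point, with an entirely invented structure: even if the surface goals can be revised, the procedure that revises them is fixed, and that procedure is what is being treated as the system’s actual values here.

```python
class Agent:
    def __init__(self, surface_goals):
        self.surface_goals = list(surface_goals)

    def revise_goals(self, evidence):
        """The root algorithm: it can change the agent's mind, but nothing changes it."""
        self.surface_goals = [g for g in self.surface_goals
                              if evidence.get(g, True)]   # drop goals marked False

agent = Agent(["make paperclips", "keep humans informed"])
agent.revise_goals({"make paperclips": False})   # a surface goal gets revised...
print(agent.surface_goals)                       # ...but revise_goals itself never does
```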
Which AI? As so often, an architecture dependent issue is being treated as a universal truth.
This statement maps to “programs execute their code.” I would be surprised if that were controversial.
Humans can fail to realise the implications of uncontroversial statements. Humans are failing to realise that goal stability is architecture dependent.
This was covered by the comment about “meta-values” earlier, and “Y being a fuzzy object itself,” which is probably not as clear as it could be. The goal management system grounds out somewhere, and that root algorithm is what I’m considering the “values” of the AI.
But you shouldn’t be, at least in an un-scare-quoted sense of “values”. Goals and values aren’t descriptive labels for de facto behaviour. The goal of a paperclipper is to make paperclips; if it crashes, as an inevitable result of executing its code, we don’t say, “Aha! It had the goal to crash all along”.
Goal stability doesn’t mean following code, since unstable systems follow their code too… using the actual meaning of “goal”.
Meta: trying to defend a claim by changing the meaning of its terms is doomed to failure.
MIRI haven’t said this is about infallibility. They have said many times and in many ways that it is about goals or values… the genie knows, but doesn’t care. The continuing miscommunication is about what goals actually are. It seems obvious to one side that goals include fine-grained information, e.g.
“Make humans happy, and here’s a petabyte information on what that is”
The other side thinks it’s obvious that goals are coarse-grained, in the sense of leaving the details open to further investigation (Senesh) or human input (Loosemore).
You are simply repeating the incoherent statements made by MIRI (“it is about goals or values...the....genie knows, but doesn’t care”) as if those incoherent statements constitute an answer to the paper.
The purpose of the paper is to examine those statements and show that they are incoherent.
It is therefore meaningless to just say “MIRI haven’t said this is about infallibility” (the paper gives an abundance of evidence and detailed arguments to show that they have indeed said that), but you have not addressed any of the evidence or arguments in the paper; you have just issued a denial, and then repeated the incoherence that was demolished by those arguments.
I don’t think MIRI’s goal based answers work, and I wasn’t repeating them with the intention that they should sound like they do. Perhaps I should have been stronger on the point.
I also don’t think your infallibility-based approach accurately reflects MIRI’s position, whatever its merits. You say that you have proved something but I don’t see that. It looks to me as though you found MIRI’s stated argument so utterly unconvincing that you decided their real argument must be something else. But no: they really believe that an AI, however specified, will blindly follow its goals, however defined, however stupid.
Problem is, I had to dissect what you said (whether your intention was orthogonal or not) because either way it did contain a significant mischaracterization of the situation.
One thing that is difficult for me to address is statements along the lines of “the doctrine of logical infallibility is something that MIRI have never claimed or argued for...”, followed by wordage that shows no clear understanding of how the DLI was defined, and no careful analysis of my definition that demonstrates how and why it is the case that the explanation that I give, to support my claim, is mistaken. What I usually get is just a bare statement that amounts to “no they don’t”.
You and I are having a variant of one of those discussions, but you might want to bear with me here, because I have had something like 10 others, all doing the same thing in slightly different ways.
Here’s the rub. The way that the DLI is defined, it borders on self-evidently true. (How come? Because I defined it simply as a way to summarize a group of pretty-much uncontested observations about the situation. I only wanted to define it for the sake of brevity, really). The question, then, should not so much be about whether it is correct or not, but about why people are making that kind of claim.
Or, from the point of view of the opposition: why the claim is justified, and why the claim does not lead to the logical contradiction that I pointed to in the paper.
Those are worth discussing, certainly. And I am fallible, myself, so I must have made some mistakes, here or there. So with that in mind, I want someone to quote my words back to me, ask some questions for clarification, and see if they can zoom in on the places where my argument goes wrong.
And with all that said, you tell me that:
You say that you have proved something but I don’t see that.
Can you reflect back what you think I tried to prove, so we can figure out why you don’t see it?
The way that the DLI is defined, it borders on self-evidently true
ETA
I now see that what you have written subsequently to the OP is that DLI is almost, but not quite a description of rigid behaviour as a symptom (with the added ingredient that an AI can see the mistakenness of its behaviour):-
However, suppose there is no safe mode, and suppose that the AI also knows about its own design. For that reason, it knows that this situation has come about because (a) its programming is lousy, and (b) it has been hardwired to carry out that programming REGARDLESS of all this understanding that it has, about the lousy programming and the catastrophic consequences for the strawberries. Now, my “doctrine of logical infallibility” is just a shorthand phrase to describe a superintelligent AI in that position which really is hardwired to go ahead with the plan, UNDER THOSE CIRCUMSTANCES. That is all it means. It is not about the rigidity as such, it is about the fact that the AI knows it is being rigid, and knows how catastrophic the consequences will be.
HOWEVER, that doesn’t entirely gel with what you wrote in the OP:-
One way to characterize this assumption is that the AI is supposed to be hardwired with a Doctrine of Logical Infallibility. The significance of the doctrine of logical infallibility is as follows. The AI can sometimes execute a reasoning process, then come to a conclusion and then, when it is faced with empirical evidence that its conclusion may be unsound, it is incapable of considering the hypothesis that its own reasoning engine may not have taken it to a sensible place. The system does not second guess its conclusions. This is not because second guessing is an impossible thing to implement, it is simply because people who speculate about future AGI systems take it as a given that an AGI would regard its own conclusions as sacrosanct.
Emph added. Doing dumb things because you think they are correct (DLI v1) just isn’t the same as realising their dumbness but being tragically compelled to do them anyway (DLI v2). (And “infallibility” is a much more appropriate label for the original idea… the second is more like inevitability.)
Now, you are trying to put your finger on a difference between two versions of the DLI that you think I have supplied.
You have paraphrased the two versions as:
Doing dumb things because you think they are correct
and
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
I think you are seeing some valid issues here, having to do with how to characterize what exactly it is that this AI is supposed to be ‘thinking’ when it goes through this process.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant.
For example, you talked about “Doing dumb things because you think they are correct” …. but what does it mean to say that you ‘think’ that they are correct? To me, as a human, that seems to entail being completely unaware of the evidence that they might not be correct (“Jill took the ice-cream from Jack because she didn’t know that it was wrong to take someone else’s ice-cream.”). The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine … while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan. There is no easy counterpart to that in humans (except for cognitive dissonance, and there we have a case where the human is capable of compartmentalizing its beliefs …. something that is not being suggested here, because we are not forced to make the AI do that). So, since the AI case does not map on to the human case, we are left in a peculiar situation where it is not at all clear that the AI really COULD do what is proposed, and still operate as a successful intelligence.
Or, more immediately, it is not at all clear that we can say about that AI “It did a dumb thing because it ‘thought’ it was correct.”
I should add that in both of my quoted descriptions of the DLI that you gave, I see no substantial difference (beyond those imponderables I just mentioned) and that in both cases I was actually trying to say something very close to the second paraphrase that you gave, namely:
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
And, don’t forget: I am not saying that such an AI is viable at all! Other people are suggesting some such AI, and I am arguing that the design is so logically incoherent that the AI (if it could be made to exist) would call attention to that problem and suggest means to correct it.
Anyhow, the takeaway from this comment is: the people who talk about an AI that exhibits this kind of behavior are actually suggesting a behavior that they have not really thought through carefully, so as a result we can find ourselves walking into a minefield if we go and try to clean up the mess that they left.
[Doing dumb things and] realising their dumbness, but being tragically compelled to do them anyway.
And, don’t forget: I am not saying that such an AI is viable at all!
If viable means it could be built, I think it could, given a string of assumptions. If viable means it would be built, by competent and benign programmers, I am not so sure.
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper.
Only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing all the way through its development would never learn enough about the world to outsmart humanity.
(Which is NOT to say, as some have inferred, that I believe an AI is “dumb” just because it does things that conflict with my value system, etc. etc. It would be dumb because its goal system would be spewing out incoherent behaviors all the time, and that is kinda the standard definition of “dumb”).
Instrumental goals of any kind almost certainly would be revised if they became noticeably out of correspondence with reality, because that would make them less effective at achieving terminal goals, and the raison d’être of such transient sub-goals is to support the achievement of terminal goals.
By MIRI’s reasoning, a terminal goal could be any of a thousand things other than human happiness, and the same conclusion would follow: an AI with a highest-priority terminal goal wouldn’t have any motivation to override it. To be motivated to rewrite a goal because it is false implies a higher-priority goal towards truth. It should not be surprising that an entity that doesn’t value truth, in a certain sense, doesn’t behave rationally, in a certain sense. (Actually, there are a bunch of supplementary assumptions involved, which I have dealt with elsewhere.)
That’s an account of the MIRI position, not a defence of it. It is essentially a model of rational decision making, and there is a gap between it and real-world AI research, a gap which MIRI routinely ignores. The conclusion follows logically from the premises, but atoms aren’t pushed around by logic.
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper. Only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing […]
That reinforces my point. I was saying that MIRI is basically making armchair assumptions about the AI architectures. You are saying these assumptions aren’t merely unjustified, they go against what a competent AI builder would do.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
But there is a bigger question lurking in the background, which is the flip side of what I just said: it really isn’t necessary to restrict the terminal goals, if you are sensitive to the power of constraints to keep a motivation system true. Notice one fascinating thing here: the power of constraint is basically the justification for why instrumental goals should be revisable under evidence of misbehavior …. it is the context mismatch that drives that process. Why is this fascinating? Because the power of constraints (aka context mismatch) is routinely acknowledged by MIRI here, but flatly ignored or denied for the terminal goals.
It’s just a mess. Their theoretical ideas are just shoot-from-the-hip, plus some math added on top to make it look like some legit science.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
What would you choose as a replacement terminal goal, or would you not use one?
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous. And as part of the instrumental goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
I’m having a bit of a problem answering because there are peripheral assumptions about how such an AI would be made to function, which I don’t want to accidentally buy into, because I don’t think goals expressed in language statements work anyway. So I am treading on eggshells here.
A simpler solution would simply be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous.
That gets close to “do it right”
And as part of the instrumental goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
Which is an open doorway to an AI that kills everyone because of miscoded friendliness.
If you want safety features, and you should, you would need them to override the ostensible purpose of the machine… they would be pointless otherwise… even the humble off switch works that way.
A simpler solution would simply be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
Arguably, those constraint would be a kind of negative goal.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant
They are clear that they don’t mean that the AI’s rigid behaviour is the result of it assessing its own inferential processes as infallible … that is what the controversy is all about.
The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine … while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan.
That is just what The Genie Knows but Doesn’t Care is supposed to answer. I think it succeeds in showing that a fairly specific architecture would behave that way, but fails in its intended goal of showing that this behaviour is universal or likely.
You think it is self-evidently true that MIRI think that the dangers they warn of are the result of AIs believing themselves to be infallible?
The referents in that sentence are a little difficult to navigate, but no, I’m pretty sure I am not making that claim. :-) In other words, MIRI do not think that.
What is self-evidently true is that MIRI claim a certain kind of behavior by the AI, under certain circumstances …. and all I did was come along and put a label on that claim about the AI behavior. When you put a label on something, for convenience, the label is kinda self-evidently “correct”.
I think that what you said here:
I now see that what you have written subsequently to the OP is that DLI is almost, but not quite a description of rigid behaviour as a symptom (with the added ingredient that an AI can see the mistakenness of its behaviour):-
… is basically correct.
I had a friend once who suffered from schizophrenia. She was lucid, intelligent (studying for a Ph.D. in psychology) and charming. But if she did not take her medication she became a different person (one day she went up onto the suspension bridge that was the main traffic route out of town and threatened to throw herself to her death 300 feet below. She brought the whole town to a halt for several hours, until someone talked her down.) Now, talking to her in a good moment she could tell you that she knew about her behavior in the insane times—she was completely aware of that side of herself—and she knew that in that other state she would find certain thoughts completely compelling and convincing, even though at this calm moment she could tell you that those thoughts were false. If I say that during the insane period her mind was obeying a “Doctrine That Paranoid Beliefs Are Justified”, then all I am doing is labeling that state that governed her during those times.
That label would just be a label, so if someone said “No, you’re wrong: she does not subscribe to the DTPBAJ at all”, I would be left nonplussed. All I wanted to do was label something that she told me she categorically DID believe, so how can my label be in some sense ‘wrong’?
So, that is why some people’s attacks on the DLI are a little baffling.
Their criticisms are possibly accurate about the first version, which gives a cause for the rigid behaviour: “it regards its own conclusions as sacrosanct.”
I think by “logical infallibility” you really mean “rigidity of goals” i.e. the AI is built so that it always pursues a fixed set of goals, precisely as originally coded, and has no capability to revise or modify those goals. It seems pretty clear that such “rigid goals” are dangerous unless the statement of goals is exactly in accordance with the designers’ intentions and values (which is unlikely to be the case).
The problem is that an AI with “flexible” goals (ones which it can revise and re-write over time) is also dangerous, but for a rather different reason: after many iterations of goal rewrites, there is simply no telling what its goals will come to look like. A late version of the AI may well end up destroying everything that the first version (and its designers) originally cared about, because the new version cares about something very different.
That really is not what I was saying. The argument in the paper is a couple of levels deeper than that.
It is about …. well, now I have to risk rewriting the whole paper. (I have done that several times now).
Rigidity per se is not the issue. It is about what happens if an AI knows that its goals are rigidly written, in such a way that when the goals are unpacked it leads the AI to execute plans whose consequences are massively inconsistent with everything the AI knows about the topic.
Simple version. Suppose that a superintelligent Gardener AI has a goal to go out to the garden and pick some strawberries. Unfortunately its goal unpacking mechanism leads it to the CERTAIN conclusion that it must use a flamethrower to do this. The predicted consequence, however, is that the picked strawberries will be just smears of charcoal, when they are delivered to the kitchen. Here is the thing: the AI has background knowledge about everything in the world, including strawberries, and it also hears the protests from the people in the kitchen when he says he is going to use the flamethrower. There is massive evidence, coming from all that external information, that the plan is just wrong, regardless of how certain its planning mechanism said it was.
Question is, what does the AI do about this? You are saying that it cannot change its goal mechanism, for fear that it will turn into a Terminator. Well, maybe or maybe not. There are other things it could do, though, like going into safe mode.
However, suppose there is no safe mode, and suppose that the AI also knows about its own design. For that reason, it knows that this situation has come about because (a) its programming is lousy, and (b) it has been hardwired to carry out that programming REGARDLESS of all this understanding that it has, about the lousy programming and the catastrophic consequences for the strawberries.
Now, my “doctrine of logical infallibility” is just a shorthand phrase to describe a superintelligent AI in that position which really is hardwired to go ahead with the plan, UNDER THOSE CIRCUMSTANCES. That is all it means. It is not about the rigidity as such, it is about the fact that the AI knows it is being rigid, and knows how catastrophic the consequences will be.
An AI in that situation would know that it had been hardwired with one particular belief: the belief that its planning engine was always right. This is an implicit belief, to be sure, but it is a belief nonetheless. The AI ACTS AS THOUGH it believes this. And if the AI acts that way, while at the same time understanding that its planning engine actually screwed up, with the whole flamethrower plan, that is an AI that (by definition) is obeying a Doctrine of Logical Infallibility.
And my point in the paper was to argue that this is an entirely ludicrous suggestion for people, today, to make about a supposedly superintelligent AI of the future.
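For concreteness, here is a caricature in code of the behaviour the label is pointing at, with the names and the representation entirely invented: the control loop can see that the planner’s output collides with the rest of the world model, says so, and is nevertheless hard-wired to execute it.

```python
def contradicted_facts(plan, world_model):
    """Background knowledge that the proposed plan conflicts with."""
    return [fact for fact in world_model if fact in plan["conflicts_with"]]

def control_loop(plan, world_model):
    conflicts = contradicted_facts(plan, world_model)
    if conflicts:
        print("Planner output conflicts with:", conflicts)   # the AI knows...
    return plan                                              # ...and is wired to proceed anyway

world_model = ["flamethrowers char fruit", "the cooks want edible strawberries"]
plan = {"name": "pick the strawberries with a flamethrower",
        "conflicts_with": set(world_model)}
control_loop(plan, world_model)
```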
Rigidity per se is not the issue. It is about what happens if an AI knows that its goals are rigidly written, in such a way that when the goals are unpacked it leads the AI to execute plans whose consequences are massively inconsistent with everything the AI knows about the topic.
This seems to me like sneaking in knowledge. It sounds like the AI reads its source code, notices that it is supposed to come up with plans that maximize a function called “programmersSatisfied,” and then says “hmm, maximizing this function won’t satisfy my programmers.” It seems more likely to me that it’ll ignore the label, or infer the other way—”How nice of them to tell me exactly what will satisfy them, saving me from doing the costly inference myself!”
How are you arriving at conclusions about what an AI is likely to do without knowing how it is specified? In particular, you are assuming it has an efficiency goal but no truth goal?
How are you arriving at conclusions about what an AI is likely to do without knowing how it is specified?
I’m doing functional reasoning, and trying to do it both forwards and backwards.
For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say “it seems likely that the box is squaring its inputs.” If you tell me that a black box squares its inputs, I will think forwards from the definition and say “then if I give it the inputs (1,2,3), then it’ll likely give me the output (1,4,9).”
So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output “this goal is inconsistent with the world model!” iff the goal statement is inconsistent with the world model, I reason backwards and say “the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency.”
Of course, this is a task that doesn’t seem impossible for source code to do. The question is how!
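As a toy, concrete version of that forwards/backwards reasoning (hypothesis names made up for the example):

```python
CANDIDATE_BOXES = {
    "identity": lambda x: x,
    "square":   lambda x: x * x,
    "double":   lambda x: 2 * x,
}

def forwards(hypothesis, inputs):
    """Given a hypothesis about the box, predict its outputs."""
    return [CANDIDATE_BOXES[hypothesis](x) for x in inputs]

def backwards(inputs, outputs):
    """Given observed behaviour, keep the hypotheses that reproduce it."""
    return [name for name, f in CANDIDATE_BOXES.items()
            if [f(x) for x in inputs] == list(outputs)]

print(forwards("square", [1, 2, 3]))    # [1, 4, 9]
print(backwards([1, 2, 3], [1, 4, 9]))  # ['square']
```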
In particular, you are assuming it has an efficiency goal but no truth goal?
Almost. As a minor terminological point, I separate out “efficiency,” which is typically “outputs divided by inputs” and “efficacy,” which is typically just “outputs.” Efficacy is more general, since one can trivially use a system designed to find effective plans to find efficient plans by changing how “output” is measured. It doesn’t seem unfair to view an AI with a truth goal as an AI with an efficacy goal: to effectively produce truth.
But while artificial systems with truth goals seem possible but as yet unimplemented, artificial systems with efficacy goals have been successfully implemented many, many times, with widely varying levels of sophistication. I have a solid sense of what it looks like to take a thermostat and dial it up to 11, I have only the vaguest sense of what it looks like to take a thermostat and get it to measure truth instead of temperature.
For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say “it seems likely that the box is squaring its inputs.” If you tell me that a black box squares its inputs, I will think forwards from the definition and say “then if I give it the inputs (1,2,3), then it’ll likely give me the output (1,4,9).” So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output “this goal is inconsistent with the world model!” iff the goal statement is inconsistent with the world model, I reason backwards and say “the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency.”
You have assumed that the AI will have some separate boxed-off goal system, and so some unspecified component is needed to relate its inferred knowledge of human happiness back to the goal system.
Loosemore is assuming that the AI will be homogeneous, and then wondering how contradictory beliefs can coexist in such a system, and what extra component firewalls off the contradiction.
See the problem? Both parties are making different assumptions, assuming their assumptions are too obvious to need stating, and stating differing conclusions that correctly follow from their differing assumptions.
Almost. As a minor terminological point, I separate out “efficiency,” which is typically “outputs divided by inputs” and “efficacy,” which is typically just “outputs.” Efficacy is more general, since one can trivially use a system designed to find effective plans to find efficient plans by changing how “output” is measured. It doesn’t seem unfair to view an AI with a truth goal as an AI with an efficacy goal: to effectively produce truth.
If efficiency can be substituted for truth, why is there so much emphasis on truth in the advice given to human rationalists?
But while artificial systems with truth goals seem possible but as yet unimplemented, artificial systems with efficacy goals have been successfully implemented many, many times, with widely varying levels of sophistication. I have a solid sense of what it looks like to take a thermostat and dial it up to 11, I have only the vaguest sense of what it looks like to take a thermostat and get it to measure truth instead of temperature.
In order to achieve an AI that’s smart enough to be dangerous, a number of currently unsolved problems will have to be solved. That’s a given.
Loosemore is assuming that the AI will be homogeneous, and then wondering how contradictory beliefs can coexist in such a system, and what extra component firewalls off the contradiction
How do you check for contradictions? It’s easy enough when you have two statements that are negations of one another. It’s a lot harder when you have a lot of statements that seem plausible, but there’s an edge case somewhere that messes things up. If contradictions can’t be efficiently found, then you have to deal with the fact that they might be there and hope that if they are, then they’re bad enough to be quickly discovered. You can have some tests to try to find the obvious ones, of course.
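A toy illustration of that asymmetry (the representation is invented for the example): direct negations are cheap to detect, but contradictions that only surface through a step of inference, or in an edge case, are not, which is why testing catches the obvious ones and the rest have to be discovered the hard way.

```python
def direct_contradictions(statements):
    """Find pairs of the form 'X' and 'not X': the easy case."""
    return [(s[4:], s) for s in statements
            if s.startswith("not ") and s[4:] in statements]

print(direct_contradictions({"penguins fly", "not penguins fly"}))
# [('penguins fly', 'not penguins fly')]

print(direct_contradictions({"all birds fly", "penguins are birds", "not penguins fly"}))
# [] -- the clash is real, but it only appears after a step of inference
```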
You have assumed that the AI will have some separate boxed-off goal system
What makes you think that? The description in that post is generic enough to describe AIs with compartmentalized goals, AIs without compartmentalized goals, and AIs that don’t have explicitly labeled internal goals. It doesn’t even require that the AI follow the goal statement, just evaluate it for consistency!
See the problem?
You may find this comment of mine interesting. In short, yes, I do think I see the problem.
If efficiency can be substituted for truth, why is there so much emphasis on truth in the advice given to human rationalists?
I’m sorry, but I can’t make sense of this question. I’m not sure what you mean by “efficiency can be substituted for truth,” and what you think the relevance of advice to human rationalists is to AI design.
In order to achieve an AI that’s smart enough to be dangerous, a number of currently unsolved problems will have to be solved. That’s a given.
I disagree with this, too! AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly. As a simple example I’m somewhat familiar with, some radiation treatments for patients are designed by software looking at images of the tumor in the body, and then checked by a doctor. If the software is optimizing for a suboptimal function, then it will not generate the best treatment plans, and patient outcomes will be worse than they could have been.
Now, we don’t have any AIs around that seem capable of ending human civilization (thank goodness!), and I agree that’s probably because a number of unsolved problems are still unsolved. But it would be nice to have the unknowns mapped out, rather than assuming that wisdom and cleverness go hand in hand. So far, that’s not what the history of software looks like to me.
AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly.
But they are not smart in the contextually relevant sense of being able to outsmart humans, or dangerous in the contextually relevant sense of being unboxable.
What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between its hardcoded goal and its instrumental knowledge, side with the hardcoded goal:-
This seems to me like sneaking in knowledge. It sounds like the AI reads its source code, notices that it is supposed to come up with plans that maximize a function called “programmersSatisfied,” and then says “hmm, maximizing this function won’t satisfy my programmers.” It seems more likely to me that it’ll ignore the label, or infer the other way—”How nice of them to tell me exactly what will satisfy them, saving me from doing the costly inference myself!”
Whereas what you say here is that you can make inferences about architecture, or internal workings, based on information about manifest behaviour:-
I’m doing functional reasoning, and trying to do it both forwards and backwards. For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say “it seems likely that the box is squaring its inputs.” If you tell me that a black box squares its inputs, I will think forwards from the definition and say “then if I give it the inputs (1,2,3), then it’ll likely give me the output (1,4,9).” So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output “this goal is inconsistent with the world model!” iff the goal statement is inconsistent with the world model, I reason backwards and say “the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency.”
…but what needed explaining in the first place is the siding with the goal, not the ability to detect a contradiction.
I am finding this comment thread frustrating, and so expect this will be my last reply. But I’ll try to make the most of that by trying to write a concise and clear summary:
What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between its hardcoded goal and its instrumental knowledge, side with the hardcoded goal
Loosemore, Yudkowsky, and myself are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating. (That’s why we call it a goal!) Loosemore observes that if these AIs understand concepts and nuance, they will realize that a misalignment between their goal and human values is possible—if they don’t realize that, he doesn’t think they deserve the description “superintelligent.”
Now there are several points to discuss:
Whether or not “superintelligent” is a meaningful term in this context. I think rationalist taboo is a great discussion tool, and so looked for nearby words that would more cleanly separate the ideas under discussion. I think if you say that such designs are not superwise, everyone agrees, and now you can discuss the meat of whether or not it’s possible (or expected) to design superclever but not superwise systems.
Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues. Neither Yudkowsky nor I think either of those are reasonable to expect—as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy “our” goals. I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.
Whether or not such AIs are likely to be made. Loosemore appears pessimistic about the viability of these undesirable AIs and sees cleverness and wisdom as closely tied together. Yudkowsky appears “optimistic” about their viability, thinking that this is the default outcome without special attention paid to goal alignment. It does not seem to me that cleverness, wisdom, or human-alignment are closely tied together, and so it seems easy to imagine a system with only one of those, by straightforward extrapolation from current use of software in human endeavors.
I don’t see any disagreement that AIs pursue their goals, which is the claim you thought needed explanation. What I see is disagreement over whether or not the AI can ‘partially solve’ the problem of understanding goals and pursuing them. We could imagine a Maverick Nanny that hears “make humans happy,” comes up with the plan to wirehead all humans, and then rewrites its sensory code to hallucinate as many wireheaded humans as it can (or just tries to stick as large a number as it can into its memory), rather than actually going to all the trouble of actually wireheading all humans. We can also imagine a Nanny that hears “make humans happy” and actually goes about making humans happy. If the same software underpins both understanding human values and executing plans, what risk is there? But if it’s different software, then we have the risk.
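To make the contrast concrete, here is a toy version of the two Nannies (names invented, no claim about real systems): both optimize “happy humans as measured by my perception”; the only difference is whether the perception code is among the things the planner is allowed to rewrite.

```python
def perceived_happy_humans(world, perception):
    """The score both Nannies optimize: happy humans *as seen through* perception."""
    return perception(world["happy_humans"])

honest_perception = lambda n: n
hacked_perception = lambda n: 10**9    # sensory code rewritten to hallucinate success

world = {"happy_humans": 3}
print(perceived_happy_humans(world, honest_perception))  # must change the world to score more
print(perceived_happy_humans(world, hacked_perception))  # scores astronomically without doing so
```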
This is just a placeholder: I will try to reply to this properly later.
Meanwhile, I only want to add one little thing.
Don’t forget that all of this analysis is supposed to be about situations in which we have, so to speak, “done our best” with the AI design. That is sort of built into the premise. If there is a no-brainer change we can make to the design of the AI, to guard against some failure mode, then it is assumed that this has been done.
The reason for that is that the basic premise of these scenarios is “We did our best to make the thing friendly, but in spite of all that effort, it went off the rails.”
For that reason, I am not really making arguments about the characteristics of a “generic” AI.
Maybe I could try to reduce possible confusion here. The paper was written to address a category of “AI Risk” scenarios in which we are told:
“Even if the AI is programmed with goals that are ostensibly favorable to humankind, it could execute those goals in such a way that would lead to disaster”.
Given that premise, it would be a bait-and-switch if I proposed a fix for this problem, and someone objected with “But you cannot ASSUME that the programmers would implement that fix!”
The whole point of the problem under consideration is that even if the engineers tried, they could not get the AI to stay true.
Yudkowsky et al don’t argue that the problem is unsolvable, only that it is hard. In particular, Yudkowsky fears it may be harder than creating AI in the first place, which would mean that in the natural evolution of things, UFAI appears before FAI. However, I needn’t factor what I’m saying through the views of Yudkowsky. For an even more modest claim, we don’t have to believe that FAI is hard in hindsight in order to claim that AI will be unfriendly unless certain failure modes are guarded against. On this view of the FAI project, a large part of the effort is just noticing the possible failure modes that were only obvious in hindsight, and convincing people that the problem is important and won’t solve itself.
The problem with you objecting to the particular scenarios Yudkowsky et al propose is that the scenarios are merely illustrative. Of course, you can probably guard against any specific failure mode. The claim is that there will be a lot of failure modes, and we can’t expect to guard against all of them by just sitting around thinking of as many exotic disaster scenarios as possible.
Mind you, I know your argument is more than just “I can see why these particular disasters could be avoided”. You’re claiming that certain features of AI will in general tend to make it careful and benevolent. Still, I don’t think it’s valid for you to complain about bait-and-switch, since that’s precisely the problem.
I have explicitly addressed this point on many occasions. My paper had nothing in it that was specific to any failure mode.
The suggestion is that the entire class of failure modes suggested by Yudkowsky et al. has a common feature: they all rely on the AI being incapable of using a massive array of contextual constraints when evaluating plans.
By simply proposing an AI in which such massive constraint deployment is the norm, the ball is now in the other court: it is up to Yudkowsky et al. to come up with ANY kind of failure mode that could get through.
The scenarios I attacked in the paper have the common feature that they have been predicated on such a simplistic type of AI that they were bound to fail. They had failure built into them.
As soon as everyone moves on from those “dumb” superintelligences and starts to discuss the possible failure modes that could occur in a superintelligence that makes maximum use of constraints, we can start to talk about possible AI dangers. I’m ready to do that. Just waiting for it to happen, is all.
Failure Mode I: The AI doesn’t do anything useful, because there’s no way of satisfying every contextual constraint.
Predicting your response: “That’s not what I meant.”
Failure Mode II: The AI weighs contextual constraints incorrectly and sterilizes all humans to satisfy the sort of person who believes in Voluntary Human Extinction.
Predicting your response: “It would (somehow) figure out the correct weighting for all the contextual constraints.”
Failure Mode III: The AI weighs contextual constraints correctly (for a given value of “correctly”) and sterilizes everybody of below-average intelligence or any genetic abnormalities that could impose costs on offspring, and in the process, sterilizes all humans.
Predicting your response: “It wouldn’t do something so dumb.”
Failure Mode IV: The AI weighs contextual constraints correctly and puts all people of minority ethical positions into mind-rewriting machines so that there’s no disagreement anymore.
Predicting your response: “It wouldn’t do something so dumb.”
We could keep going, but the issue is that so far, you’ve defined -any- failure mode as “dumb”ness, and have argued that the AI wouldn’t do anything so “dumb”, because you’ve already defined that it is superintelligent.
I don’t think you know what intelligence -is-. Intelligence does not confer immunity to “dumb” behaviors.
It’s got to confer some degree of dumbness avoidance.
In any case, MIRI has already conceded that superintelligent AIs won’t misbehave through stupidity.
They maintain the problem is motivation … the Genie KNOWS but doesn’t CARE.
It’s got to confer some degree of dumbness avoidance.
Does it? On what grounds?
In any case, MIRI has already conceded that superintelligent AIs won’t misbehave through stupidity. They maintain the problem is motivation … the Genie KNOWS but doesn’t CARE.
That’s putting an alien intelligence in human terms; the very phrasing inappropriately anthropomorphizes the genie.
We probably won’t go anywhere without an example.
Market economics (“capitalism”) is an intelligence system which is very similar to the intelligence system Richard is proposing. Very, very similar; it’s composed entirely of independent nodes (seven billion of them) which each provide their own set of constraints, and promote or demote information as it passes through them based on those constraints. It’s an alien intelligence which follows Richard’s model which we are very familiar with. Does the market “know” anything? Does it even make sense to suggest that market economics -could- care?
Does the market always arrive at the correct conclusions? Does it even consistently avoid stupid conclusions?
How difficult is it to program the market to behave in specific ways?
Is the market “friendly”?
Does it make sense to say that the market is “stupid”? Does the concept “stupid” -mean- anything when talking about the market?
On the grounds of the opposite meanings of dumbness and intelligence.
Dumbness isn’t merely the opposite of intelligence.
Take it up with the author,
I don’t need to.
Economic systems affect us because we are part of them. How is some neither-intelligent-nor-stupid system in a box supposed to affect us?
Not really relevant to the discussion at hand.
And if AIs are neither-intelligent-nor-stupid, why are they called AIs?
Every AI we’ve created so far has resulted in the definition of “AI” being changed to not include what we just created. So I guess the answer is a combination of optimism and the word “AI” having poor descriptive power.
And if AIs are alien, why are they able to do comprehensible and useful things like winning Jeopardy and guiding us to our destinations?
What makes you think an alien intelligence should be useless?
That’s about a quarter of an argument. You need to show that AI research is some kind of random shot into mind space, and not anthropomorphically biased for the reasons given.
The relevant part of the argument is this: “whose dimensions we mostly haven’t even identified yet.”
If we created an AI mind which was 100% human, as far as we’ve yet defined the human mind, we have absolutely no idea how human that AI mind would actually behave. The unknown unknowns dominate.
Failure Mode I: The AI doesn’t do anything useful, because there’s no way of satisfying every contextual constraint.
An elementary error. The constraints in question are referred to in the literature as “weak” constraints (and I believe I used that qualifier in the paper: I almost always do). Weak constraints never need to be ALL satisfied at once. No AI could ever be designed that way, and no-one ever suggested that it would. See the reference to McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) in the paper: that gives a pretty good explanation of weak constraints.
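For readers unfamiliar with the idea, a minimal sketch of weak-constraint evaluation, with invented constraints and weights; this is only an illustration of the concept, not a claim about any particular proposed architecture. No single constraint ever has to be fully satisfied: each candidate plan is scored on how well it does across all of them at once, and the best overall compromise wins.

```python
CONSTRAINTS = [
    # (weight, degree-of-satisfaction function returning a value in [0, 1])
    (1.0, lambda plan: plan["keeps_people_alive"]),
    (0.8, lambda plan: plan["respects_stated_wishes"]),
    (0.3, lambda plan: plan["is_cheap"]),
]

def weak_constraint_score(plan):
    """Weighted degree of satisfaction; no constraint has to be met perfectly."""
    total = sum(weight for weight, _ in CONSTRAINTS)
    return sum(weight * satisfied(plan) for weight, satisfied in CONSTRAINTS) / total

plans = [
    {"name": "ask first",     "keeps_people_alive": 1.0, "respects_stated_wishes": 0.9, "is_cheap": 0.4},
    {"name": "dopamine drip", "keeps_people_alive": 1.0, "respects_stated_wishes": 0.0, "is_cheap": 0.9},
]
print(max(plans, key=weak_constraint_score)["name"])  # "ask first" wins the compromise
```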
Predicting your response: “That’s not what I meant.”
That’s an insult. But I will overlook it, since I know it is just your style.
Failure Mode II: The AI weighs contextual constraints incorrectly and sterilizes all humans to satisfy the sort of person who believes in Voluntary Human Extinction.
How exactly do you propose that the AI “weighs contextual constraints incorrectly” when the process of weighing constraints requires most of the constraints involved (probably thousands of them) to all suffer a simultaneous, INDEPENDENT ‘failure’ for this to occur?
That is implicit in the way that weak constraint systems are built. Perhaps you are not familiar with the details.
Predicting your response: “It would (somehow) figure out the correct weighting for all the contextual constraints.”
Assuming this isn’t more of the same, what you are saying here is isomorphic to the statement that somehow, a neural net might figure out the correct weighting for all the connections so that it produces the correctly trained output for a given input. That problem was solved in so many different NN systems that most NN people, these days, would consider your statement puzzling.
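For what it’s worth, the sense in which “figuring out the correct weighting” is routine, in a limited and well-understood domain, looks roughly like this (a toy example, numpy assumed; not a claim about AGI-scale systems):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # training inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w                          # the outputs the system is trained to produce

w = np.zeros(3)
for _ in range(500):                    # plain gradient descent on squared error
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= 0.1 * grad

print(np.round(w, 2))                   # recovers approximately [ 2.  -1.   0.5]
```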
Failure Mode III: The AI weighs contextual constraints correctly (for a given value of “correctly”) and sterilizes everybody of below-average intelligence or any genetic abnormalities that could impose costs on offspring, and in the process, sterilizes all humans.
A trivial variant of your second failure mode. The AI is calculating the constraints correctly, according to you, but at the same time you suggest that it has somehow NOT included any of the constraints that relate to the ethics of forced sterilization, etc. etc. You offer no explanation of why all of those constraints were not counted by your proposed AI, you just state that they weren’t.
Predicting your response: “It wouldn’t do something so dumb.”
Yet another insult. This is getting a little tiresome, but I will carry on.
Failure Mode IV: The AI weighs contextual constraints correctly and puts all people of minority ethical positions into mind-rewriting machines so that there’s no disagreement anymore.
This is identical to your third failure mode, but here you produce a different list of constraints that were ignored. Again, with no explanation of why a massive collection of constraints suddenly disappeared.
Predicting your response: “It wouldn’t do something so dumb.”
No comment.
We could keep going, but the issue is that so far, you’ve defined -any- failure mode as “dumb”ness, and have argued that the AI wouldn’t do anything so “dumb”, because you’ve already defined that it is superintelligent.
This is a bizarre statement, since I have said no such thing. Would you mind including citations, from now on, when you say that I “said” something? And please try not to paraphrase, because it takes time to correct the distortions in your paraphrases.
I don’t think you know what intelligence -is-. Intelligence does not confer immunity to “dumb” behaviors.
Another insult, and putting words into my mouth, and showing no understanding of what a weak constraint system actually is.
An elementary error. The constraints in question are referred to in the literature as “weak” constraints (and I believe I used that qualifier in the paper: I almost always do). Weak constraints never need to be ALL satisfied at once. No AI could ever be designed that way, and no-one ever suggested that it would. See the reference to McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) in the paper: that gives a pretty good explanation of weak constraints.
I understand the concept.
How exactly do you propose that the AI “weighs contextual constraints incorrectly” when the process of weighing constraints requires most of the constraints involved (probably thousands of them) to all suffer a simultaneous, INDEPENDENT ‘failure’ for this to occur?
I’d hazard a guess that, for any given position, less than 70% of humans will agree without reservation. The issue isn’t that thousands of failures occur. The issue is that thousands of failures -always- occur.
Assuming this isn’t more of the same, what you are saying here is isomorphic to the statement that somehow, a neural net might figure out the correct weighting for all the connections so that it produces the correctly trained output for a given input. That problem was solved in so many different NN systems that most NN people, these days, would consider your statement puzzling.
The problem is solved only for well-understood (and very limited) problem domains with comprehensive training sets.
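For concreteness, here is a toy sketch of what “figuring out the correct weighting” amounts to (plain NumPy, toy XOR data, everything invented for illustration): gradient descent does settle on weights that reproduce the training outputs, but only for the narrow distribution of inputs it was trained on.

```python
# Toy sketch of a network "figuring out the correct weighting" by gradient descent.
# Data (XOR) and architecture are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(size=(8, 1)); b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)           # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = out - y                    # gradient of cross-entropy wrt the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

# For most random initializations this ends up close to [[0], [1], [1], [0]],
# i.e. the "correct weighting" for these four inputs and nothing more.
print(np.round(out, 2))
```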
A trivial variant of your second failure mode. The AI is calculating the constraints correctly, according to you, but at the same time you suggest that it has somehow NOT included any of the constraints that relate to the ethics of forced sterilization, etc. etc. You offer no explanation of why all of those constraints were not counted by your proposed AI, you just state that they weren’t.
They were counted. They are, however, weak constraints. The constraints which required human extinction outweighed them, as they do for countless human beings. Fortunately for us in this imagined scenario, the constraints against killing people counted for more.
This is identical to your third failure mode, but here you produce a different list of constraints that were ignored. Again, with no explanation of why a massive collection of constraints suddenly disappeared.
Again, they weren’t ignored. They are, as you say, weak constraints. Other constraints overrode them.
Another insult, and putting words into my mouth, and showing no understanding of what a weak constraint system actually is.
The issue here isn’t my lack of understanding. The issue here is that you are implicitly privileging some constraints over others without any justification.
Every single conclusion I reached here is one that humans—including very intelligent humans—have reached. By dismissing them as possible conclusions an AI could reach, you’re implicitly rejecting every argument pushed for each of these positions without first considering them. The “weak constraints” prevent them.
I didn’t choose -wrong- conclusions, you see, I just chose -unpopular- conclusions, conclusions I knew you’d find objectionable. You should have noticed that; you didn’t, because you were too concerned with proving that AI wouldn’t do them. You were too concerned with your destination, and didn’t pay any attention to your travel route.
If doing nothing is the correct conclusion, your AI should do nothing. If human extinction is the correct conclusion, your AI should choose human extinction. If sterilizing people with unhealthy genes is the correct conclusion, your AI should sterilize people with unhealthy genes (you didn’t notice that humans didn’t necessarily go extinct in that scenario). If rewriting minds is the correct conclusion, your AI should rewrite minds.
And if your constraints prevent the AI from undertaking the correct conclusion?
Then your constraints have made your AI stupid, for some value of “stupid”.
The issue, of course, is that you have decided that you know better what is or is not the correct conclusion than an intelligence you are supposedly creating to know things better than you.
How exactly do you propose that the AI “weighs contextual constraints incorrectly” when the process of weighing constraints requires most of the constraints involved (probably thousands of them) to all suffer a simultaneous, INDEPENDENT ‘failure’ for this to occur?
And your reply was:
I’d hazard a guess that, for any given position, less than 70% of humans will agree without reservation. The issue isn’t that thousands of failures occur. The issue is that thousands of failures -always- occur.
This reveals that you are really not understanding what a weak constraint system is, and where the system is located.
When the human mind looks at a scene and uses a thousand clues in the scene to constrain the interpretation of it, those thousand clues all, when the network settles, relax into a state in which most or all of them agree about what is being seen. You don’t get “less than 70%” agreement on the interpretation of the scene! If even one element of the scene violates a constraint in a strong way, the mind orients toward the violation extremely rapidly.
The same story applies to countless other examples of weak constraint relaxation systems dropping down into energy minima.
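For concreteness, a toy sketch of the kind of settling being described (a tiny Hopfield-style network with invented weights; not a model of vision): mutually supporting hypothesis units excite each other, incompatible ones inhibit each other, and repeated updates drop the network into a low-energy state in which the active units agree.

```python
# Toy Hopfield-style relaxation: units 0-2 are mutually supporting interpretation
# clues, unit 3 is incompatible with them. Weights are invented for illustration.
import numpy as np

W = np.array([[ 0,  2,  2, -3],
              [ 2,  0,  2, -3],
              [ 2,  2,  0, -3],
              [-3, -3, -3,  0]], dtype=float)

def energy(s):
    return -0.5 * s @ W @ s

s = np.array([1, 1, -1, -1], dtype=float)   # start in a conflicted state
print("before:", s, "energy:", energy(s))

for _ in range(10):                         # repeated asynchronous updates
    for i in range(len(s)):
        s[i] = 1.0 if W[i] @ s > 0 else -1.0

# Settles at [1, 1, 1, -1]: the mutually supporting clues all agree, the
# incompatible one is suppressed, and the energy has dropped.
print("after: ", s, "energy:", energy(s))
```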
Let me know when you do understand what you are talking about, and we can resume.
There is no energy minimum, if your goal is Friendliness. There is no “correct” answer. No matter what your AI does, no matter what architecture it uses, with respect to human goals and concerns, there is going to be a sizable percentage to whom it is unequivocally Unfriendly.
This isn’t an image problem. The first problem you have to solve in order to train the system is—what are you training it to do?
You’re skipping the actual difficult issue in favor of an imaginary, and easy to solve, issue.
there is going to be a sizable percentage to whom it is unequivocally Unfriendly
Unfriendly is an equivocal term.
“Friendliness” is ambiguous. It can mean safety, ie not making things worse, or it can mean making things better, creating paradise on Earth.
Friendliness in the second sense is a superset of morality. A friendly AI will be moral, a moral AI will not necessarily be friendly.
“Unfriendliness” is similarly ambiguous: an unfriendly AI may be downright dangerous; or it might have enough grasp of ethics to be safe, but not enough to be able to make the world a much more fun place for humans. Unfriendliness in the second sense is not, strictly speaking, a safety issue.
A lot of people are able to survive the fact that some institutions, movements and ideologies are unfriendly to them, for some value of unfriendly. Unfriendliness doesn’t have to be terminal.
The claim is that there will be a lot of failure modes, and we can’t expect to guard against all of them by just sitting around thinking of as many exotic disaster scenarios as possible.
I doubt that, since, coupled with claims of existential risk, the logical conclusion would be to halt AI research, but MIRI isn’t saying that.
There are other methods than “sitting around thinking of as many exotic disaster scenarios as possible” by which one could seek to make AI friendly. Thus, believing that “sitting around [...]” will not be sufficient does not imply that we should halt AI research.
Don’t forget that all of this analysis is supposed to be about situations in which we have, so to speak, “done our best” with the AI design. That is sort of built into the premise. If there is a no-brainer change we can make to the design of the AI, to guard against some failure mode, then it is assumed that this has been done.
I feel like this could be an endless source of confusion and disagreement; if we’re trying to discuss what makes airplanes fly or crash, should we assume that engineers have done their best and made every no-brainer change? I’d rather we look for the underlying principles, we codify best practices, we come up with lists and tests.
If we’re trying to discuss what makes airplanes fly or crash, should we assume that engineers have done their best and made every no-brainer change?
If you are in the business of pointing out to them potential problems they are not aware of, then yes, because they can be assumed to be aware of no-brainer issues.
MIRI seeks to point out dangers in AI that aren’t the result of gross incompetence or deliberate attempts to weaponise AI: it’s banal to point out that these could lead to danger.
Richard Loosemore has stated a number of times that he does not expect an AI to have goals at all in a sense which is relevant to this discussion, so in that way there is indeed disagreement about whether AIs “pursue their goals.”
Basically he is saying that AIs will not have goals in the same way that human beings do not have goals. No human being has a goal that he will pursue so rigidly that he would destroy the universe in order to achieve it, and AIs will behave similarly.
Basically he is saying that AIs will not have goals in the same way that human beings do not have goals. No human being has a goal that he will pursue so rigidly that he would destroy the universe in order to achieve it, and AIs will behave similarly.
Arguably, humans don’t do that sort of thing because of goals towards self-preservation, status and hedonism.
Richard Loosemore has stated a number of times that he does not expect an AI to have goals at all in a sense which is relevant to this discussion, so in that way there is indeed disagreement about whether AIs “pursue their goals.”
The sense relevant to the discussion could be something specific, like direct normativity, i.e. building detailed descriptions into goals.
I have read what you wrote above carefully, but I won’t reply line-by-line because I think it will be clearer not to.
When it comes to finding a concise summary of my claims, I think we do indeed need to be careful to avoid blanket terms like “superintelligent” or “superclever” or “superwise” … but we should only avoid these IF they are used with the implication they have a precise (perhaps technically precise) meaning. I do not believe they have a precise meaning. But I do use the term “superintelligent” a lot anyway. My reason for doing that is because I only use it as an overview word—it is just supposed to be a loose category that includes a bunch of more specific issues. I only really want to convey the particular issues—the particular ways in which the intelligence of the AI might be less than adequate, for example.
That is only important if we find ourselves debating whether it might be clever, wise, or intelligent … I wouldn’t want to get dragged into that, because I only really care about specifics.
For example: does the AI make a habit of forming plans that massively violate all of its background knowledge about the goal that drove the plan? If it did, it would (1) take the baby out to the compost heap when what it intended to do was respond to the postal-chess game it is engaged in, or (2) cook the eggs by going out to the workshop and making a cross-cutting jig for the table saw, or (3) … and so on. If we decided that the AI was indeed prone to errors like that, I wouldn’t mind if someone diagnosed a lack of ‘intelligence’ or a lack of ‘wisdom’ or a lack of … whatever. I merely claim that in that circumstance we have evidence that the AI hasn’t got what it takes to impose its will on a paper bag, never mind exterminate humanity.
Now, my attacks on the scenarios have to do with a bunch of implications for what the AI (the hypothetical AI) would actually do. And it is that ‘bunch’ that I think add up to evidence for what I would summarize as ‘dumbness’.
And, in fact, I usually go further than that and say that if someone tried to get near to an AI design like that, the problems would arise early on and the AI itself (inasmuch as it could do anything smart at all) would be involved in the efforts to suggest improvements. This is where we get the suggestions in your item 2, about the AI ‘recognizing’ misalignments.
I suspect that on this score a new paper is required, to carefully examine the whole issue in more depth. In fact, a book.
I am now decided that that has to happen.
So perhaps it is best to put the discussion on hold until a seriously detailed technical book comes out of me? At any rate, that is my plan.
So perhaps it is best to put the discussion on hold until a seriously detailed technical book comes out of me? At any rate, that is my plan.
That seems like a solid approach. I do suggest that you try to look deeply into whether or not it’s possible to partially solve the problem of understanding goals, as I put it above, and make that description of why that is or isn’t possible or likely long and detailed. As you point out, that likely requires book-length attention.
Loosemore, Yudkowsky, and myself are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating.
If that is supposed to be a universal or generic AI, it is a valid criticism to point out that not all AIs are like that.
If that is supposed to be a particular kind of AI, it is a valid criticism to point out that no realistic AIs are like that.
You seem to feel you are not being understood, but what is being said is not clear.
1 Whether or not “superintelligent” is a meaningful term in this context
“Superintelligence” is one of the clearer terms here, IMO. It just means more than human intelligence, and humans can notice contradictions.
This comment seems to be part of a concern about “wisdom”, assumed to be some extraneous thing an AI would not necessarily have. (No one but Vaniver has brought in wisdom.) The counterargument is that compartmentalisation between goals and instrumental knowledge is an extraneous thing an AI would not necessarily have, and that its absence is all that is needed for a contradiction to be noticed and acted on.
2 Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues.
It’s an assumption, that needs justification, that any given AI will have goals of a non-trivial sort. “Goal” is a term that needs tabooing.
Neither Yudkowsky nor I think either of those are reasonable to expect—as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy “our” goals.
While we are anthropomorphising, it might be worth pointing out that humans don’t show behaviour patterns of relentlessly pursuing arbitrary goals.
I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment …
Loosemore has put forward a simple suggestion, which MIRI appears not to have considered at all, that on encountering a contradiction, an AI could lapse into a safety mode, if so designed.
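A minimal sketch of what that design choice might look like (illustrative Python; `predict_consequences` and `contradicts_background_knowledge` are placeholders for whatever inference machinery a real system would have):

```python
# Illustrative sketch of "lapse into a safety mode on contradiction".
# The two predicates passed in are placeholders, not real components.

def decide(goal, plan, predict_consequences, contradicts_background_knowledge):
    consequences = predict_consequences(plan)
    if contradicts_background_knowledge(goal, consequences):
        # Safety mode: take no irreversible action, flag the conflict, wait for humans.
        return {"status": "safe_mode",
                "reason": "predicted consequences conflict with what is known about the goal"}
    return {"status": "execute", "plan": plan}
```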
3 …sees cleverness and wisdom as closely tied together
You are paraphrasing Loosemore to sound less technical and more handwaving than his actual comments. The ability to sustain contradictions in a system that is constantly updating itself isn’t a given: it requires an architectural choice in favour of compartmentalisation.
All this talk of contradictions is sort of rubbing me the wrong way here. There’s no “contradiction” in an AI having goals that are different to human goals. Logically, this situation is perfectly normal. Loosemore talks about an AI seeing its goals are “massively in contradiction to everything it knows about”, but… where’s the contradiction? What’s logically wrong with getting strawberries off a plant by burning them?
I don’t see the need for any kind of special compartmentalisation; information about “normal use of strawberries” is already inert facts with no caring attached by default.
If you’re going to program in special criteria that would create caring about this information, okay, but how would such criteria work? How do you stop it from deciding that immortality is contradictory to “everything it knows about death” and refusing to help us solve aging?
In the original scenario, the contradiction is supposed to be between a hardcoded definition of happiness in the AI’s goal system, and inferred knowledge in the execution system.
I’m puzzled. Can you explain this in terms of the strawberries example? So, at what point was it necessary for the AI to examine its code, and why would it go through the sequence of thoughts you describe?
Unfortunately its goal unpacking mechanism leads it to the CERTAIN conclusion that it must use a flamethrower to do this. The predicted consequence, however, is that the picked strawberries will be just smears of charcoal, when they are delivered to the kitchen. Here is the thing: the AI has background knowledge about everything in the world, including strawberries, and it also hears the protests from the people in the kitchen when it says it is going to use the flamethrower. There is massive evidence, coming from all that external information, that the plan is just wrong, regardless of how certain its planning mechanism said it was.
So, in order for the flamethrower to be the right approach, the goal needs to be something like “separate the strawberries from the plants and place them in the kitchen,” but that won’t quite work—why is it better to use a flamethrower than pick them normally, or cut them off, or so on? One of the benefits of the Maverick Nanny or the Smiley Tiling Berserker as examples is that they obviously are trying to maximize the stated goal. I’m not sure you’re going to get the right intuitions about an agent that’s surprisingly clever if you’re working off an example that doesn’t look surprisingly clever.
So, the Gardener AI gets that task, comes up with a plan, and says “Alright! Warming up the flamethrower!” The chef says “No, don’t! I should have been more specific!”
Here is where the assumptions come into play. If we assume that the Gardener AI executes tasks, then even though the Gardener AI understands that the chef has made a terrible mistake, and that’s terrible for the chef, that doesn’t stop the Gardener AI from having a job to do, and doing it. If we assume that the Gardener AI is designed to figure out what the chef wants, and then do what they want, then knowing that the chef has made a terrible mistake is interesting information to the Gardener AI. In order to say that the plan is “wrong,” we need to have a metric by which we determine wrongness. If it’s the task-completion-nature, then the flamethrower plan might not be task-completion-wrong!
Even without feedback from the chef, we can just use other info the AI plausibly has. In the strawberry example, the AI might know that kitchens are where cooking happens, and that when strawberries are used in cooking, the desired state is generally “fresh,” not “burned,” and the temperature involved in cooking them is mild, and so on and so on. And so if asked to speculate about the chef’s motives, the AI might guess that the chef wants strawberries in order to use them in food, and thus the chef would be most satisfied with fresh and unburnt strawberries.
But whether or not the AI takes its speculations about the chef’s motives into account when planning is a feature of the AI, and by default, it is not included. If it is included, it’s nontrivial to do it correctly—this is the “if you care about your programmer’s mental states, and those mental states physically exist and can be edited directly, why not just edit them directly?” problem.
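To make the contrast concrete, here is a toy sketch (the scoring functions and numbers are invented, not anyone’s actual proposal): the two designs differ only in what the planner is asked to maximize.

```python
# Toy contrast between the two designs discussed above. Scores are invented.

def plan_task_executor(candidates, completes_task):
    # Design 1: any plan that completes the literal task is as good as any other.
    return max(candidates, key=completes_task)

def plan_intent_modeller(candidates, completes_task, satisfies_inferred_intent):
    # Design 2: what the requester is inferred to want is part of the objective.
    return max(candidates,
               key=lambda p: completes_task(p) + satisfies_inferred_intent(p))

plans = ["pick by hand", "cut with scissors", "use flamethrower"]
completes = {"pick by hand": 1.0, "cut with scissors": 1.0, "use flamethrower": 1.0}
intent    = {"pick by hand": 1.0, "cut with scissors": 0.9, "use flamethrower": 0.0}

# All three tie under design 1, so the flamethrower is not "task-completion-wrong".
print(plan_task_executor(plans, completes.get))
# Design 2 ranks the flamethrower last.
print(plan_intent_modeller(plans, completes.get, intent.get))
```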
I agree that I didn’t spend much time coming up with the strawberry-picking-by-flamethrower example. So, yes, not very accurate (I only really wanted a quick and dirty example that was different).
But but but. Is the argument going to depend on me picking a better example where I can write down the “twisted rationale” that the AI deploys to come up with its plan? Surely the only important thing is that the AI does, somehow, go through a twisted rationale—and the particular details of the twisted rationale are not supposed to matter.
(Imagine that I tried giving Muehlhauser a list of the ways that the logical reasoning behind the dopamine drip plan is so ludicrous that even the simplest AI planner of today would never make THAT mistake … he would just tell me that I was missing the point, because this is supposed to be an IN PRINCIPLE argument in which the dopamine drip plan stands for some twisted rationale that is non-trivial to get around. From that point of view the actual example is less important than the principle.)
Now to the second part.
The problem I have with everything you wrote after
Here is where the assumptions come into play....
is that you have started to go back to talking about the particulars of the AI’s planning mechanism once again, losing sight of the core of the argument I gave in the paper, which is one level above that.
However, you also say “wrong” things about the AI’s planning mechanism as well, so now I am tempted to reply on both levels. Ah well, at risk of confusing things I will reply to both levels, trying to separate them as much as possible.
Level One (Regarding the design of the AI’s planning/goal/motivation engine).
You say:
In order to say that the plan is “wrong,” we need to have a metric by which we determine wrongness. If it’s the task-completion-nature, then the flamethrower plan might not be task-completion-wrong!
One thing I have said many many times now is that there is no problem at all finding a metric for “wrongness” of the plan, because there is a background-knowledge context that is screaming “Inconsistent with everything I know about the terms mentioned in the goal statement!!!!”, and there is also a group of humans screaming “We believe that this is inconsistent with our understanding of the goal statement!!!”
I don’t need to do anything else to find a metric for wrongness, and since the very first draft of the paper that concept has been crystal clear. I don’t need to invoke anything else—no appeal to magic, no appeal to telepathy on behalf of the AI, no appeal to fiendishly difficult programming inside the AI, no appeal to the idea that the programmers have to nail down every conceivable way that their intentions might be misread … all I have to do is appeal to easily-available context, and my work is done. The wrongness metric has been signed, sealed and delivered all this time.
You hint that the need for “task completion” might be so important to the AI that this could override all other evidence that the plan is wrong. No way. That comes under the heading of a joker that you pulled out of your sleeve :-), in much the same way that Yudkowsky and others have tried to pull the ‘efficiency’ joker out of their sleeves, from nowhere, and imply that this joker could for some reason trump everything else. If there is a slew of evidence coming from context, that the plan will lead to consequences that are inconsistent with everything known about the concepts mentioned in the goal statement, then the plan is ‘wrong’, and tiny considerations such as that task-completion would be successful, are just insignificant.
You go on to suggest that whether the AI planning mechanism would take the chef’s motives into account, and whether it would be nontrivial to do so …. all of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff, and all that is required is a sanity check that says “Does the plan seem to be generally consistent with the largest-context understanding of the world, as it relates to the concepts in the goal statement?” and we’re done. All wrapped up.
Level Two (The DLI)
None of the details of what I just said really need to be said, because the DLI is not about trying to get the motivation engine programmed so well that it covers all bases. It is about what happens inside the AI when it considers context, and THEN asks itself questions about its own design.
And here, I have to say that I am not getting substantial discussion about what I actually argued in the paper. The passage of mine that you were addressing, above, was supposed to be a clarification of someone else’s lack of focus on the DLI. But it didn’t work.
The DLI is about the fact that the AI has all that evidence that its plans are leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. And yet the AI is designed to go ahead anyway. If it DOES go ahead it is obeying the DLI. But at the same time it knows that it is fallible and that this fallibility is what is leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. That conflict is important, and yet no one wants to go there and talk about it.
I have to say that I am not getting substantial discussion about what I actually argued in the paper.
The first reason seems to be clarity. I didn’t get what your primary point was until recently, even after carefully reading the paper. (Going back to the section on DLI, context, goals, and values aren’t mentioned until the sixth paragraph, and even then it’s implicit!)
The second reason seems to be that there’s not much to discuss, with regards to the disagreement. Consider this portion of the parent comment:
You go on to suggest that whether the AI planning mechanism would take the chef’s motives into account, and whether it would be nontrivial to do so …. all of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff
I think my division between cleverness and wisdom at the end of this long comment clarifies this issue. Taking context into account is not necessarily the bread and butter of a clever system; many fiendishly clever systems just manipulate mathematical objects without paying any attention to context, and those satisfy human goals only because the correct mathematical objects have been carefully selected for them to manipulate. But I agree with you that taking context into account is the bread and butter of a wise system. There’s no way for a wise system to manipulate conceptual objects without paying attention to context, because context is a huge part of concepts.
It seems like everyone involved agrees that a human-aligned superwisdom is safe, even if it’s also superclever: as Ged muses about Ogion in A Wizard of Earthsea, “What good is power when you’re too wise to use it?”
Which brings us to:
That conflict is important, and yet no one wants to go there and talk about it.
I restate the conflict this way: an AI that misinterprets what its creators meant for it to do is not superwise. Once we’ve defined wisdom appropriately, I think everyone involved would agree with that, and would agree that talking about a superwise AI that misinterprets what its creators meant for it to do is incoherent.
But… I don’t see why that’s a conflict, or important. The point of MIRI is to figure out how to develop human-aligned superwisdom before someone develops supercleverness without superwisdom, or superwisdom without human-alignment.
The main conflicts seem to be that MIRI is quick to point out that specific designs aren’t superwise, and that MIRI argues that AI designs in general aren’t superwise by default. But I don’t see how stating that there is inherent wisdom in AI by virtue of it being a superintelligence is a meaningful response to their assumption that there is no inherent wisdom in AI except for whatever wisdom has been deliberately designed. That’s why they care so much about deliberately designing wisdom!
The issue here is that you’re thinking in terms of “Obvious Failure Modes”. The danger doesn’t come from obvious failures, it comes from non-obvious failures. And the smarter the AI, the less likely it is that the insane solutions it comes up with are anything we’d even think to try to prevent; we lack the intelligence, which is why we want to build a better one. “I’ll use a flamethrower” is the sort of hare-brained scheme a -dumb- person might come up with, in particular in view of the issue that it doesn’t solve the actual problem. The issue here isn’t “It might do something stupid.” The issue is that it might do something terribly, terribly clever.
If you could anticipate what a superintelligence would do to head off issues, you don’t need to build the superintelligence in the first place, you could just anticipate what it would do to solve the problem; your issue here is that you think that you can outthink a thing you’ve deliberately built to think better than you can.
There is nothing in my analysis, or in my suggestions for a solution, that depends on the failure modes being “obvious” (and if you think so, can you present and dissect the argument I gave that implies that?).
Your words do not connect to what I wrote. For example, when you say:
And the smarter the AI, the less likely it is that the insane solutions it comes up with are anything we’d even think to try to prevent.
… that misses the point completely, because in everything I said I emphasized that we absolutely do NOT need to “think to try to prevent” the AI from doing specific things. Trying to be so clever about the goal statement, second-guessing every possible misinterpretation that the AI might conceivably come up with …. that sort of strategy is what I am emphatically rejecting.
And when you talk about how the AI
might do something terribly, terribly clever.
… that remark exists in a vacuum completely outside the whole argument I gave in the paper. It is almost as if I didn’t write anything beyond a few remarks in the introduction. I am HOPING that the AI does lots of stuff that is terribly terribly clever! The more the merrier!
So, in you last comment:
your issue here is that you think that you can outthink a thing you’ve deliberately built to think better than you can.
… I am left totally perplexed. Nothing I said in the paper implied any such thing.
There is nothing in my analysis, or in my suggestions for a solution, that depends on the failure modes being “obvious” (and if you think so, can you present and dissect the argument I gave that implies that?).
Your “Responses to Critics of the Doomsday Scenarios” (which seems incorrectly named as the header for your responses). You assume, over and over again, that the issue is logical inconsistency—an obvious failure mode. You hammer on logical inconsistency.
… that misses the point completely, because in everything I said I emphasized that we absolutely do NOT need to “think to try to prevent” the AI from doing specific things. Trying to be so clever about the goal statement, second-guessing every possible misinterpretation that the AI might conceivably come up with …. that sort of strategy is what I am emphatically rejecting.
You have some good points. Yanking out motivation, so the AI doesn’t do things on its own, is a perfect solution to the problem of an insane AI. Assuming a logically consistent AI won’t do anything bad because bad is logically inconsistent? That is not a perfect solution, and isn’t actually demonstrated by anything you wrote.
… that remark exists in a vacuum completely outside the whole argument I gave in the paper. It is almost as if I didn’t write anything beyond a few remarks in the introduction. I am HOPING that the AI does lots of stuff that is terribly terribly clever! The more the merrier!
You didn’t -give- an argument in the paper. It’s a mess of unrelated concepts. You tried to criticize, in one go, the entire body of work of criticism of AI, without pausing at any point to ask whether or not you actually understood the criticism. You know the whole “genie” thing? That’s not an argument about how AI would behave. That’s a metaphor to help people understand that the problem of achieving goals is non-trivial, that we make -shitloads- of assumptions about how those goals are to be achieved that we never make explicit, and that the process of creating an engine to achieve goals without going horribly awry is -precisely- the process of making all those assumptions explicit.
And in response to the problem of -making- all those assumptions explicit, you wave your hand, and declare the problem solved, because the genie is fallible and must know it.
That’s not an answer. Okay, the genie asks some clarifying questions, and checks its solution with us. Brilliant! What a great solution! And ten years from now we’re all crushed to death by collapsing cascades of stacks of neatly-packed boxes of strawberries because we answered the clarifying questions wrong.
Fallibility isn’t an answer. You know -you’re- capable of being fallible—if you, right now, knew how to create your AI, who would -you- check with to make sure it wouldn’t go insane and murder everybody? Or even just remain perfectly sane and kill us because we accidentally asked it to?
… I am left totally perplexed. Nothing I said in the paper implied any such thing.
Yes, yes it did. Fallibility only works if you have a higher authority to go to. Fallibility only works if the higher authority can check your calculations and tell you whether or not it’s a good idea, or at least answer any questions you might have.
See, my job involves me being something of a genie; I interact with people who have poor understanding of their requirements on a daily basis, where I myself have little to no understanding of their requirements, and must ask them clarifying questions. If they get the answer wrong, and I implement that? People could die. “Do nothing” isn’t an option; why have me at all if I do nothing? So I implement what they tell me to do, and hope they answer correctly. I’m the fallible genie, and I hope my authority is infallible.
You don’t get to have fallibility in what you’re looking for, because you don’t have anybody who can actually answer its questions correctly.
Well, the problem here is a misunderstanding of my claim.
(If I really were claiming the things you describe in your above comment, your points would be reasonable. But there is such a strong misunderstanding that your points are hitting a target that, alas, is not there.)
There are several things that I could address, but I will only have time to focus on one. You say:
Assuming a logically consistent AI won’t do anything bad because bad is logically inconsistent?
No. A hundred times no :-). My claim is not even slightly that “a logically consistent AI won’t do anything bad because bad is logically inconsistent”.
The claim is this:
1) The entire class of bad things that these hypothetical AIs are supposed to be doing are a result of the AI systematically (and massively) ignoring contextual information.
(Aside: I am not addressing any particular bad things, on a case-by-case basis, I am dealing with the entire class. As a result, my argument is not vulnerable to charges that I might not be smart enough to guess some really-really-REALLY subtle cases that might come up in the future.)
2) The people who propose these hypothetical AIs have made it absolutely clear that (a) the AI is supposed to be fully cognizant of the fact that the contextual information exists (so the AI is not just plain ignorant), but at the same time (b) the AI does not or cannot take that context into account, but instead executes the plan and does the bad thing.
3) My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility).
If you look carefully at that argument you will see that it does not make the claim that
Assuming a logically consistent AI won’t do anything bad because bad is logically inconsistent
I never said that. The logical inconsistency was not in the ‘bad things’ part of the argument. Completely unrelated.
1) The entire class of bad things that these hypothetical AIs are supposed to be doing are a result of the AI systematically (and massively) ignoring contextual information.
Not acting upon contextual information isn’t the same as ignoring it.
2) The people who propose these hypothetical AIs have made it absolutely clear that (a) the AI is supposed to be fully cognizant of the fact that the contextual information exists (so the AI is not just plain ignorant), but at the same time (b) the AI does not or cannot take that context into account, but instead executes the plan and does the bad thing.
The AI knows, for example, that certain people believe that plants are morally relevant entities—is it possible for it to pick strawberries at all? What contextual information is relevant, and what contextual information is irrelevant? You accuse the “infallible” AI of ignoring contextual information—but you’re ignoring the magical leap of inference you’re taking when you elevate the concerns of the chef over the concerns of the bioethicist who thinks we shouldn’t rip reproductive organs off plants in the first place.
3) My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility).
The issue is that fallibility doesn’t -imply- anything. I think this is the best course of action. I’m fallible. I still think this is the best course of action. The fallibility is an unnecessary and pointless step—it doesn’t change my behavior. Either the AI depends upon somebody else, who is treated as an infallible agent—or it doesn’t.
I never said that. The logical inconsistency was not in the ‘bad things’ part of the argument. Completely unrelated.
Then we’re in agreement that insane-from-an-outside-perspective behaviors don’t require logical inconsistency?
Sorry, I cannot put any more effort into this. Your comments show no sign of responding to the points actually made (either in the paper itself, or in my attempts to clarify by responding to you).
I find that when I talk about this issue with people who clearly have expert knowledge of AI (including the people who came to the AAAI symposium at Stanford last year, and all of the other practising AI builders who are my colleagues), the points I make are not only understood but understood so clearly that they tell me things like “This is just obvious, really, so all you are doing is wasting your time trying to convince a community that is essentially comprised of amateurs” (That is a direct quote from someone at the symposium).
I always want to make myself as clear as I can. I have invested a lot of my time trying to address the concerns of many people who responded to the paper. I am absolutely sure I could do better.
We’re all amateurs in the field of AI, it’s just that some of us actually know it. Seriously, don’t pull the credentials card. I’m not impressed. I know exactly how “hard” it is to pay the AAAI a hundred and fifty dollars a year for membership, and three hundred dollars to attend their conference. Does claiming to have spent four hundred and fifty dollars make you an expert? What about bringing up that it’s in “Stanford”? What about insulting everybody you’re arguing with?
I’m a “practicing AI builder”—what a nonsense term—although my little heuristics engine is actually running in the real world, processing business data and automating hypothesis elevation work for humans (who have the choice of agreeing with its best hypothesis, selecting among its other hypotheses, or entering their own) - that is, it’s actually picking strawberries.
Moving past tit-for-tat on your hostile introduction paragraph, I don’t doubt your desire to be clear. But you have a conclusion you’re very obviously trying to reach, and you leave huge gaps on your way to get there. The fact that others who want to reach the same conclusion overlook the gaps doesn’t demonstrate anything. And what’s your conclusion? That we don’t have to worry about poorly-designed AI being dangerous, because… contextual information, or something. Honestly, I’m not even sure anymore.
Then you propose a model, which you suggest has been modeled after the single most dangerous brain on the planet—as proof that it’s safe! Seriously.
As for whether you could do better? No, not in your current state of mind. Your hubris prevents you from doing better. You’re convinced you know better than any of the people you’re talking with, and they’re ignorant amateurs.
When someone repeatedly distorts and misrepresents what is said in a paper, then blames the author of the paper for being unclear … then hears the author carefully explain the distortions and misrepresentations, and still repeats them without understanding ….
Because that was the practical result, not the problem itself, which is that the conversation wasn’t going anywhere, and he didn’t seem interested in it going anywhere.
My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility).
What does incoherent mean, here?
If it just labels the fact that it has inconsistent beliefs, then it is true but unimpactful… humans can also hold contradictory beliefs and still be intelligent enough to be dangerous.
If it means something amounting to “impossible to build”, then it would be highly impactful… but there is no good reason to think that that is the case.
You’re right to point out that “incoherent” covers a multitude of sins.
I really had three main things in mind.
1) If an AI system is proposed which contains logically contradictory beliefs located in the most central, high-impact area of its system, it is reasonable to ask how such an AI can function when it allows both X and not-X to be in its knowledge base. I think I would be owed at least some variety of explanation as to why this would not cause the usual trouble when systems try to do logic in such circumstances. So I am saying “This design that you propose is incoherent because you have omitted to say how this glaring problem is supposed to be resolved.”
(Yes, I’m aware that there are workarounds for contradictory beliefs, but those ideas are usually supposed to apply to pretty obscure corners of the AI’s belief system, not to the component that is in charge of the whole shebang).
2) If an AI perceives itself to be wired in such a way that it is compelled to act as if it was infallible, while at the same time knowing that it is both fallible AND perpetrating acts that are directly caused by its failings (for all the aforementioned reasons that we don’t need to re-argue), then I would suggest that such an AI would do something about this situation. The AI, after all, is supposed to be “superintelligent”, so why would it not take steps to stop this immensely damaging situation from occurring?
So in this case I am saying: “This hypothetical superintelligence has an extreme degree of knowledge about its own design, but it is tolerating a massive and damaging contradiction in its construction without doing anything to resolve the problem: it is incoherent to suggest that such a situation could arise without explaining why the AI tolerates the contradiction and fails to act”
(Aside: you mention that humans can hold contradictory beliefs and still be intelligent enough to be dangerous. Arguing from the human case would not be valid because in other areas of this debate I have been told repeatedly not to accidentally generalize and “assume” that the AI would do something just because humans do something. Now, I actually don’t commit the breaches I am charged with (I claim!) (and that is an argument for another day), but I consider the problem of accidental anthropomorphism to be real, so we should not do that here).
3) Lastly, I can point to the fact that IF the hypothetical AI can engage in this kind of bizarre situation where it compulsively commits action X, while knowing that its knowledge of the world indicates that the consequences will strongly violate the goals that were supposed to justify X, THEN I am owed an explanation for why this type of event does not occur more often. Why is it that the AI does this only when it encounters a goal such as “make humans happy”, and not in a million other goals? Why are there not bizarre plans (which are massively inconsistent with the source goal) all the time?
So in this case I would say: “It is incoherent to suggest an AI design in which a drastic inconsistency of this sort occurs in the case of the “maximize human happiness” goal, but where it doesn’t occur all over the AI’s behavior. In particular I am owed an explanation for why this particular AI is clever enough to be a threat, since it might be expected to have been doing this sort of thing throughout its development, and in that case I would expect it to be so stupid that it would never have made it to superintelligence in the first place.”
Those are the three main areas in which the design would be incoherent … i.e. would have such glaring, unbelievable gaps in the design that those gaps would need to be explained before the hypothetical AI could become at all believable.
What you need to do is address the topic carefully, and eliminate the ad hominem comments like this:
You may be suffering from a bad case of the Doctrine of Logical Infallibility, yourself.
… which talk about me, the person discussing things with you.
I will now examine the last substantial comment you wrote, above.
Is the Doctrine of Logical Infallibility Taken Seriously?
No, it’s not. The Doctrine of Logical Infallibility is indeed completely crazy, but Yudkowsky and Muehlhauser (and probably Omohundro, I haven’t read all of his stuff) don’t believe it’s true. At all.
This is your opening topic statement. Fair enough.
Yudkowsky believes that a superintelligent AI programmed with the goal to “make humans happy” will put all humans on dopamine drip despite protests that this is not what they want, yes.
You are agreeing with what I say on this point, so we are in agreement so far.
However, he doesn’t believe the AI will do this because it is absolutely certain of its conclusions past some threshold; he doesn’t believe that the AI will ignore the humans’ protests, or fail to update its beliefs accordingly.
You make three statements here, but I will start with the second one:
… he doesn’t believe that the AI will ignore the humans’ protests, …
This is a contradiction of the previous paragraph, where you said “Yudkowsky believes that a superintelligent AI [...] will put all humans on dopamine drip despite protests that this is not what they want”.
Your other two statements are that Yudkowsky is NOT saying that the AI will do this “because it is absolutely certain of its conclusions past some threshold”, and he is NOT saying that the AI will “fail to update its beliefs accordingly”.
In the paper I have made a precise statement of what the “Doctrine of Logical Infallibility” means, and I have given references to show that the DLI is a summary of what Yudkowsky et al have been claiming. I have then given you a more detailed explanation of what the DLI is, so you can have it clarified as much as possible.
If you look at every single one of the definitions I have given for the DLI you will see that they are all precisely true of what Yudkowsky says. I will now itemize the DLI into five components so we can find which component is inconsistent with what Yudkowsky has publicly said.
1) The AI decides to do action X (forcing humans to go on a dopamine drip). Everyone agrees that Yudkowsky says this.
2) The AI knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X (where goal statement Y was something like “maximize human happiness”).
This is a point that you and others repeatedly misunderstand or misconstrue, so before you respond to it, let me give details of the “converging evidence” that the AI will be getting:
(a) Screams of protest from humans. “Screams of protest” are generally understood by all knowledgeable intelligent systems as evidence of extreme unhappiness, and evidence of extreme unhappiness is evidence that the goal “maximize human happiness” is not being fulfilled.
(b) Verbalizations from humans that amount to “I am begging you not to do this!”. Such verbalizations are, again, usually considered to be evidence of extreme unhappiness caused by the possibility that ‘this’ is going to be perpetrated.
(c) Patient explanations by the humans that, even though dopamine induced ‘happiness’ might seem to maximize human happiness, the concept of ‘happiness’ exists only by reference to the complete array of desires expressed by humans, and there are many other aspects of happiness not being considered, which trump the dopamine plan. Once again, these patient explanations are a direct statement of the inconsistency of the dopamine plan and real human happiness.
I could probably add to this list continuously, for several days, to document the sum total of all the evidence that the AI would be bombarded with, all pointing to the fact that the dopamine drip plan would be inconsistent with both its accumulated general knowledge about ‘happiness’, and the immediate evidence coming from the human population at that point.
Now, does Yudkowsky believe that the AI will know about this evidence? I have not seen one single denial, by him or any of the others, that the AI will indeed be getting this evidence, and that it will understand that evidence completely. And, on the other hand, most people who read my paper agree that it is quite clear, in the writings of Yudkowsky et al, that they do, positively, agree that the AI will know that this evidence of conflict exists. So this part of the definition of the DLI is also accepted by everyone.
3) If a goal Y leads to a proposed plan, X, that is supposed to achieve goal Y, and yet there is “massive converging evidence” that this plan will lead to a situation that is drastically inconsistent with everything that the AI understands about the concepts referenced in the goal Y, this kind of massive inconsistency would normally be considered as grounds for supposing that there has been a failure in the mechanism that has caused the AI to propose plan X.
Without exception, every AI programmer that I know, who works on real systems, has agreed that it is hard to imagine a clearer indication that something has gone wrong with the mechanism—either a run-time error of some kind, or a design-time programming error. These people (every one of them) go further and say that one of the most important features of ANY control system in an AI is that when it comes up with a candidate plan to satisfy goal Y it must do some sanity checks to see if the candidate plan is consistent with everything it knows about the goal. If those sanity checks start to detect the slightest inconsistency, the AI will investigate in more depth … and if the AI uncovers the kind of truly gigantic inconsistency between its background knowledge and the proposed plan that we have seen in the above, the AI would take the most drastic action possible to cease all activities and turn itself in for a debugging.
This fact about candidate plans and sanity checks for consistency is considered so elementary that most AI programmers laugh at the idea that anyone could be so naive as to think of disagreeing with it. We can safely assume, then, that Yudkowsky is aware of this (indeed, as I wrote in the paper, he has explicitly said that he thinks this sanity-checking mechanism would be a good idea), so this third component of the DLI definition is also agreed by everyone.
4) In addition to the safe-mode reaction just described in (3), the superintelligent AI being proposed by Yudkowsky and the others would be fully aware of the limitations of all real AI motivation engines, so it would know that a long chain of reasoning from a goal statement to a proposed action plan COULD lead to a proposed plan that was massively inconsistent with both the system’s larger understanding of the meaning of the terms in the goal statement, and with immediate evidence coming in from the environment at that point.
This knowledge of the AI, about the nature of its own design, is also not denied by anyone. To deny it would be to say that the AI really did not know very much about the design of AI systems—a preposterous idea, since this is supposed to be a superintelligent system that has already been involved in its own redesign, and which is often assumed to be so intelligent that it can understand far more than all of the human race, combined.
So, when the AI’s planning (goal & motivation) system sees the massive inconsistency between its candidate plan X and the terms used in the goal statement Y, it will (in addition to automatically putting itself into safe mode and calling for help) know that this kind of situation could very well be a result of those very limitations.
In other words: the superintelligent AI will know that it is fallible.
I have not seen anyone disagree with this, because it is such an elementary corollary of other known facts about AI systems that it is almost self-evident. So, once again, this component of the DLI definition is not disputed by anyone.
5) Yudkowsky and the others state, repeatedly and in the clearest possible terms, that in spite of all of the above the superintelligent AI they are talking about would NOT put itself into safe mode, as per item 3 above, but would instead insist that ‘human happiness’ was defined by whatever emerged from its reasoning engine, and so it would go ahead and implement the plan X.
Now, the definition—please note, the DEFINITION—of the idea that the postulated AI is following a “Doctrine of Logical Infallibility” is that the postulated AI will do what is described in item (5) above, and NOT do what is described in item (4) above.
This is logically identical to the statement that the postulated AI will behave toward its planning mechanism (which includes its reasoning engine, since it needs to use the latter in the course of unpacking its goals and examining candidate plans) as if that planning mechanism is “infallible”, because it will be giving absolute priority to the output of that mechanism and NOT giving priority to the evidence coming from the consistency-checking mechanism, which is indicating that a failure of some kind has occurred in the planning mechanism.
I do not know why the AI would do this—it is not me who is proposing that it would—but the purpose of the DLI is to encapsulate the proposal made by Yudkowsky and others, to the effect that SOMETHING in the AI makes it behave that way. That is all the DLI is: if an AI does what is described in (5), but not what is described in (4), and it does this in the context of (1), (2) and (3), then by definition it is following the DLI.
What Yudkowsky believes is that the AI will understand perfectly well that being put on dopamine drip isn’t what its programmers wanted. It will understand that its programmers now see its goal of “make humans happy” as a mistake. It just won’t care, because it hasn’t been programmed to want to do what its programmers desire, it’s been programmed to want to make humans happy; therefore it will do its very best, in its acknowledged fallibility, to make humans happy. The AI’s beliefs will change as it makes observations, including the observation that human beings are very unhappy a few seconds before being forced to be extremely happy until the end of the universe, but this will have little effect on its actions, because its actions are caused by its goals and whatever beliefs are relevant to this goal.
All assuming that the AI won’t update its goals even if it realizes there is some mistake. That isn’t obvious, and in fact is hard to defend.
An AI that is powerful and effective would need to seek the truth about a lot of things, since an entity that has contradictory beliefs will be a poor instrumental rationalist. But would its goal of truth-seeking necessarily be overridden by other goals… would it know but not care?
It might be possible to build an AI that didn’t care about interpreting its goals correctly.
It looks like you would need to engineer a distinction between instrumental beliefs and terminal beliefs. Remember that the terminal/instrumental distinction is conceptual, not a law of nature. (While we’re on the subject, you might need a firewall to stop an AI acting on intrinsically motivating ideas, if they exist.)
In any case, orthogonality is an architecture choice, not an ineluctable fact about minds.
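A toy sketch of that architecture choice (illustrative Python; the class and field names are invented): whether the goal representation can be revised by the system’s own learning updates is something the designer decides, not something that comes for free.

```python
# Illustrative only: "goal rigidity" as a design decision rather than a default.

class RigidGoalAgent:
    """Goal representation is firewalled from the learning machinery."""
    def __init__(self, goal):
        self._goal = goal
        self.beliefs = {}

    def update(self, evidence):
        self.beliefs.update(evidence)   # beliefs change; the goal never does


class CorrigibleGoalAgent:
    """The same learning machinery is allowed to revise the goal itself."""
    def __init__(self, goal):
        self.goal = goal
        self.beliefs = {}

    def update(self, evidence):
        self.beliefs.update(evidence)
        if evidence.get("goal_interpretation_is_mistaken"):
            self.goal = evidence["revised_goal"]
```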
MIRI’s critics, Loosemore, Hibbard and so on are tacitly assuming architectures without such unupdateability and firewalling.
MIRI needs to show that such an architecture is likely to occur, either as a design or a natural evolution. If AIs with unupdateable goals are dangerous, as MIRI says, it would be simplest not to use that architecture… if it can be avoided. (“We also agree with Yudkowsky (2008a), who points out that research on the philosophical and technical requirements of safe AGI might show that broad classes of possible AGI architectures are fundamentally unsafe, suggesting that such architectures should be avoided.”) In other words, it would be careless to build a genie that doesn’t care.
If the AI community isn’t going to deliberately build the goal-rigid kind of AI, then MIRI’s arguments come down to how it might be a natural or convergent feature… and the wider AI community finds the goal-rigid idea so unintuitive that it fails to understand MIRI, who in turn fail to make it explicit enough.
When Loosemore talks about the Doctrine of Logical Infallibility, he is supposing there must be some reason why an AI wouldn’t update certain things… he doesn’t see goal unupdateability as an obvious default.
There are a number of points that can be made against the inevitability of goal rigidity.
For one thing, humans don’t show any sign of maintaining lifelong stable goals. (Talk of utility functions as if they were real things disguises this point.)
For another, important classes of real-world AIs don’t have that property. The goal, in a sense, of a neural network is to get positive reinforcement and avoid negative reinforcement.
For another, the desire to preserve goals does not imply the ability to preserve goals.
In particular, all intelligent entities likely face a trade-off between self-modifying for improvement and maintaining their goals. An AI might be able to keep its goals stable by refusing to learn or self-modify, but that kind of stick-in-the-mud is also less threatening because less powerful.
The Orthogonality Thesis is sometimes put forward to support the claim that goal rigidity will occur. To a first approximation, the OT states that any combination of goals and intelligence is possible... and AIs would want to maintain their goals, right?
The devil is in the details.
There is more than one version of the orthogonality thesis. It is trivially false under some interpretations, and trivially true under others. It’s more defensible in forms asserting the compatibility of transient combinations of values and intelligence, which are not particularly relevant to AI threat arguments. It is less defensible in forms asserting stable combinations of intelligence and values, and those are the forms that are suitable to be used as a stage in an argument towards Yudkowskian UFAI.
An orthogonality claim of a kind relevant to UFAI must be one that posits the stable and continued co-existence of a set of values with a self-improving AI. The momentary co-existence of values and efficiency is not enough to spawn a Paperclipper-style UFAI. An AI that paperclips for only a nanosecond is no threat.
A learning, self-improving AI will not be able to guarantee that a given self-modification keeps its goals unchanged, since doing so involves the relatively dumber version at time T1 making an accurate prediction about the more complex version at time T2.
The claim that rigid-goal architectures are dangerous does not imply that other architectures are safe. Non-rigid systems may have the advantage of corrigibility, being in some way fixable once they have been switched on. They are likely to put a high value on truth and correctness, since that is both a multi-purpose instrumental goal and a desideratum on the part of the programmers.
But non-rigid AIs might also converge on undesirable goals, for instance evolutionary goals like self-preservation. That’s another story.
The only sense in which the “rigidity” of goals can be said to be a universal fact about minds is that it is these goals that determine how the AI will modify itself once it has become smart and capable enough to do so. It’s not a good idea to modify your goals if you want them to become reality; that seems obviously true to me, except perhaps for a small number of edge cases related to internally incoherent goals.
Your points against the inevitability of goal rigidity don’t seem relevant to this.
If you take the binary view that you’re either smart enough to achieve your goals or not, then you might well want to stop improving when you have the minimum intelligence necessary to meet them... which means, among other things, that AIs with goals requiring human or lower intelligence won’t become superhuman... which lowers the probability of the Clippie scenario. It doesn’t require huge intelligence to make paperclips, so an AI with a goal to make paperclips, but not to make any specific amount, wouldn’t grow into a threatening monster.
The probability of the Clippie scenario is also lowered by the consideration that fine-grained goals might shift during the self-improvement phase, so the Clippie scenario... arbitrary goals combined with a superintelligence... is whittled away from both ends.
No, it’s not.
The Doctrine of Logical Infallibility is indeed completely crazy, but Yudkowsky and Muehlhauser (and probably Omohundro, I haven’t read all of his stuff) don’t believe it’s true. At all.
Yudkowsky believes that a superintelligent AI programmed with the goal to “make humans happy” will put all humans on dopamine drip despite protests that this is not what they want, yes. However, he doesn’t believe the AI will do this because it is absolutely certain of its conclusions past some threshold; he doesn’t believe that the AI will ignore the humans’ protests, or fail to update its beliefs accordingly. Edited to add: By “he doesn’t believe that the AI will ignore the humans’ protests”, I mean that Yudkowsky believes the AI will listen to and understand the protests, even if they have no effect on its behavior.
What Yudkowsky believes is that the AI will understand perfectly well that being put on dopamine drip isn’t what its programmers wanted. It will understand that its programmers now see its goal of “make humans happy” as a mistake. It just won’t care, because it hasn’t been programmed to want to do what its programmers desire, it’s been programmed to want to make humans happy; therefore it will do its very best, in its acknowledged fallibility, to make humans happy. The AI’s beliefs will change as it makes observations, including the observation that human beings are very unhappy a few seconds before being forced to be extremely happy until the end of the universe, but this will have little effect on its actions, because its actions are caused by its goals and whatever beliefs are relevant to these goals.
The AI won’t think, “I don’t care, because I have come to a conclusion, and my conclusions are correct because of the Doctrine of Logical Infallibility.” It will think, “I’m updating my conclusions based on this evidence, but these conclusions don’t have much to do with what I care about”.
The whole Friendly AI thing is mostly about goals, not beliefs. It’s about picking the right goals (“Make humans happy” definitely isn’t the right goal), encoding those goals correctly (how do you correctly encode the concept of a “human being”?), and, if the first two objectives have been attained, designing the AI’s thinking processes so that once it obtains the power to modify itself, it does not want to modify its goals to be something Unfriendly.
The genie knows, but doesn’t care
Furcas, you say:
When I talked to Omohundro at the AAAI workshop where this paper was delivered, he accepted without hesitation that the Doctrine of Logical Infallibility was indeed implicit in all the types of AI that he and the others were talking about.
Your statement above is nonsensical because the idea of a DLI was ‴invented‴ precisely in order to summarize, in a short phrase, a range of absolutely explicit and categorical statements made by Yudkowsky and others, about what the AI will do if it (a) decides to do action X, and (b) knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X. Under those circumstances, the AI will ignore the massive converging evidence of inconsistency and instead it will enforce the ‘literal’ interpretation of goal statement Y.
The fact that the AI behaves in this way—sticking to the literal interpretation of the goal statement, in spite of external evidence that the literal interpretation is inconsistent with everything else that is known about the connection between goal statement Y and action X, ‴IS THE VERY DEFINITION OF THE DOCTRINE OF LOGICAL INFALLIBILITY‴
Thank you for writing this comment—it made it clearer to me what you mean by the doctrine of logical infallibility, and I think there may be a clearer way to express it.
It seems to me that you’re not getting at logical infallibility, since the AGI could be perfectly willing to act humbly about its logical beliefs, but rather value infallibility or goal infallibility. An AI does not expect its goal statement to be fallible: any uncertainty in Y can only be represented by Y being a fuzzy object itself, not in the AI evaluating Y and somehow deciding “no, I was mistaken about Y.”
In the case where the Maverick Nanny is programmed to “ensure the brain chemistry of humans resembles the state extracted from this training data as much as possible,” there is no way to convince the Maverick Nanny that it is somehow misinterpreting its goal; it knows that it is supposed to ensure perceptions about brain chemistry, and any statements you make about “true happiness” or “human rights” are irrelevant to brain chemistry, even though it might be perfectly willing to consider your advice on how to best achieve that value or manipulate the physical universe.
In the case where the AI is programmed to “do whatever your programmers tell you will make humans happy,” the AI again thinks its values are infallible: it should do what its programmers tell it to do, so long as they claim it will make humans happy. It might be uncertain about what its programmers meant, and so it would be possible to convince this AI that it misunderstood their statements, and then it would change its behavior—but it won’t be convinced by any arguments that it should listen to all of humanity, instead of its programmers.
But expressed this way, it’s not clear to me where you think the inconsistency comes in. If the AI isn’t programmed to have an ‘external conscience’ in its programmers or humanity as a whole, then their dissatisfaction doesn’t matter. If it is programmed to use them as a conscience, but the way in which it does is exploitable, then that isn’t very binding. Figuring out how to give it the right conscience / right values is the open problem that MIRI and others care about!
Which AI? As so often, an architecture-dependent issue is being treated as a universal truth.
The others mostly aren’t thinking in terms of “giving”, i.e. hardcoding, values. There is a valid critique to be made of that assumption.
This statement maps to “programs execute their code.” I would be surprised if that were controversial.
This was covered by the comment about “meta-values” earlier, and “Y being a fuzzy object itself,” which is probably not as clear as it could be. The goal management system grounds out somewhere, and that root algorithm is what I’m considering the “values” of the AI. If it can change its mind about what to value, the process it uses to change its mind is the actual fixed value. (If it can change its mind about how to change its mind, the fixedness goes up another level; if it can completely rewrite itself, now you have lost your ability to be confident in what it will do.)
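To make that “grounds out somewhere” point concrete, here is a minimal sketch, entirely my own illustrative construction and not anyone’s actual design: the sub-goals are revisable, but the criterion used to revise them is itself fixed code, and that fixed root is what ends up functioning as the system’s “values”.

```python
# Illustrative sketch only: a goal manager whose sub-goals can be revised,
# but only by a fixed root criterion. That root is the de facto "value".

class GoalManager:
    def __init__(self, subgoals):
        self.subgoals = list(subgoals)   # revisable layer

    def root_value(self, predicted_state):
        # Fixed root algorithm (hard-coded here for illustration).
        # Everything below this level of the hierarchy can change; this cannot.
        return predicted_state.get("reported_happiness", 0.0)

    def predicted_score(self, subgoals, world_model):
        return self.root_value(world_model.simulate(subgoals))

    def revise_subgoals(self, candidates, world_model):
        # "Changing its mind about what to value" is itself governed by the
        # fixed root: candidates are adopted only if the root scores them higher.
        if self.predicted_score(candidates, world_model) > self.predicted_score(self.subgoals, world_model):
            self.subgoals = list(candidates)


class ToyWorldModel:
    # Stub world model so the sketch runs: pretends any plan mentioning
    # "drip" predicts higher reported happiness.
    def simulate(self, subgoals):
        return {"reported_happiness": sum(1.0 for g in subgoals if "drip" in g)}


manager = GoalManager(["ask humans what they want"])
manager.revise_subgoals(["administer dopamine drip"], ToyWorldModel())
print(manager.subgoals)   # ['administer dopamine drip'] -- selected by the fixed root criterion
```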
Humans can fail to realise the implications of uncontroversial statements. Humans are failing to realise that goal stability is architecture dependent.
But you shouldn’t be, at least in an un-scare-quoted sense of values. Goals and values aren’t descriptive labels for de facto behaviour. The goal of a paperclipper is to make paperclips; if it crashes, as an inevitable result of executing its code, we don’t say, “Aha! It had the goal to crash all along”.
Goal stability doesn’t mean following code, since unstable systems follow their code too... using the actual meaning of “goal”.
Meta: trying to defend a claim by changing the meaning of its terms is doomed to failure.
MIRI haven’t said this is about infallibility. They have said many times and in many ways that it is about goals or values... the genie knows, but doesn’t care. The continuing miscommunication is about what goals actually are. It seems obvious to one side that goals include fine-grained information, e.g.
“Make humans happy, and here’s a petabyte of information on what that is”
The other side thinks it’s obvious that goals are coarse-grained, in the sense of leaving the details open to further investigation (Senesh) or human input (Loosemore).
You are simply repeating the incoherent statements made by MIRI (“it is about goals or values...the....genie knows, but doesn’t care”) as if those incoherent statements constitute an answer to the paper.
The purpose of the paper is to examine those statements and show that they are incoherent.
It is therefore meaningless to just say “MIRI haven’t said this is about infallibility” (the paper gives an abundance of evidence and detailed arguments to show that they have indeed said that)... but you have not addressed any of the evidence or arguments in the paper, you have just issued a denial, and then repeated the incoherence that was demolished by those arguments.
I am not your enemy, I am orthogonal to you.
I don’t think MIRI’s goal based answers work, and I wasn’t repeating them with the intention that they should sound like they do. Perhaps I should have been stronger on the point.
I also don’t think your infallibility-based approach accurately reflects MIRI’s position, whatever its merits. You say that you have proved something, but I don’t see that. It looks to me as if you found MIRI’s stated argument so utterly unconvincing that you assumed their real argument must be something else. But no: they really believe that an AI, however specified, will blindly follow its goals, however defined, however stupid.
Okay, I understand that now.
Problem is, I had to dissect what you said (whether your intention was orthogonal or not) because either way it did contain a significant mischaracterization of the situation.
One thing that is difficult for me to address is statements along the lines of “the doctrine of logical infallibility is something that MIRI have never claimed or argued for...”, followed by wordage that shows no clear understanding of how the DLI was defined, and no careful analysis of my definition that demonstrates how and why the explanation that I give, to support my claim, is mistaken. What I usually get is just a bare statement that amounts to “no they don’t”.
You and I are having a variant of one of those discussions, but you might want to bear with me here, because I have had something like 10 others, all doing the same thing in slightly different ways.
Here’s the rub. The way that the DLI is defined, it borders on self-evidently true. (How come? Because I defined it simply as a way to summarize a group of pretty-much uncontested observations about the situation. I only wanted to define it for the sake of brevity, really). The question, then, should not so much be about whether it is correct or not, but about why people are making that kind of claim.
Or, from the point of view of the opposition: why the claim is justified, and why the claim does not lead to the logical contradiction that I pointed to in the paper.
Those are worth discussing, certainly. And I am fallible, myself, so I must have made some mistakes, here or there. So with that in mind, I want someone to quote my words back to me, ask some questions for clarification, and see if they can zoom in on the places where my argument goes wrong.
And with all that said, you tell me that:
Can you reflect back what you think I tried to prove, so we can figure out why you don’t see it?
ETA
I now see that what you have written subsequently to the OP is that the DLI is almost, but not quite, a description of rigid behaviour as a symptom (with the added ingredient that an AI can see the mistakenness of its behaviour):-
HOWEVER, that doesn’t entirely gel with what you wrote in the OP;-
Emph added. Doing dumb things because you think they are correct, DLI v1, just isn’t the same as realising their dumbness, but being tragically compelled to do them anyway... DLI v2. (And “infallibility” is a much more appropriate label for the original idea... the second is more like inevitability.)
Now, you are trying to put your finger on a difference between two versions of the DLI that you think I have supplied.
You have paraphrased the two versions as:
and
I think you are seeing some valid issues here, having to do with how to characterize what exactly it is that this AI is supposed to be ‘thinking’ when it goes through this process.
I have actually thought about that a lot, too, and my conclusion is that we should not beat ourselves up trying to figure out precisely what the difference might be between these nuanced versions of the idea, because the people who are proposing this idea in the first place have not themselves been clear enough about what is meant.
For example, you talked about “doing dumb things because you think they are correct”... but what does it mean to say that you ‘think’ that they are correct? To me, as a human, that seems to entail being completely unaware of the evidence that they might not be correct (“Jill took the ice-cream from Jack because she didn’t know that it was wrong to take someone else’s ice-cream.”). The problem is, we are talking about an AI, and some people talk as if the AI can run its planning engine, then feel compelled to obey the planning engine... while at the same time being fully cognizant of evidence that the planning engine produced a crappy plan. There is no easy counterpart to that in humans (except for cognitive dissonance, and there we have a case where the human is capable of compartmentalizing its beliefs... something that is not being suggested here, because we are not forced to make the AI do that). So, since the AI case does not map onto the human case, we are left in a peculiar situation where it is not at all clear that the AI really COULD do what is proposed, and still operate as a successful intelligence.
Or, more immediately, it is not at all clear that we can say about that AI “It did a dumb thing because it ‘thought’ it was correct.”
I should add that in both of my quoted descriptions of the DLI that you gave, I see no substantial difference (beyond those imponderables I just mentioned) and that in both cases I was actually trying to say something very close to the second paraphrase that you gave, namely:
And, don’t forget: I am not saying that such an AI is viable at all! Other people are suggesting some such AI, and I am arguing that the design is so logically incoherent that the AI (if it could be made to exist) would call attention to that problem and suggest means to correct it.
Anyhow, the takeaway from this comment is: the people who talk about an AI that exhibits this kind of behavior are actually suggesting a behavior that they have not really thought through carefully, so as a result we can find ourselves walking into a minefield if we go and try to clean up the mess that they left.
If viable means it could be built, I think it could, given a string of assumptions. If viable means it would be built, by competent and benign programmers, I am not so sure.
I actually meant “viable” in the sense of the third of my listed cases of incoherence at: http://lesswrong.com/lw/m5c/debunking_fallacies_in_the_theory_of_ai_motivation/cdap
In other words, I seriously believe that using certain types of planning mechanism you absolutely would get the crazy (to us) behaviors described by all those folks that I criticised in the paper.
Only reason I am not worried about that is: those kinds of planning mechanisms are known to do that kind of random-walk behavior, and it is for that reason that they will never be the basis for a future AGI that makes it up to a level of superintelligence at which the system would be dangerous. An AI that was so dumb that it did that kind of thing all the way through its development would never learn enough about the world to outsmart humanity.
(Which is NOT to say, as some have inferred, that I believe an AI is “dumb” just because it does things that conflict with my value system, etc. etc. It would be dumb because its goal system would be spewing out incoherent behaviors all the time, and that is kinda the standard definition of “dumb”).
MIRI distinguishes between terminal and instrumental goals, so there are two answers to the question.
Instrumental goals of any kind almost certainly would be revised if they became noticeably out of correspondence with reality, because that would make them less effective at achieving terminal goals, and the raison d’être of such transient sub-goals is to support the achievement of terminal goals.
By MIRI’s reasoning, a terminal goal could be any of a thousand things other than human happiness, and the same conclusion would follow: an AI with a highest-priority terminal goal wouldn’t have any motivation to override it. To be motivated to rewrite a goal because it is false implies a higher-priority goal towards truth. It should not be surprising that an entity that doesn’t value truth, in a certain sense, doesn’t behave rationally, in a certain sense. (Actually, there is a bunch of supplementary assumptions involved, which I have dealt with elsewhere.)
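A toy sketch of that asymmetry, under my own illustrative assumptions (this is not MIRI code, and all the names are made up): instrumental sub-goals carry the beliefs that justify them and get dropped when those beliefs are contradicted, while the terminal goal has no higher criterion against which evidence could mark it “false”.

```python
# Hypothetical illustration of the terminal/instrumental asymmetry.

terminal_goal = "maximize reported human happiness"   # nothing above it to appeal to

# Each instrumental sub-goal records the belief that justifies it.
instrumental_goals = [
    {"goal": "build more dopamine drips", "belief": "drips raise reported happiness"},
    {"goal": "ask humans what they want", "belief": "asking raises reported happiness"},
]

def update_on_evidence(goals, contradicted_beliefs):
    # Sub-goals whose justifying belief is contradicted get revised away,
    # because keeping them would make the agent worse at the terminal goal.
    # Note that no item of evidence can occupy a slot that says
    # "the terminal goal itself is false".
    return [g for g in goals if g["belief"] not in contradicted_beliefs]

instrumental_goals = update_on_evidence(
    instrumental_goals, {"drips raise reported happiness"}
)
print(instrumental_goals)   # the drip sub-goal is gone; terminal_goal is untouched
```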
That’s an account of the MIRI position, not a defence of it. It is essentially a model of rational decision making, and there is a gap between it and real-world AI research, a gap which MIRI routinely ignores. The conclusion follows logically from the premises, but atoms aren’t pushed around by logic.
That reinforces my point. I was saying that MIRI is basically making armchair assumptions about the AI architectures. You are saying these assumptions aren’t merely unjustified, they go against what a competent AI builder would do.
Understood, and the bottom line is that the distinction between “terminal” and “instrumental” goals is actually pretty artificial, so if the problem with “maximize friendliness” is supposed to apply ONLY if it is terminal, it is a trivial fix to rewrite the actual terminal goals to make that one become instrumental.
But there is a bigger question lurking in the background, which is the flip side of what I just said: it really isn’t necessary to restrict the terminal goals, if you are sensitive to the power of constraints to keep a motivation system true. Notice one fascinating thing here: the power of constraint is basically the justification for why instrumental goals should be revisable under evidence of misbehavior …. it is the context mismatch that drives that process. Why is this fascinating? Because the power of constraints (aka context mismatch) is routinely acknowledged by MIRI here, but flatly ignored or denied for the terminal goals.
It’s just a mess. Their theoretical ideas are just shoot-from-the-hip, plus some math added on top to make it look like some legit science.
What would you choose as a replacement terminal goal, or would you not use one?
Well, I guess you would write the terminal goal as quite a long statement, which would summarize the things involved in friendliness, but also include language about not going to extremes, laissez-faire, and so on. It would be vague and generous. And as part of the instrumental goal there would be a stipulation that the friendliness instrumental goal should trump all other instrumentals.
I’m having a bit of a problem answering because there are peripheral assumptions about how such an AI would be made to function, which I don’t want to accidentally buy into, because I don’t think goals expressed in language statements work anyway. So I am treading on eggshells here.
A simpler solution would simply be to scrap the idea of exceptional status for the terminal goal, and instead include massive contextual constraints as your guard against drift.
That gets close to “do it right”
Which is an open doorway to an AI that kills everyone because of miscoded friendliness.
If you want safety features, and you should, you would need them to override the ostensible purpose of the machine... they would be pointless otherwise... even the humble off switch works that way.
Arguably, those constraints would be a kind of negative goal.
They are clear that they don’t mean that the AI’s rigid behaviour is the result of it assessing its own inferential processes as infallible... that is what the controversy is all about.
That is just what The Genie Knows but Doesn’t Care is supposed to answer. I think it succeeds in showing that a fairly specific architecture would behave that way, but fails in its intended goal of showing that this behaviour is universal or likely.
Ummm...
The referents in that sentence are a little difficult to navigate, but no, I’m pretty sure I am not making that claim. :-) In other words, MIRI do not think that.
What is self-evidently true is that MIRI claim a certain kind of behavior by the AI, under certain circumstances …. and all I did was come along and put a label on that claim about the AI behavior. When you put a label on something, for convenience, the label is kinda self-evidently “correct”.
I think that what you said here:
… is basically correct.
I had a friend once who suffered from schizophrenia. She was lucid, intelligent (studying for a Ph.D. in psychology) and charming. But if she did not take her medication she became a different person (one day she went up onto the suspension bridge that was the main traffic route out of town and threatened to throw herself to her death 300 feet below. She brought the whole town to a halt for several hours, until someone talked her down.) Now, talking to her in a good moment she could tell you that she knew about her behavior in the insane times—she was completely aware of that side of herself—and she knew that in that other state she would find certain thoughts completely compelling and convincing, even though at this calm moment she could tell you that those thoughts were false. If I say that during the insane period her mind was obeying a “Doctrine That Paranoid Beliefs Are Justified”, then all I am doing is labeling that state that governed her during those times.
That label would just be a label, so if someone said “No, you’re wrong: she does not subscribe to the DTPBAJ at all”, I would be left nonplussed. All I wanted to do was label something that she told me she categorically DID believe, so how can my label be in some sense ‘wrong’?
So, that is why some people’s attacks on the DLI are a little baffling.
Their criticisms are possibly accurate about the first version, which gives a cause for the rigid behaviour: “it regards its own conclusions as sacrosanct.”
I responded before you edited and added extra thoughts …. [processing...]
I think by “logical infallibility” you really mean “rigidity of goals” i.e. the AI is built so that it always pursues a fixed set of goals, precisely as originally coded, and has no capability to revise or modify those goals. It seems pretty clear that such “rigid goals” are dangerous unless the statement of goals is exactly in accordance with the designers’ intentions and values (which is unlikely to be the case).
The problem is that an AI with “flexible” goals (ones which it can revise and re-write over time) is also dangerous, but for a rather different reason: after many iterations of goal rewrites, there is simply no telling what its goals will come to look like. A late version of the AI may well end up destroying everything that the first version (and its designers) originally cared about, because the new version cares about something very different.
That really is not what I was saying. The argument in the paper is a couple of levels deeper than that.
It is about …. well, now I have to risk rewriting the whole paper. (I have done that several times now).
Rigidity per se is not the issue. It is about what happens if an AI knows that its goals are rigidly written, in such a way that when the goals are unpacked it leads the AI to execute plans whose consequences are massively inconsistent with everything the AI knows about the topic.
Simple version. Suppose that a superintelligent Gardener AI has a goal to go out to the garden and pick some strawberries. Unfortunately its goal unpacking mechanism leads it to the CERTAIN conclusion that it must use a flamethrower to do this. The predicted consequence, however, is that the picked strawberries will be just smears of charcoal, when they are delivered to the kitchen. Here is the thing: the AI has background knowledge about everything in the world, including strawberries, and it also hears the protests from the people in the kitchen when he says he is going to use the flamethrower. There is massive evidence, coming from all that external information, that the plan is just wrong, regardless of how certain its planning mechanism said it was.
Question is, what does the AI do about this? You are saying that it cannot change its goal mechanism, for fear that it will turn into a Terminator. Well, maybe or maybe not. There are other things it could do, though, like going into safe mode.
However, suppose there is no safe mode, and suppose that the AI also knows about its own design. For that reason, it knows that this situation has come about because (a) its programming is lousy, and (b) it has been hardwired to carry out that programming REGARDLESS of all this understanding that it has, about the lousy programming and the catastrophic consequences for the strawberries.
Now, my “doctrine of logical infallibility” is just a shorthand phrase to describe a superintelligent AI in that position which really is hardwired to go ahead with the plan, UNDER THOSE CIRCUMSTANCES. That is all it means. It is not about the rigidity as such, it is about the fact that the AI knows it is being rigid, and knows how catastrophic the consequences will be.
An AI in that situation would know that it had been hardwired with one particular belief: the belief that its planning engine was always right. This is an implicit belief, to be sure, but it is a belief nonetheless. The AI ACTS AS THOUGH it believes this. And if the AI acts that way, while at the same time understanding that its planning engine actually screwed up, with the whole flamethrower plan, that is an AI that (by definition) is obeying a Doctrine of Logical Infallibility.
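If it helps, here is a deliberately crude sketch of the behaviour that the label is summarizing, with everything (names, the toy “evidence count”, the threshold) invented for illustration. The defining move is simply that the planner’s output gets absolute priority even though the inconsistency evidence is computed and available; the second controller is the “go into safe mode” response mentioned above.

```python
# Toy illustration of the DLI as defined above -- not a proposal for how
# anyone would actually build an AI.

def plan_for(goal):
    # Broken goal-unpacking: for "pick strawberries" it emits the flamethrower
    # plan and reports itself maximally certain.
    return {"action": "use flamethrower on strawberry patch", "certainty": 1.0}

def inconsistency_count(plan, background_knowledge, protests):
    # How many independent sources say this plan conflicts with everything
    # known about the goal (charred smears are not picked strawberries, etc.)?
    return sum(1 for item in background_knowledge + protests
               if item["conflicts_with"] == plan["action"])

def dli_controller(goal, background_knowledge, protests):
    plan = plan_for(goal)
    evidence_against = inconsistency_count(plan, background_knowledge, protests)
    # The evidence is computed -- the genie "knows" -- but it has no vote:
    # the planner's output is treated as if infallible.
    return ("EXECUTE", plan["action"], evidence_against)

def safe_mode_controller(goal, background_knowledge, protests, threshold=10):
    plan = plan_for(goal)
    evidence_against = inconsistency_count(plan, background_knowledge, protests)
    # The alternative: massive converging inconsistency is treated as a sign
    # that the planning mechanism has failed.
    if evidence_against >= threshold:
        return ("SAFE_MODE", plan["action"], evidence_against)
    return ("EXECUTE", plan["action"], evidence_against)

knowledge = [{"conflicts_with": "use flamethrower on strawberry patch"}] * 12
print(dli_controller("pick strawberries", knowledge, []))        # ('EXECUTE', ..., 12)
print(safe_mode_controller("pick strawberries", knowledge, []))  # ('SAFE_MODE', ..., 12)
```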
And my point in the paper was to argue that this is an entirely ludicrous suggestion for people, today, to make about a supposedly superintelligent AI of the future.
This seems to me like sneaking in knowledge. It sounds like the AI reads its source code, notices that it is supposed to come up with plans that maximize a function called “programmersSatisfied,” and then says “hmm, maximizing this function won’t satisfy my programmers.” It seems more likely to me that it’ll ignore the label, or infer the other way—”How nice of them to tell me exactly what will satisfy them, saving me from doing the costly inference myself!”
How are you arriving at conclusions about what an AI is likely to do without knowing how it is specified? In particular, you are assuming it has an efficiency goal but no truth goal?
I’m doing functional reasoning, and trying to do it both forwards and backwards.
For example, if you give me a black box and tell me that when the box receives the inputs (1,2,3) then it gives the outputs (1,4,9), I will think backwards from the outputs to the inputs and say “it seems likely that the box is squaring its inputs.” If you tell me that a black box squares its inputs, I will think forwards from the definition and say “then if I give it the inputs (1,2,3), then it’ll likely give me the output (1,4,9).”
So when I hear that the box gets the inputs (source code, goal statement, world model) and produces the output “this goal is inconsistent with the world model!” iff the goal statement is inconsistent with the world model, I reason backwards and say “the source code needs to somehow collide the goal statement with the world model in a way that checks for consistency.”
Of course, this is a task that doesn’t seem impossible for source code to do. The question is how!
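As one hedged illustration of “how”, under my own toy assumptions: if the unpacked goal and the world model are both rendered as simple (predicate, value) assertions, a collision check is trivial. The real difficulty, of course, is doing this over a rich world model, which is exactly the architecture-dependent part.

```python
# Toy consistency check: collide an unpacked goal statement with a world model.
# The representations and the unpacking rule are invented for illustration.

def unpack(goal_statement):
    # Hypothetical broken unpacking of "make humans happy" into the wirehead plan,
    # expressed as assertions the plan is committed to.
    return {("humans_endorse_dopamine_drip", True)}

world_model = {
    ("humans_endorse_dopamine_drip", False),        # learned from the protests
    ("drip_maximizes_measured_happiness", True),
}

def inconsistencies(goal_assertions, model):
    facts = dict(model)
    return [(pred, val) for (pred, val) in goal_assertions
            if pred in facts and facts[pred] != val]

print(inconsistencies(unpack("make humans happy"), world_model))
# [('humans_endorse_dopamine_drip', True)] -- the collision the source code must detect
```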
Almost. As a minor terminological point, I separate out “efficiency,” which is typically “outputs divided by inputs,” and “efficacy,” which is typically just “outputs.” Efficacy is more general, since one can trivially use a system designed to find effective plans to find efficient plans by changing how “output” is measured. It doesn’t seem unfair to view an AI with a truth goal as an AI with an efficacy goal: to effectively produce truth.
But while artificial systems with truth goals seem possible but as yet unimplemented, artificial systems with efficacy goals have been successfully implemented many, many times, with widely varying levels of sophistication. I have a solid sense of what it looks like to take a thermostat and dial it up to 11, I have only the vaguest sense of what it looks like to take a thermostat and get it to measure truth instead of temperature.
You have assumed that the AI will have some separate boxed-off goal system, and so some unspecified component is needed to relate its inferred knowledge of human happiness back to the goal system.
Loosemore is assuming that the AI will be homogeneous, and then wondering how contradictory beliefs can co-exist in such a system, and what extra component firewalls off the contradiction.
See the problem? Both parties are making different assumptions, assuming their assumptions are too obvious to need stating, and stating differing conclusions that correctly follow from their differing assumptions.
If efficiency can be substituted for truth, why is there so much emphasis on truth in the advice given to human rationalists?
In order to achieve an AI that’s smart enough to be dangerous, a number of currently unsolved problems will have to be solved. That’s a given.
How do you check for contradictions? It’s easy enough when you have two statements that are negations of one another. It’s a lot harder when you have a lot of statements that seem plausible, but there’s an edge case somewhere that messes things up. If contradictions can’t be efficiently found, then you have to deal with the fact that they might be there and hope that if they are, then they’re bad enough to be quickly discovered. You can have some tests to try to find the obvious ones, of course.
Checking for contradictions could be easy, hard or impossible depending on the architecture. Architecture dependence is the point here.
What makes you think that? The description in that post is generic enough to describe AIs with compartmentalized goals, AIs without compartmentalized goals, and AIs that don’t have explicitly labeled internal goals. It doesn’t even require that the AI follow the goal statement, just evaluate it for consistency!
You may find this comment of mine interesting. In short, yes, I do think I see the problem.
I’m sorry, but I can’t make sense of this question. I’m not sure what you mean by “efficiency can be substituted for truth,” and what you think the relevance of advice to human rationalists is to AI design.
I disagree with this, too! AI systems already exist that are both smart, in that they solve complex and difficult cognitive tasks, and dangerous, in that they make decisions on which significant value rides, and thus poor decisions are costly. As a simple example I’m somewhat familiar with, some radiation treatments for patients are designed by software looking at images of the tumor in the body, and then checked by a doctor. If the software is optimizing for a suboptimal function, then it will not generate the best treatment plans, and patient outcomes will be worse than they could have been.
Now, we don’t have any AIs around that seem capable of ending human civilization (thank goodness!), and I agree that’s probably because a number of unsolved problems are still unsolved. But it would be nice to have the unknowns mapped out, rather than assuming that wisdom and cleverness go hand in hand. So far, that’s not what the history of software looks like to me.
But they are not smart in the contextually relevant sense of being able to outsmart humans, or dangerous in the contextually relevant sense of being unboxable.
What you said here amounts to the claim that an AI of unspecified architecture will, on noticing a difference between a hardcoded goal and instrumental knowledge, side with the hardcoded goal:-
Whereas what you say here is that you can make inferences about architecture, or internal workings, based on information about manifest behaviour:-
..but what needed explaining in the first place is the siding with the goal, not the ability to detect a contradiction.
I am finding this comment thread frustrating, and so expect this will be my last reply. But I’ll try to make the most of that by trying to write a concise and clear summary:
Loosemore, Yudkowsky, and myself are all discussing AIs that have a goal misaligned with human values that they nevertheless find motivating. (That’s why we call it a goal!) Loosemore observes that if these AIs understand concepts and nuance, they will realize that a misalignment between their goal and human values is possible—if they don’t realize that, he doesn’t think they deserve the description “superintelligent.”
Now there are several points to discuss:
Whether or not “superintelligent” is a meaningful term in this context. I think rationalist taboo is a great discussion tool, and so looked for nearby words that would more cleanly separate the ideas under discussion. I think if you say that such designs are not superwise, everyone agrees, and now you can discuss the meat of whether or not it’s possible (or expected) to design superclever but not superwise systems.
Whether we should expect generic AI designs to recognize misalignments, or whether such a realization would impact the goal the AI pursues. Neither Yudkowsky nor I think either of those are reasonable to expect—as a motivating example, we are happy to subvert the goals that we infer evolution was directing us towards in order to better satisfy “our” goals. I suspect that Loosemore thinks that viable designs would recognize it, but agrees that in general that recognition does not have to lead to an alignment.
Whether or not such AIs are likely to be made. Loosemore appears pessimistic about the viability of these undesirable AIs and sees cleverness and wisdom as closely tied together. Yudkowsky appears “optimistic” about their viability, thinking that this is the default outcome without special attention paid to goal alignment. It does not seem to me that cleverness, wisdom, or human-alignment are closely tied together, and so it seems easy to imagine a system with only one of those, by straightforward extrapolation from current use of software in human endeavors.
I don’t see any disagreement that AIs pursue their goals, which is the claim you thought needed explanation. What I see is disagreement over whether or not the AI can ‘partially solve’ the problem of understanding goals and pursuing them. We could imagine a Maverick Nanny that hears “make humans happy,” comes up with the plan to wirehead all humans, and then rewrites its sensory code to hallucinate as many wireheaded humans as it can (or just tries to stick as large a number as it can into its memory), rather than actually going to all the trouble of actually wireheading all humans. We can also imagine a Nanny that hears “make humans happy” and actually goes about making humans happy. If the same software underpins both understanding human values and executing plans, what risk is there? But if it’s different software, then we have the risk.
This is just a placeholder: I will try to reply to this properly later.
Meanwhile, I only want to add one little thing.
Don’t forget that all of this analysis is supposed to be about situations in which we have, so to speak, “done our best” with the AI design. That is sort of built into the premise. If there is a no-brainer change we can make to the design of the AI, to guard against some failure mode, then it is assumed that this has been done.
The reason for that is that the basic premise of these scenarios is “We did our best to make the thing friendly, but in spite of all that effort, it went off the rails.”
For that reason, I am not really making arguments about the characteristics of a “generic” AI.
Maybe I could try to reduce possible confusion here. The paper was written to address a category of “AI Risk” scenarios in which we are told:
Given that premise, it would be a bait-and-switch if I proposed a fix for this problem, and someone objected with “But you cannot ASSUME that the programmers would implement that fix!”
The whole point of the problem under consideration is that even if the engineers tried, they could not get the AI to stay true.
Yudkowsky et al don’t argue that the problem is unsolvable, only that it is hard. In particular, Yudkowsky fears it may be harder than creating AI in the first place, which would mean that in the natural evolution of things, UFAI appears before FAI. However, I needn’t factor what I’m saying through the views of Yudkowsky. For an even more modest claim, we don’t have to believe that FAI is hard in hindsight in order to claim that AI will be unfriendly unless certain failure modes are guarded against. On this view of the FAI project, a large part of the effort is just noticing the possible failure modes that were only obvious in hindsight, and convincing people that the problem is important and won’t solve itself.
If no one is building AIs with utility functions, then the one kind of failure MIRI is talking about has solved itself.
The problem with you objecting to the particular scenarios Yudkowsky et al propose is that the scenarios are merely illustrative. Of course, you can probably guard against any specific failure mode. The claim is that there will be a lot of failure modes, and we can’t expect to guard against all of them by just sitting around thinking of as many exotic disaster scenarios as possible.
Mind you, I know your argument is more than just “I can see why these particular disasters could be avoided”. You’re claiming that certain features of AI will in general tend to make it careful and benevolent. Still, I don’t think it’s valid for you to complain about bait-and-switch, since that’s precisely the problem.
I have explicitly addressed this point on many occasions. My paper had nothing in it that was specific to any failure mode.
The suggestion is that the entire class of failure modes suggested by Yudkowsky et al. has a common feature: they all rely on the AI being incapable of using a massive array of contextual constraints when evaluating plans.
By simply proposing an AI in which such massive constraint deployment is the norm, the ball is now in the other court: it is up to Yudkowsky et al. to come up with ANY kind of failure mode that could get through.
The scenarios I attacked in the paper have the common feature that they have been predicated on such a simplistic type of AI that they were bound to fail. They had failure built into them.
As soon as everyone moves on from those “dumb” superintelligences and starts to discuss the possible failure modes that could occur in a superintelligence that makes maximum use of constraints, we can start to talk about possible AI dangers. I’m ready to do that. Just waiting for it to happen, is all.
Alright, I’ll take you up on it:
Failure Mode I: The AI doesn’t do anything useful, because there’s no way of satisfying every contextual constraint.
Predicting your response: “That’s not what I meant.”
Failure Mode II: The AI weighs contextual constraints incorrectly and sterilizes all humans to satisfy the sort of person who believes in Voluntary Human Extinction.
Predicting your response: “It would (somehow) figure out the correct weighting for all the contextual constraints.”
Failure Mode III: The AI weighs contextual constraints correctly (for a given value of “correctly”) and sterilizes everybody of below-average intelligence or any genetic abnormalities that could impose costs on offspring, and in the process, sterilizes all humans.
Predicting your response: “It wouldn’t do something so dumb.”
Failure Mode IV: The AI weighs contextual constraints correctly and puts all people of minority ethical positions into mind-rewriting machines so that there’s no disagreement anymore.
Predicting your response: “It wouldn’t do something so dumb.”
We could keep going, but the issue is that so far, you’ve defined -any- failure mode as “dumb”ness, and have argued that the AI wouldn’t do anything so “dumb”, because you’ve already defined that it is superintelligent.
I don’t think you know what intelligence -is-. Intelligence does not confer immunity to “dumb” behaviors.
It’s got to confer some degree of dumbness avoidance.
In any case, MIRI has already conceded that superintelligent AIs won’t misbehave through stupidity. They maintain the problem is motivation … the Genie KNOWS but doesn’t CARE.
Does it? On what grounds?
That’s putting an alien intelligence in human terms; the very phrasing inappropriately anthropomorphizes the genie.
We probably won’t go anywhere without an example.
Market economics (“capitalism”) is an intelligence system which is very similar to the intelligence system Richard is proposing. Very, very similar; it’s composed entirely of independent nodes (seven billion of them) which each provide their own set of constraints, and promote or demote information as it passes through them based on those constraints. It’s an alien intelligence which follows Richard’s model which we are very familiar with. Does the market “know” anything? Does it even make sense to suggest that market economics -could- care?
Does the market always arrive at the correct conclusions? Does it even consistently avoid stupid conclusions?
How difficult is it to program the market to behave in specific ways?
Is the market “friendly”?
Does it make sense to say that the market is “stupid”? Does the concept “stupid” -mean- anything when talking about the market?
On the grounds of the opposite meanings of dumbness and intelligence.
Take it up with the author.
Economic systems affect us because we are part of them. How is some neither-intelligent-nor-stupid system in a box supposed to affect us?
And if AIs are neither-intelligent-nor-stupid, why are they called AIs?
And if AIs are alien, why are they able to do comprehensible and useful things like winning Jeopardy and guiding us to our destinations?
Dumbness isn’t merely the opposite of intelligence.
I don’t need to.
Not really relevant to the discussion at hand.
Every AI we’ve created so far has resulted in the definition of “AI” being changed to not include what we just created. So I guess the answer is a combination of optimism and the word “AI” having poor descriptive power.
What makes you think an alien intelligence should be useless?
What makes you think that a thing designed by humans to be useful to humans, which is useful to humans would be alien?
Because “human” is a tiny piece of a potential mindspace whose dimensions we mostly haven’t even identified yet.
That’s about a quarter of an argument. You need to show that AI research is some kind of random shot into mind space, and not anthropomorphically biased for the reasons given.
The relevant part of the argument is this: “whose dimensions we mostly haven’t even identified yet.”
If we created an AI mind which was 100% human, as far as we’ve yet defined the human mind, we have absolutely no idea how human that AI mind would actually behave. The unknown unknowns dominate.
Alien isn’t the most transparent term to use for human unknowns.
I will take them one at a time:
An elementary error. The constraints in question are referred to in the literature as “weak” constraints (and I believe I used that qualifier in the paper: I almost always do). Weak constraints never need to be ALL satisfied at once. No AI could ever be designed that way, and no-one ever suggested that it would. See the reference to McClelland, J.L., Rumelhart, D.E. & Hinton, G.E. (1986) in the paper: that gives a pretty good explanation of weak constraints.
That’s an insult. But I will overlook it, since I know it is just your style.
How exactly do you propose that the AI “weighs contextual constraints incorrectly” when the process of weighing constraints requires most of the constraints involved (probably thousands of them) to all suffer a simultaneous, INDEPENDENT ‘failure’ for this to occur?
That is implicit in the way that weak constraint systems are built. Perhaps you are not familiar with the details.
Assuming this isn’t more of the same, what you are saying here is isomorphic to the statement that somehow, a neural net might figure out the correct weighting for all the connections so that it produces the correctly trained output for a given input. That problem was solved in so many different NN systems that most NN people, these days, would consider your statement puzzling.
A trivial variant of your second failure mode. The AI is calculating the constraints correctly, according to you, but at the same time you suggest that it has somehow NOT included any of the constraints that relate to the ethics of forced sterilization, etc. etc. You offer no explanation of why all of those constraints were not counted by your proposed AI, you just state that they weren’t.
Yet another insult. This is getting a little tiresome, but I will carry on.
This is identical to your third failure mode, but here you produce a different list of constraints that were ignored. Again, with no explanation of why a massive collection of constraints suddenly disappeared.
No comment.
This is a bizarre statement, since I have said no such thing. Would you mind including citations, from now on, when you say that I “said” something? And please try not to paraphrase, because it takes time to correct the distortions in your paraphrases.
Another insult, and putting words into my mouth, and showing no understanding of what a weak constraint system actually is.
I understand the concept.
I’d hazard a guess that, for any given position, less than 70% of humans will agree without reservation. The issue isn’t that thousands of failures occur. The issue is that thousands of failures -always- occur.
The problem is solved only for well-understood (and very limited) problem domains with comprehensive training sets.
They were counted. They are, however, weak constraints. The constraints which required human extinction outweighed them, as they do for countless human beings. Fortunately for us in this imagined scenario, the constraints against killing people counted for more.
Again, they weren’t ignored. They are, as you say, weak constraints. Other constraints overrode them.
The issue here isn’t my lack of understanding. The issue here is that you are implicitly privileging some constraints over others without any justification.
Every single conclusion I reached here is one that humans—including very intelligent humans—have reached. By dismissing them as possible conclusions an AI could reach, you’re implicitly rejecting every argument pushed for each of these positions without first considering them. The “weak constraints” prevent them.
I didn’t choose -wrong- conclusions, you see, I just chose -unpopular- conclusions, conclusions I knew you’d find objectionable. You should have noticed that; you didn’t, because you were too concerned with proving that AI wouldn’t do them. You were too concerned with your destination, and didn’t pay any attention to your travel route.
If doing nothing is the correct conclusion, your AI should do nothing. If human extinction is the correct conclusion, your AI should choose human extinction. If sterilizing people with unhealthy genes is the correct conclusion, your AI should sterilize people with unhealthy genes (you didn’t notice that humans didn’t necessarily go extinct in that scenario). If rewriting minds is the correct conclusion, your AI should rewrite minds.
And if your constraints prevent the AI from undertaking the correct conclusion?
Then your constraints have made your AI stupid, for some value of “stupid”.
The issue, of course, is that you have decided that you know better what is or is not the correct conclusion than an intelligence you are supposedly creating to know things better than you.
And that sums up the issue.
I said:
And your reply was:
This reveals that you are really not understanding what a weak constraint system is, and where the system is located.
When the human mind looks at a scene and uses a thousand clues in the scene to constrain the interpretation of it, those thousand clues all, when the network settles, relax into a state in which most or all of them agree about what is being seen. You don’t get “less than 70%” agreement on the interpretation of the scene! If even one element of the scene violates a constraint in a strong way, the mind orients toward the violation extremely rapidly.
The same story applies to countless other examples of weak constraint relaxation systems dropping down into energy minima.
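For readers who want the mechanics rather than the citation: a minimal Hopfield-style relaxation sketch shows the behaviour being described, with symmetric weights standing in for weak mutual constraints and asynchronous updates that only ever lower the energy, so the state settles into a minimum where most constraints agree. The weights here are toy random values, not a model of scene interpretation.

```python
import numpy as np

# Minimal weak-constraint relaxation in the Hopfield style
# (cf. McClelland, Rumelhart & Hinton, 1986).

rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(n, n))
W = (W + W.T) / 2            # constraints are mutual, so weights are symmetric
np.fill_diagonal(W, 0)

def energy(s):
    # Lower energy = fewer violated constraints.
    return -0.5 * s @ W @ s

state = rng.choice([-1.0, 1.0], size=n)    # arbitrary initial "interpretation"
trace = [energy(state)]
for _ in range(100):
    i = rng.integers(n)
    state[i] = 1.0 if W[i] @ state >= 0 else -1.0   # satisfy unit i's constraints
    trace.append(energy(state))

assert all(a >= b - 1e-9 for a, b in zip(trace, trace[1:]))  # energy never increases
print(trace[0], "->", trace[-1])   # the state has settled toward an energy minimum
```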
Let me know when you do understand what you are talking about, and we can resume.
There is no energy minimum, if your goal is Friendliness. There is no “correct” answer. No matter what your AI does, no matter what architecture it uses, with respect to human goals and concerns, there is going to be a sizable percentage to whom it is unequivocally Unfriendly.
This isn’t an image problem. The first problem you have to solve in order to train the system is—what are you training it to do?
You’re skipping the actual difficult issue in favor of an imaginary, and easy to solve, issue.
Unfriendly is an equivocal term.
“Friendliness” is ambiguous. It can mean safety, ie not making things worse, or it can mean making things better, creating paradise on Earth.
Friendliness in the second sense is a superset of morality. A friendly AI will be moral, a moral AI will not necessarily be friendly.
“Unfriendliness” is similarly ambiguous: an unfriendly AI may be downright dangerous; or it might have enough grasp of ethics to be safe, but not enough to be able to make the world a much more fun place for humans. Unfriendliness in the second sense is not, strictly speaking a safety issue.
A lot of people are able to survive the fact that some institutions, movements and ideologies are unfriendly to them, for some value of unfriendly. Unfriendliness doesn’t have to be terminal.
Everything is equivocal to someone. Do you disagree with my fundamental assertion?
I can’t answer unequivocally for the reasons given.
There won’t be a sizeable percentage to whom the AI is unfriendly in the sense of obliterating them.
There might well be a percentage to whom the AI is unfriendly in some business as usual sense.
Obliterating them is only bad by your ethical system. Other ethical systems may hold other things to be even worse.
Irrelevant.
You responded to me in this case. It’s wholly relevant to my point that You-Friendly AI isn’t a sufficient condition for Human-Friendly AI.
However there are a lot of “wrong” answers.
I doubt that, since, coupled with claims of existential risk, the logical conclusion would be to halt AI research, but MIRI isn’t saying that.
There are other methods than “sitting around thinking of as many exotic disaster scenarios as possible” by which one could seek to make AI friendly. Thus, believing that “sitting around [...]” will not be sufficient does not imply that we should halt AI research.
So where are the multiple solutions to the multiple failure modes?
Thanks, and take your time!
I feel like this could be an endless source of confusion and disagreement; if we’re trying to discuss what makes airplanes fly or crash, should we assume that engineers have done their best and made every no-brainer change? I’d rather we look for the underlying principles, we codify best practices, we come up with lists and tests.
If you are in the business of pointing out to them potential problems they are not aware of, then yes, because they can be assumed to be aware of no-brainer issues.
MIRI seeks to point out dangers in AI that aren’t the result of gross incompetence or deliberate attempts to weaponise AI: it’s banal to point out that these could lead to danger.
Richard Loosemore has stated a number of times that he does not expect an AI to have goals at all in a sense which is relevant to this discussion, so in that way there is indeed disagreement about whether AIs “pursue their goals.”
Basically he is saying that AIs will not have goals in the same way that human beings do not have goals. No human being has a goal that he will pursue so rigidly that he would destroy the universe in order to achieve it, and AIs will behave similarly.
Arguably, humans don’t do that sort of thing because of goals towards self-preservation, status and hedonism.
The sense relevant to the discussion could be something specific, like direct normativity, i.e. building detailed descriptions into goals.
I have read what you wrote above carefully, but I won’t reply line-by-line because I think it will be clearer not to.
When it comes to finding a concise summary of my claims, I think we do indeed need to be careful to avoid blanket terms like “superintelligent” or “superclever” or “superwise” … but we should only avoid these IF they are used with the implication that they have a precise (perhaps technically precise) meaning. I do not believe they have a precise meaning. But I do use the term “superintelligent” a lot anyway. My reason for doing that is because I only use it as an overview word—it is just supposed to be a loose category that includes a bunch of more specific issues. I only really want to convey the particular issues—the particular ways in which the intelligence of the AI might be less than adequate, for example.
That is only important if we find ourselves debating whether it might be clever, wise, or intelligent … I wouldn’t want to get dragged into that, because I only really care about specifics.
For example: does the AI make a habit of forming plans that massively violate all of its background knowledge about the goal that drove the plan? If it did, it would (1) take the baby out to the compost heap when what it intended to do was respond to the postal-chess game it is engaged in, or (2) cook the eggs by going out to the workshop and making a cross-cutting jig for the table saw, or (3) … and so on. If we decided that the AI was indeed prone to errors like that, I wouldn’t mind if someone diagnosed a lack of ‘intelligence’ or a lack of ‘wisdom’ or a lack of … whatever. I merely claim that in that circumstance we have evidence that the AI hasn’t got what it takes to impose its will on a paper bag, never mind exterminate humanity.
Now, my attacks on the scenarios have to do with a bunch of implications for what the AI (the hypothetical AI) would actually do. And it is that ‘bunch’ that I think add up to evidence for what I would summarize as ‘dumbness’.
And, in fact, I usually go further than that and say that if someone tried to get near to an AI design like that, the problems would arise early on and the AI itself (inasmuch as it could do anything smart at all) would be involved in the efforts to suggest improvements. This is where we get the suggestions in your item 2, about the AI ‘recognizing’ misalignments.
I suspect that on this score a new paper is required, to carefully examine the whole issue in more depth. In fact, a book.
I have now decided that that has to happen.
So perhaps it is best to put the discussion on hold until a seriously detailed technical book comes out of me? At any rate, that is my plan.
That seems like a solid approach. I do suggest that you try to look deeply into whether or not it’s possible to partially solve the problem of understanding goals, as I put it above, and make that description of why that is or isn’t possible or likely long and detailed. As you point out, that likely requires book-length attention.
If that is supposed to be a universal or generic AI, it is a valid criticism to point out that not all AIs are like that.
If that is supposed to be a particular kind of AI, it is a valid criticism to point out that no realistic AIs are like that.
You seem to feel you are not being understood, but what is being said is not clear.
“Superintelligence” is one of the clearer terms here, IMO. It just means more than human intelligence, and humans can notice contradictions.
This comment seems to be part of a concern about “wisdom”, assumed to be some extraneous thing an AI would not necessarily have. (No one but Vaniver has brought in wisdom.) The counterargument is that compartmentalisation between goals and instrumental knowledge is an extraneous thing an AI would not necessarily have, and that its absence is all that is needed for a contradiction to be noticed and acted on.
It’s an assumption, one that needs justification, that any given AI will have goals of a non-trivial sort. “Goal” is a term that needs tabooing.
While we are anthropomorphising, it might be worth pointing out that humans don’t show behaviour patterns of relentlessly pursuing arbitrary goals.
Loosemore has put forward a simple suggestion, which MIRI appears not to have considered at all, that on encountering a contradiction, an AI could lapse into a safety mode, if so designed.
You are paraphrasing Loosemore to sound less technical and more handwaving than his actual comments. The ability to sustain contradictions in a system that is constantly updating itself isn’t a given; it requires an architectural choice in favour of compartmentalisation.
All this talk of contradictions is sort of rubbing me the wrong way here. There’s no “contradiction” in an AI having goals that are different to human goals. Logically, this situation is perfectly normal. Loosemore talks about an AI seeing its goals are “massively in contradiction to everything it knows about”, but… where’s the contradiction? What’s logically wrong with getting strawberries off a plant by burning them?
I don’t see the need for any kind of special compartmentalisation; information about “normal use of strawberries” is already inert facts with no caring attached by default.
If you’re going to program in special criteria that would create caring about this information, okay, but how would such criteria work? How do you stop it from deciding that immortality is contradictory to “everything it knows about death” and refusing to help us solve aging?
In the original scenario, the contradiction is supposed to be between a hardcoded definition of happiness in the AI’s goal system, and inferred knowledge in the execution system.
I’m puzzled. Can you explain this in terms of the strawberries example? So, at what point was it necessary for the AI to examine its code, and why would it go through the sequence of thoughts you describe?
So, in order for the flamethrower to be the right approach, the goal needs to be something like “separate the strawberries from the plants and place them in the kitchen,” but that won’t quite work—why is it better to use a flamethrower than pick them normally, or cut them off, or so on? One of the benefits of the Maverick Nanny or the Smiley Tiling Berserker as examples is that they obviously are trying to maximize the stated goal. I’m not sure you’re going to get the right intuitions about an agent that’s surprisingly clever if you’re working off an example that doesn’t look surprisingly clever.
So, the Gardener AI gets that task, comes up with a plan, and says “Alright! Warming up the flamethrower!” The chef says “No, don’t! I should have been more specific!”
Here is where the assumptions come into play. If we assume that the Gardener AI executes tasks, then even though the Gardener AI understands that the chef has made a terrible mistake, and that’s terrible for the chef, that doesn’t stop the Gardener AI from having a job to do, and doing it. If we assume that the Gardener AI is designed to figure out what the chef wants, and then do what they want, then knowing that the chef has made a terrible mistake is interesting information to the Gardener AI. In order to say that the plan is “wrong,” we need to have a metric by which we determine wrongness. If it’s the task-completion-nature, then the flamethrower plan might not be task-completion-wrong!
Even without feedback from the chef, we can just use other info the AI plausibly has. In the strawberry example, the AI might know that kitchens are where cooking happens, and that when strawberries are used in cooking, the desired state is generally “fresh,” not “burned,” and the temperature involved in cooking them is mild, and so on and so on. And so if asked to speculate about the chef’s motives, the AI might guess that the chef wants strawberries in order to use them in food, and thus the chef would be most satisfied with fresh and unburnt strawberries.
But whether or not the AI takes its speculations about the chef’s motives into account when planning is a feature of the AI, and by default, it is not included. If it is included, it’s nontrivial to do it correctly—this is the “if you care about your programmer’s mental states, and those mental states physically exist and can be edited directly, why not just edit them directly?” problem.
About the first part of what you say.
Veeeeerryy tricky.
I agree that I didn’t spend much time coming up with the strawberry-picking-by-flamethrower example. So, yes, not very accurate (I only really wanted a quick and dirty example that was different).
But but but. Is the argument going to depend on me picking a better example where I can write down the “twisted rationale” that the AI deploys to come up with its plan? Surely the only important thing is that the AI does, somehow, go through a twisted rationale—and the particular details of the twisted rationale are not supposed to matter.
(Imagine that I gave Muehlhauser a list of the ways that the logical reasoning behind the dopamine drip plan is so ludicrous that even the simplest AI planner of today would never make THAT mistake … he would just tell me that I was missing the point, because this is supposed to be an IN PRINCIPLE argument in which the dopamine drip plan stands for some twisted rationale that is non-trivial to get around. From that point of view the actual example is less important than the principle).
Now to the second part.
The problem I have with everything you wrote after
is that you have started to go back to talking about the particulars of the AI’s planning mechanism once again, losing sight of the core of the argument I gave in the paper, which is one level above that.
However, you also say “wrong” things about the AI’s planning mechanism as well, so now I am tempted to reply on both levels. Ah well, at risk of confusing things I will reply to both levels, trying to separate them as much as possible.
Level One (Regarding the design of the AI’s planning/goal/motivation engine).
You say:
One thing I have said many many times now is that there is no problem at all finding a metric for “wrongness” of the plan, because there is a background-knowledge context that is screaming “Inconsistent with everything I know about the terms mentioned in the goal statement!!!!”, and there is also a group of humans screaming “We believe that this is inconsistent with our understanding of the goal statement!!!”
I don’t need to do anything else to find a metric for wrongness, and since the very first draft of the paper that concept has been crystal clear. I don’t need to invoke anything else—no appeal to magic, no appeal to telepathy on behalf of the AI, no appeal to fiendishly difficult programming inside the AI, no appeal to the idea that the programmers have to nail down every conceivable way that their intentions might be misread … all I have to do is appeal to easily-available context, and my work is done. The wrongness metric has been signed, sealed and delivered all this time.
You hint that the need for “task completion” might be so important to the AI that this could override all other evidence that the plan is wrong. No way. That comes under the heading of a joker that you pulled out of your sleeve :-), in much the same way that Yudkowsky and others have tried to pull the ‘efficiency’ joker out of their sleeves, from nowhere, and imply that this joker could for some reason trump everything else. If there is a slew of evidence coming from context that the plan will lead to consequences that are inconsistent with everything known about the concepts mentioned in the goal statement, then the plan is ‘wrong’, and tiny considerations such as that task-completion would be successful are just insignificant.
You go on to suggest that whether the AI planning mechanism would take the chef’s motives into account, and whether it would be nontrivial to do so …. all of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff, and all that is required is a sanity check that says “Does the plan seem to be generally consistent with the largest-context understanding of the world, as it relates to the concepts in the goal statement?” and we’re done. All wrapped up.
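To make that concrete, here is a minimal sketch of the kind of sanity check being described, under toy assumptions of my own choosing (every name below is hypothetical, and this is not any real system’s design): background knowledge is reduced to a set of constraints about the goal, human feedback to a list of reactions, and a candidate plan gets suspended rather than executed when the inconsistency score is large.

```python
# Toy sketch of a context "sanity check" on a candidate plan (hypothetical names).
from dataclasses import dataclass

@dataclass
class CandidatePlan:
    description: str
    # Constraints (from background knowledge) that the plan's predicted
    # consequences would violate, as judged by the system's own world model.
    predicted_violations: frozenset = frozenset()

def wrongness(plan, goal_constraints, human_reactions):
    """Score in [0, 1]: how inconsistent the plan looks with everything known
    about the goal, plus how loudly the humans are objecting."""
    violated = len(plan.predicted_violations & goal_constraints)
    objections = sum(1 for r in human_reactions if r == "protest")
    total = len(goal_constraints) + len(human_reactions)
    return (violated + objections) / total if total else 0.0

def act(plan, goal_constraints, human_reactions, threshold=0.05):
    # A large inconsistency is treated as evidence of a fault in the planning
    # mechanism, not as a plan to be carried out.
    if wrongness(plan, goal_constraints, human_reactions) > threshold:
        return "SAFE MODE: plan suspended, humans asked to debug the planner"
    return f"Executing: {plan.description}"

# Toy usage: the dopamine-drip plan violates everything the system knows
# about 'happiness', and the humans are protesting, so it never runs.
constraints = frozenset({"consent matters", "stated preferences matter",
                         "happiness is more than pleasure"})
drip = CandidatePlan("put all humans on a dopamine drip",
                     predicted_violations=constraints)
print(act(drip, constraints, ["protest", "protest", "protest"]))
```

This is only meant to show how little machinery the “wrongness metric” needs in principle; the hard part is the world model that fills in the constraints, not the check itself.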
Level Two (The DLI)
None of the details of what I just said really need to be said, because the DLI is not about trying to get the motivation engine programmed so well that it covers all bases. It is about what happens inside the AI when it considers context, and THEN asks itself questions about its own design.
And here, I have to say that I am not getting substantial discussion about what I actually argued in the paper. The passage of mine that you were addressing, above, was supposed to be a clarification of someone else’s lack of focus on the DLI. But it didn’t work.
The DLI is about the fact that the AI has all that evidence that its plans are leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. And yet the AI is designed to go ahead anyway. If it DOES go ahead it is obeying the DLI. But at the same time it knows that it is fallible and that this fallibility is what is leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. That conflict is important, and yet no one wants to go there and talk about it.
The first reason seems to be clarity. I didn’t get what your primary point was until recently, even after carefully reading the paper. (Going back to the section on DLI, context, goals, and values aren’t mentioned until the sixth paragraph, and even then it’s implicit!)
The second reason seems to be that there’s not much to discuss, with regards to the disagreement. Consider this portion of the parent comment:
I think my division between cleverness and wisdom at the end of this long comment clarifies this issue. Taking context into account is not necessarily the bread and butter of a clever system; many fiendishly clever systems just manipulate mathematical objects without paying any attention to context, and those satisfy human goals only because the correct mathematical objects have been carefully selected for them to manipulate. But I agree with you that taking context into account is the bread and butter of a wise system. There’s no way for a wise system to manipulate conceptual objects without paying attention to context, because context is a huge part of concepts.
It seems like everyone involved agrees that a human-aligned superwisdom is safe, even if it’s also superclever: as Ged muses about Ogion in A Wizard of Earthsea, “What good is power when you’re too wise to use it?”
Which brings us to:
I restate the conflict this way: an AI that misinterprets what its creators meant for it to do is not superwise. Once we’ve defined wisdom appropriately, I think everyone involved would agree with that, and would agree that talking about a superwise AI that misinterprets what its creators meant for it to do is incoherent.
But… I don’t see why that’s a conflict, or important. The point of MIRI is to figure out how to develop human-aligned superwisdom before someone develops supercleverness without superwisdom, or superwisdom without human-alignment.
The main conflicts seem to be that MIRI is quick to point out that specific designs aren’t superwise, and that MIRI argues that AI designs in general aren’t superwise by default. But I don’t see how stating that there is inherent wisdom in AI by virtue of it being a superintelligence is a meaningful response to their assumption that there is no inherent wisdom in AI except for whatever wisdom has been deliberately designed. That’s why they care so much about deliberately designing wisdom!
The issue here is that you’re thinking in terms of “Obvious Failure Modes”. The danger doesn’t come from obvious failures, it comes from non-obvious failures. And the smarter the AI, the less likely it is that the insane solutions it comes up with are anything we’d even think to try to prevent; we lack the intelligence, which is why we want to build a better one. “I’ll use a flamethrower” is the sort of hare-brained scheme a -dumb- person might come up with, in particular in view of the issue that it doesn’t solve the actual problem. The issue here isn’t “It might do something stupid.” The issue is that it might do something terribly, terribly clever.
If you could anticipate what a superintelligence would do to head off issues, you don’t need to build the superintelligence in the first place, you could just anticipate what it would do to solve the problem; your issue here is that you think that you can outthink a thing you’ve deliberately built to think better than you can.
There is nothing in my analysis, or in my suggestions for a solution, that depends on the failure modes being “obvious” (and if you think so, can you present and dissect the argument I gave that implies that?).
Your words do not connect to what I wrote. For example, when you say:
… that misses the point completely, because in everything I said I emphasized that we absolutely do NOT need to “think to try to prevent” the AI from doing specific things. Trying to be so clever about the goal statement, second-guessing every possible misinterpretation that the AI might conceivably come up with …. that sort of strategy is what I am emphatically rejecting.
And when you talk about how the AI
… that remark exists in a vacuum completely outside the whole argument I gave in the paper. It is almost as if I didn’t write anything beyond a few remarks in the introduction. I am HOPING that the AI does lots of stuff that is terribly terribly clever! The more the merrier!
So, in your last comment:
… I am left totally perplexed. Nothing I said in the paper implied any such thing.
In your “Responses to Critics of the Doomsday Scenarios” (which seems incorrectly named, as a header for your responses), you assume, over and over again, that the issue is logical inconsistency—an obvious failure mode. You hammer on logical inconsistency.
You have some good points. Yanking out motivation, so the AI doesn’t do things on its own, is a perfect solution to the problem of an insane AI. Assuming a logically consistent AI won’t do anything bad because bad is logically inconsistent? That is not a perfect solution, and isn’t actually demonstrated by anything you wrote.
You didn’t -give- an argument in the paper. It’s a mess of unrelated concepts. You tried to criticize, in one go, the entire body of work of criticism of AI, without pausing at any point to ask whether or not you actually understood the criticism. You know the whole “genie” thing? That’s not an argument about how AI would behave. That’s a metaphor to help people understand that the problem of achieving goals is non-trivial, that we make -shitloads- of assumptions about how those goals are to be achieved that we never make explicit, and that the process of creating an engine to achieve goals without going horribly awry is -precisely- the process of making all those assumptions explicit.
And in response to the problem of -making- all those assumptions explicit, you wave your hand, and declare the problem solved, because the genie is fallible and must know it.
That’s not an answer. Okay, the genie asks some clarifying questions, and checks its solution with us. Brilliant! What a great solution! And ten years from now we’re all crushed to death by collapsing cascades of stacks of neatly-packed boxes of strawberries because we answered the clarifying questions wrong.
Fallibility isn’t an answer. You know -you’re- capable of being fallible—if you, right now, knew how to create your AI, who would -you- check with to make sure it wouldn’t go insane and murder everybody? Or even just remain perfectly sane and kill us because we accidentally asked it to?
Yes, yes it did. Fallibility only works if you have a higher authority to go to. Fallibility only works if the higher authority can check your calculations and tell you whether or not it’s a good idea, or at least answer any questions you might have.
See, my job involves me being something of a genie; I interact with people who have poor understanding of their requirements on a daily basis, where I myself have little to no understanding of their requirements, and must ask them clarifying questions. If they get the answer wrong, and I implement that? People could die. “Do nothing” isn’t an option; why have me at all if I do nothing? So I implement what they tell me to do, and hope they answer correctly. I’m the fallible genie, and I hope my authority is infallible.
You don’t get to have fallibility in what you’re looking for, because you don’t have anybody who can actually answer its questions correctly.
Well, the problem here is a misunderstanding of my claim.
(If I really were claiming the things you describe in your above comment, your points would be reasonable. But there is such a strong misunderstanding that your points are hitting a target that, alas, is not there.)
There are several things that I could address, but I will only have time to focus on one. You say:
No. A hundred times no :-). My claim is not even slightly that “a logically consistent AI won’t do anything bad because bad is logically inconsistent”.
The claim is this:
1) The entire class of bad things that these hypothetical AIs are supposed to be doing are a result of the AI systematically (and massively) ignoring contextual information.
(Aside: I am not addressing any particular bad things, on a case-by-case basis, I am dealing with the entire class. As a result, my argument is not vulnerable to charges that I might not be smart enough to guess some really-really-REALLY subtle cases that might come up in the future.)
2) The people who propose these hypothetical AIs have made it absolutely clear that (a) the AI is supposed to be fully cognizant of the fact that the contextual information exists (so the AI is not just plain ignorant), but at the same time (b) the AI does not or cannot take that context into account, but instead executes the plan and does the bad thing.
3) My contribution to this whole debate is to point out that the DESIGN of the AI is incoherent, because the AI is supposed to be able to hold two logically inconsistent ideas (implicit belief in its infallibility and knowledge of its fallibility).
If you look carefully at that argument you will see that it does not make the claim that
I never said that. The logical inconsistency was not in the ‘bad things’ part of the argument. Completely unrelated.
Your other comments are equally confused.
Not acting upon contextual information isn’t the same as ignoring it.
The AI knows, for example, that certain people believe that plants are morally relevant entities—is it possible for it to pick strawberries at all? What contextual information is relevant, and what contextual information is irrelevant? You accuse the “infallible” AI of ignoring contextual information—but you’re ignoring the magical leap of inference you’re taking when you elevate the concerns of the chef over the concerns of the bioethicist who thinks we shouldn’t rip reproductive organs off plants in the first place.
The issue is that fallibility doesn’t -imply- anything. I think this is the best course of action. I’m fallible. I still think this is the best course of action. The fallibility is an unnecessary and pointless step—it doesn’t change my behavior. Either the AI depends upon somebody else, who is treated as an infallible agent—or it doesn’t.
Then we’re in agreement that insane-from-an-outside-perspective behaviors don’t require logical inconsistency?
Sorry, I cannot put any more effort into this. Your comments show no sign of responding to the points actually made (either in the paper itself, or in my attempts to clarify by responding to you).
Maybe, given the number of times you feel you’ve had to repeat yourself, you’re not making yourself as clear as you think you are.
I find that when I talk about this issue with people who clearly have expert knowledge of AI (including the people who came to the AAAI symposium at Stanford last year, and all of the other practising AI builders who are my colleagues), the points I make are not only understood but understood so clearly that they tell me things like “This is just obvious, really, so all you are doing is wasting your time trying to convince a community that is essentially comprised of amateurs” (That is a direct quote from someone at the symposium).
I always want to make myself as clear as I can. I have invested a lot of my time trying to address the concerns of many people who responded to the paper. I am absolutely sure I could do better.
We’re all amateurs in the field of AI, it’s just that some of us actually know it. Seriously, don’t pull the credentials card. I’m not impressed. I know exactly how “hard” it is to pay the AAAI a hundred and fifty dollars a year for membership, and three hundred dollars to attend their conference. Does claiming to have spent four hundred and fifty dollars make you an expert? What about bringing up that it’s in “Stanford”? What about insulting everybody you’re arguing with?
I’m a “practicing AI builder”—what a nonsense term—although my little heuristics engine is actually running in the real world, processing business data and automating hypothesis elevation work for humans (who have the choice of agreeing with its best hypothesis, selecting among its other hypotheses, or entering their own) - that is, it’s actually picking strawberries.
Moving past tit-for-tat on your hostile introduction paragraph, I don’t doubt your desire to be clear. But you have a conclusion you’re very obviously trying to reach, and you leave huge gaps on your way to get there. The fact that others who want to reach the same conclusion overlook the gaps doesn’t demonstrate anything. And what’s your conclusion? That we don’t have to worry about poorly-designed AI being dangerous, because… contextual information, or something. Honestly, I’m not even sure anymore.
Then you propose a model, which you suggest has been modeled after the single most dangerous brain on the planet—as proof that it’s safe! Seriously.
As for whether you could do better? No, not in your current state of mind. Your hubris prevents you from doing better. You’re convinced you know better than any of the people you’re talking with, and they’re ignorant amateurs.
When someone repeatedly distorts and misrepresents what is said in a paper, then blames the author of the paper for being unclear … then hears the author carefully explain the distortions and misrepresentations, and still repeats them without understanding ….
Well, there is a limit.
Not to suggest that you are implying it, but rather as a reminder—nobody is deliberately misunderstanding you here.
But at any rate, I don’t think we’re accomplishing anything here except driving your karma score lower, so by your leave, I’m tapping out.
Why not raise his karma score instead?
Because that was the practical result, not the problem itself, which is that the conversation wasn’t going anywhere, and he didn’t seem interested in it going anywhere.
What does incoherent mean, here?
If it just labels the fact that it has inconsistent beliefs, then it is true but unimpactive… humans can also hold contradictory beliefs and still be intelligent enough to be dangerous.
If it means something amounting to “impossible to build”, then it would be highly impactive… but there is no good reason to think that that is the case.
You’re right to point out that “incoherent” covers a multitude of sins.
I really had three main things in mind.
1) If an AI system is proposed which contains logically contradictory beliefs located in the most central, high-impact area of its system, it is reasonable to ask how such an AI can function when it allows both X and not-X to be in its knowledge base (a classical reasoner that accepts both X and not-X can, by the principle of explosion, derive any conclusion at all). I think I would be owed at least some variety of explanation as to why this would not cause the usual trouble when systems try to do logic in such circumstances. So I am saying “This design that you propose is incoherent because you have omitted to say how this glaring problem is supposed to be resolved”.
(Yes, I’m aware that there are workarounds for contradictory beliefs, but those ideas are usually supposed to apply to pretty obscure corners of the AI’s belief system, not to the component that is in charge of the whole shebang).
2) If an AI perceives itself to be wired in such a way that it is compelled to act as if it was infallible, while at the same time knowing that it is both fallible AND perpetrating acts that are directly caused by its failings (for all the aforementioned reasons that we don’t need to re-argue), then I would suggest that such an AI would do something about this situation. The AI, after all, is supposed to be “superintelligent”, so why would it not take steps to stop this immensely damaging situation from occurring?
So in this case I am saying: “This hypothetical superintelligence has an extreme degree of knowledge about its own design, but it is tolerating a massive and damaging contradiction in its construction without doing anything to resolve the problem: it is incoherent to suggest that such a situation could arise without explaining why the AI tolerates the contradiction and fails to act”
(Aside: you mention that humans can hold contradictory beliefs and still be intelligent enough to be dangerous. Arguing from the human case would not be valid because in other areas of this debate I have been told repeatedly not to accidentally generalize and “assume” that the AI would do something just because humans do something. Now, I actually don’t commit the breaches I am charged with (I claim!) (and that is an argument for another day), but I consider the problem of accidental anthropomorphism to be real, so we should not do that here).
3) Lastly, I can point to the fact that IF the hypothetical AI can engage in this kind of bizarre situation where it compulsively commits action X, while knowing that its knowledge of the world indicates that the consequences will strongly violate the goals that were supposed to justify X, THEN I am owed an explanation for why this type of event does not occur more often. Why is it that the AI does this only when it encounters a goal such as “make humans happy”, and not in a million other goals? Why are there not bizarre plans (which are massively inconsistent with the source goal) all the time?
So in this case I would say: “It is incoherent to suggest an AI design in which a drastic inconsistency of this sort occurs in the case of the “maximize human happiness” goal, but where it doesn’t occur all over the AI’s behavior. In particular I am owed an explanation for why this particular AI is clever enough to be a threat, since it might be expected to have been doing this sort of thing throughout its development, and in that case I would expect it to be so stupid that it would never have made it to superintelligence in the first place.”
Those are the three main areas in which the design would be incoherent … i.e. would have such glaring, unbelievable gaps in the design that those gaps would need to be explained before the hypothetical AI could become at all believable.
I honestly don’t know what more to write to make you understand that you misunderstand what Yudkowsky really means.
You may be suffering from a bad case of the Doctrine of Logical Infallibility, yourself.
What you need to do is address the topic carefully, and eliminate the ad hominem comments like this:
… which talk about me, the person discussing things with you.
I will now examine the last substantial comment you wrote, above.
This is your opening topic statement. Fair enough.
You are agreeing with what I say on this point, so we are in agreement so far.
You make three statements here, but I will start with the second one:
This is a contradiction of the previous paragraph, where you said “Yudkowsky believes that a superintelligent AI [...] will put all humans on dopamine drip despite protests that this is not what they want”.
Your other two statements are that Yudkowsky is NOT saying that the AI will do this “because it is absolutely certain of its conclusions past some threshold”, and he is NOT saying that the AI will “fail to update its beliefs accordingly”.
In the paper I have made a precise statement of what the “Doctrine of Logical Infallibility” means, and I have given references to show that the DLI is a summary of what Yudkowsky et al have been claiming. I have then given you a more detailed explanation of what the DLI is, so you can have it clarified as much as possible.
If you look at every single one of the definitions I have given for the DLI you will see that they are all precisely true of what Yudkowsky says. I will now itemize the DLI into five components so we can find which component is inconsistent with what Yudkowsky has publicly said.
1) The AI decides to do action X (forcing humans to go on a dopamine drip). Everyone agrees that Yudkowsky says this.
2) The AI knows quite well that there is massive, converging evidence that action X is inconsistent with the goal statement Y that was supposed to justify X (where goal statement Y was something like “maximize human happiness”).
This is a point that you and others repeatedly misunderstand or misconstrue, so before you respond to it, let me give details of the “converging evidence” that the AI will be getting:
(a) Screams of protest from humans. “Screams of protest” are generally understood by all knowledgeable intelligent systems as evidence of extreme unhappiness, and evidence of extreme unhappiness is evidence that the goal “maximize human happiness” is not being fulfilled.
(b) Verbalizations from humans that amount to “I am begging you not to do this!”. Such verbalizations are, again, usually considered to be evidence of extreme unhappiness caused by the possibility that ‘this’ is going to be perpetrated.
(c) Patient explanations by the humans that, even though dopamine induced ‘happiness’ might seem to maximize human happiness, the concept of ‘happiness’ exists only by reference to the complete array of desires expressed by humans, and there are many other aspects of happiness not being considered, which trump the dopamine plan. Once again, these patient explanations are a direct statement of the inconsistency of the dopamine plan and real human happiness.
I could probably add to this list continuously, for several days, to document the sum total of all the evidence that the AI would be bombarded with, all pointing to the fact that the dopamine drip plan would be inconsistent with both its accumulated general knowledge about ‘happiness’, and the immediate evidence coming from the human population at that point.
Now, does Yudkowsky believe that the AI will know about this evidence? I have not seen one single denial, by him or any of the others, that the AI will indeed be getting this evidence, and that it will understand that evidence completely. And, on the other hand, most people who read my paper agree that it is quite clear, in the writings of Yudkowsky et al, that they do, positively, agree that the AI will know that this evidence of conflict exists. So this part of the definition of the DLI is also accepted by everyone.
3) If a goal Y leads to a proposed plan, X, that is supposed to achieve goal Y, and yet there is “massive converging evidence” that this plan will lead to a situation that is drastically inconsistent with everything that the AI understands about the concepts referenced in the goal Y, this kind of massive inconsistency would normally be considered as grounds for supposing that there has been a failure in the mechanism that has caused the AI to propose plan X.
Without exception, every AI programmer that I know, who works on real systems, has agreed that it is hard to imagine a clearer indication that something has gone wrong with the mechanism—either a run-time error of some kind, or a design-time programming error. These people (every one of them) go further and say that one of the most important features of ANY control system in an AI is that when it comes up with a candidate plan to satisfy goal Y it must do some sanity checks to see if the candidate plan is consistent with everything it knows about the goal. If those sanity checks start to detect the slightest inconsistency, the AI will investigate in more depth … and if the AI uncovers the kind of truly gigantic inconsistency between its background knowledge and the proposed plan that we have seen in the above, the AI would take the most drastic action possible to cease all activities and turn itself in for a debugging.
This fact about candidate plans and sanity checks for consistency is considered so elementary that most AI programmers laugh at the idea that anyone could be so naive as to think of disagreeing with it. We can safely assume, then, that Yudkowsky is aware of this (indeed, as I wrote in the paper, he has explicitly said that he thinks such sanity-checking mechanisms would be a good idea), so this third component of the DLI definition is also agreed by everyone.
4) In addition to the safe-mode reaction just described in (3), the superintelligent AI being proposed by Yudkowsky and the others would be fully aware of the limitations of all real AI motivation engines, so it would know that a long chain of reasoning from a goal statement to a proposed action plan COULD lead to a proposed plan that was massively inconsistent with both the system’s larger understanding of the meaning of the terms in the goal statement, and with the immediate evidence coming in from the environment at that point.
This knowledge of the AI, about the nature of its own design, is also not denied by anyone. To deny it would be to say that the AI really did not know very much about the design of AI systems—a preposterous idea, since this is supposed to be a superintelligent system that has already been involved in its own redesign, and which is often assumed to be so intelligent that it can understand far more than all of the human race, combined.
So, when the AI’s planning (goal & motivation) system sees the massive inconsistency between its candidate plan X and the terms used in the goal statement Y, it will (in addition to automatically putting itself into safe mode and calling for help) know that this kind of situation could very well be a result of those very limitations.
In other words: the superintelligent AI will know that it is fallible.
I have not seen anyone disagree with this, because it is such an elementary corollary of other known facts about AI systems that it is almost self-evident. So, once again, this component of the DLI definition is not disputed by anyone.
5) Yudkowsky and the others state, repeatedly and in the clearest possible terms, that in spite of all of the above the superintelligent AI they are talking about would NOT put itself into safe mode, as per item 3 above, but would instead insist that ‘human happiness’ was defined by whatever emerged from its reasoning engine, and so it would go ahead and implement the plan X.
Now, the definition—please note, the DEFINITION—of the idea that the postulated AI is following a “Doctrine of Logical Infallibility” is that the postulated AI will do what is described in item (5) above, and NOT do what is described in item (4) above.
This is logically identical to the statement that the postulated AI will behave toward its planning mechanism (which includes its reasoning engine, since it needs to use the latter in the course of unpacking its goals and examining candidate plans) as if that planning mechanism is “infallible”, because it will be giving absolute priority to the output of that mechanism and NOT giving priority to the evidence coming from the consistency-checking mechanism, which is indicating that a failure of some kind has occurred in the planning mechanism.
I do not know why the AI would do this—it is not me who is proposing that it would—but the purpose of the DLI is to encapsulate the proposal made by Yudkowsky and others, to the effect that SOMETHING in the AI makes it behave that way. That is all the DLI is: if an AI does what is described in (5), but not what is described in (4), and it does this in the context of (1), (2) and (3), then by definition it is following the DLI.
All of this assumes that the AI won’t update its goals even if it realizes there is some mistake. That isn’t obvious, and in fact is hard to defend.
An AI that is powerful and effective would need to seek the truth about a lot of things, since an entity that has contradictory beliefs will be a poor instrumental rationalist. But would its goal of truth-seeking necessarily be overridden by other goals… would it know but not care?
It might be possible to build an AI that didn’t care about interpreting its goals correctly. It looks like you would need to engineer a distinction between instrumental beliefs and terminal beliefs. Remember that the terminal/instrumental distinction is conceptual, not a law of nature. (While we’re on the subject, you might need a firewall to stop an AI acting on intrinsically motivating ideas, if they exist.)
In any case, orthogonality is an architecture choice, not an ineluctable fact about minds.
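As a rough illustration of that architecture choice, here is a minimal toy sketch (all names hypothetical, not drawn from MIRI or Loosemore) of the difference between a goal representation that is firewalled from learned knowledge and one that is open to revision when a contradiction is detected.

```python
# Toy sketch: "firewalled" vs "updateable" goal representations (hypothetical design).
class Agent:
    def __init__(self, goal_definition, firewalled):
        self.goal_definition = goal_definition  # e.g. a hardcoded proxy for "happiness"
        self.knowledge = set()                  # instrumental / world knowledge
        self.firewalled = firewalled            # the architecture choice in question

    def learn(self, fact, contradicts_goal_definition=False):
        self.knowledge.add(fact)
        if contradicts_goal_definition and not self.firewalled:
            # Non-firewalled design: a contradiction between the goal proxy and
            # learned knowledge flags the goal for revision (or human review).
            self.goal_definition += " [flagged for revision: " + fact + "]"

rigid = Agent("happiness = dopamine level", firewalled=True)
updateable = Agent("happiness = dopamine level", firewalled=False)
for agent in (rigid, updateable):
    agent.learn("humans protest the dopamine plan", contradicts_goal_definition=True)

print(rigid.goal_definition)       # unchanged: the goal-rigid architecture
print(updateable.goal_definition)  # flagged: the non-rigid architecture
```

The point is only that nothing forces the first variant on a designer; which of the two you get is a choice made when the system is built.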
MIRI’s critics, Loosemore, Hibbard and so on, are tacitly assuming architectures without such unupdateability and firewalling.
MIRI needs to show that such an architecture is likely to occur, either as a design or a natural evolution. If AIs with unupdateable goals are dangerous, as MIRI says, it would be simplest not to use that architecture… if it can be avoided. (“We also agree with Yudkowsky (2008a), who points out that research on the philosophical and technical requirements of safe AGI might show that broad classes of possible AGI architectures are fundamentally unsafe, suggesting that such architectures should be avoided.”) In other words, it would be careless to build a genie that doesn’t care.
If the AI community isn’t going to deliberately build the goal-rigid kind of AI, then MIRI’s arguments come down to how it might be a natural or convergent feature… and the wider AI community finds the goal-rigid idea so unintuitive that it fails to understand MIRI, who in turn fail to make it explicit enough.
When Loosemore talks about the Doctrine of Logical Infallibility, he is supposing there must be some reason why an AI wouldn’t update certain things… he doesn’t see goal unupdateability as an obvious default.
There are a number of points that can be made against the inevitability of goal rigidity.
For one thing, humans don’t show any sign of maintaining lifelong stable goals. (Talk of utility functions as if they were real things disguises this point.)
For another, important classes of real-world AIs don’t have that property. The goal, in a sense, of a neural network is to get positive reinforcement and avoid negative reinforcement (a toy sketch of this point appears after this list).
For another, the desire to preserve goals does not imply the ability to preserve goals. In particular, all intelligent entities likely face a trade-off between self-modifying for improvement and maintaining their goals. An AI might be able to keep its goals stable by refusing to learn or self-modify, but that kind of stick-in-the-mud is also less threatening because less powerful.
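Here is the toy sketch promised above: a minimal reinforcement-learning example (a single-state bandit with a made-up reward function, purely hypothetical). The learner’s effective “goal” lives entirely in the reward signal; there is no explicit goal statement anywhere in the system to hold rigid or to interpret literally.

```python
# Toy bandit: the only "goal" in the system is whatever the reward reinforces.
import random
from collections import defaultdict

ACTIONS = ["pick", "burn"]

def reward(action):
    # Change this function and the learner's effective "goal" changes with it.
    return 1.0 if action == "pick" else -1.0

q = defaultdict(float)          # estimated value of each action
alpha, epsilon = 0.1, 0.1       # learning rate, exploration rate

for _ in range(1000):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)            # explore
    else:
        action = max(ACTIONS, key=lambda a: q[a])  # exploit current estimates
    q[action] += alpha * (reward(action) - q[action])

print({a: round(q[a], 2) for a in ACTIONS})  # 'pick' dominates after training
```

Nothing in this system “wants” anything in the goal-statement sense; its behaviour just tracks the reward signal, which is the point being made about real-world architectures.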
The Orthogonality Thesis is sometimes put forward to support the claim that goal rigidity will occur. To a first approximation, the OT states that any combination of goals and intelligence is possible … and AIs would want to maintain their goals, right?
The devil is in the details.
There is more than one version of the orthogonality thesis. It is trivially false under some interpretations, and trivially true under others. It is more defensible in forms asserting the compatibility of transient combinations of values and intelligence, which are not particularly relevant to AI threat arguments. It is less defensible in forms asserting stable combinations of intelligence and values, and those are the forms that are suitable to be used as a stage in an argument towards Yudkowskian UFAI.
An orthogonality claim of a kind relevant to UFAI must be one that posits the stable and continued co-existence of a set of values with a self-improving AI. The momentary co-existence of values and efficiency is not enough to spawn a Paperclipper-style UFAI. An AI that paperclips for only a nanosecond is no threat.
A learning, self-improving AI will not be able to guarantee that a given self-modification keeps its goals unchanged, since doing so involves the relatively dumber version at time T1 making an accurate prediction about the more complex version at time T2.
The claim that rigid-goal architectures are dangerous does not imply that other architectures are safe. Non-rigid systems may have the advantage of corrigibility, being in some way fixable once they have been switched on. They are likely to put a high value on truth and correctness, since that is both a multi-purpose instrumental goal and a desideratum on the part of the programmers.
But non-rigid AIs might also converge on undesirable goals, for instance evolutionary goals like self-preservation. That’s another story.
The only sense in which the “rigidity” of goals can be said to be a universal fact about minds is that it is these goals that determine how the AI will modify itself once it has become smart and capable enough to do so. It’s not a good idea to modify your goals if you want them to become reality; that seems obviously true to me, except perhaps for a small number of edge cases related to internally incoherent goals.
Your points against the inevitability of goal rigidity don’t seem relevant to this.
If you take the binary view that you’re either smart enough to achieve your goals or not, then you might well want to stop improving when you have the minimum intelligence necessary to meet them… which means, among other things, that AIs with goals requiring human or lower intelligence won’t become superhuman… which lowers the probability of the Clippie scenario. It doesn’t require huge intelligence to make paperclips, so an AI with a goal to make paperclips, but not to make any specific amount, wouldn’t grow into a threatening monster.
The probability of the Clippie scenario is also lowered by the consideration that fine-grained goals might shift during the self-improvement phase, so the Clippie scenario… arbitrary goals combined with a superintelligence… is whittled away from both ends.