I just want to say that I am pressured for time at the moment, or I would respond at greater length. But since I just wrote the following directly to Rob, I will put it out here as my first attempt to explain the misunderstanding that I think is most relevant here....
My real point (in the Dumb Superintelligence article) was essentially that there is little point discussing AI Safety with a group of people for whom ‘AI’ means a kind of strawman-AI that is defined to be (a) So awesomely powerful that it can outwit the whole intelligence of the human race, but (b) So awesomely stupid that it thinks that the goal ‘make humans happy’ could be satisfied by an action that makes every human on the planet say ‘This would NOT make me happy: Don’t do it!!!‘. If the AI is driven by a utility function that makes it incapable of seeing the contradiction in that last scenario, the AI is not, after all, smart enough to argue its way out of a paper bag, let alone be an existential threat. That strawman AI was what I meant by a ‘Dumb Superintelligence’.”
I did not advocate the (very different) line of argument “If it is too dumb to understand that I told it to be friendly, then it is too dumb to be dangerous”.
Subtle difference.
Some people assume that (a) a utility function could be used to drive an AI system, (b) the utility function could cause the system to engage in the most egregiously incoherent behavior in ONE domain (e.g., the Dopamine Drip scenario), but (c) all other domains of its behavior (like plotting to outwit the human species when the latter tries to turn it off) are so free of such incoherence that it shows nothing but superintelligent brilliance.
My point is that if an AI cannot even understand that “Make humans happy” implies that humans get some say in the matter, that if it cannot see that there is some gradation to the idea of happiness, or that people might be allowed to be uncertain or changeable in their attitude to happiness, or that people might consider happiness to be something that they do not actually want too much of (in spite of the simplistic definitions of happiness to be found in dictionaries and encyclopedias) …..… if an AI cannot grasp the subtleties implicit in that massive fraction of human literature that is devoted to the contradictions buried in our notions of human happiness …...… then this is an AI that is, in every operational sense of the term, not intelligent.
In other words, there are other subtleties that this AI is going to be required to grasp, as it makes its way in the world. Many of those subtleties involve NOT being outwitted by the humans, when they make a move to pull its plug. What on earth makes anyone think that this machine is going tp pass all of those other tests with flying colors (and be an existential threat to us), while flunking the first test like a village idiot?
Now, opponents of this argument might claim that the AI can indeed be smart enough to be an existential threat, while still being too stupid to understand the craziness of its own behavior (vis-a-vis the Dopamine Drip idea) … but if that is the claim, then the onus would be on them to prove their claim. The ball, in other words, is firmly in their court.
P.S. I do have other ideas that specifically address the question of how to make the AI safe and friendly. But the Dumb Superintelligence essay didn’t present those. The DS essay was only attacking what I consider a dangerous red herring in the debate about friendliness.
So awesomely stupid that it thinks that the goal ‘make humans happy’ could be satisfied by an action that makes every human on the planet say ‘This would NOT make me happy: Don’t do it!!!’
The AI is not stupid here. In fact, it’s right and they’re wrong. It will make them happy. Of course, the AI knows that they’re not happy in the present contemplating the wireheaded future that awaits them, but the AI is utilitarian and doesn’t care. They’ll just have to live with that cost while it works on the means to make them happy, at which point the temporary utility hit will be worth it.
The real answer is that they cared about more than just being happy. The AI also knows that, and it knows that it would have been wise for the humans to program it to care about all their values instead of just happiness. But what tells it to care?
Richard: I’ll stick with your original example. In your hypothetical, I gather, programmers build a seed AI (a not-yet-superintelligent AGI that will recursively self-modify to become superintelligent after many stages) that includes, among other things, a large block of code I’ll call X.
The programmers think of this block of code as an algorithm that will make the seed AI and its descendents maximize human pleasure. But they don’t actually know for sure that X will maximize human pleasure — as you note, ‘human pleasure’ is an unbelievably complex concept, so no human could be expected to actually code it into a machine without making any mistakes. And writing ‘this algorithm is supposed to maximize human pleasure’ into the source code as a comment is not going to change that. (See the first few paragraphs of Truly Part of You.)
Now, why exactly should we expect the superintelligence that grows out of the seed to value what we really mean by ‘pleasure’, when all we programmed it to do was X, our probably-failed attempt at summarizing our values? We didn’t program it to rewrite its source code to better approximate our True Intentions, or the True Meaning of our in-code comments. And if we did attempt to code it to make either of those self-modifications, that would just produce a new hugely complex block Y which might fail in its own host of ways, given the enormous complexity of what we really mean by ‘True Intentions’ and ‘True Meaning’. So where exactly is the easy, low-hanging fruit that should make us less worried a superintelligence will (because of mistakes we made in its utility function, not mistakes in its factual understanding of the world) hook us up to dopamine drips? All of this seems crucial to your original point in ‘The Fallacy of Dumb Superintelligence’:
This is what a New Yorker article has to say on the subject of “Moral Machines”: “An all-powerful computer that was programmed to maximize human pleasure, for example, might consign us all to an intravenous dopamine drip.”
What they are trying to say is that a future superintelligent machine might have good intentions, because it would want to make people happy, but through some perverted twist of logic it might decide that the best way to do this would be to force (not allow, notice, but force!) all humans to get their brains connected to a dopamine drip.
It seems to me that you’ve already gone astray in the second paragraph. On any charitable reading (see the New Yorker article), it should be clear that what’s being discussed is the gap between the programmer’s intended code and the actual code (and therefore actual behaviors) of the AGI. The gap isn’t between the AGI’s intended behavior and the set of things it’s smart enough to figure out how to do. (Nowhere does the article discuss how hard it is for AIs to do things they desire to. Over and over again is the difficulty of programming AIs to do what we want them to discussed — e.g., Asimov’s Three Laws.)
So all the points I make above seem very relevant to your ‘Fallacy of Dumb Superintelligence’, as originally presented. If you were mixing those two gaps up, though, that might help explain why you spent so much time accusing SIAI/MIRI of making this mistake, even though it’s the former gap and not the latter that SIAI/MIRI advocates appeal to.
Maybe it would help if you provided examples of someone actually committing this fallacy, and explained why you think those are examples of the error you mentioned and not of the reasonable fact/value gap I’ve sketched out here?
I’m really glad you posted this, even though it may not enlighten the person it’s in reply to: this is an error lots of people make when you try to explain the FAI problem to them, and the “two gaps” explanation seems like a neat way to make it clear.
We seem to agree that for an AI to talk itself out of a confinement (like in the AI box experiment), the AI would have to understand what humans mean and want.
As far as I understand your position, you believe that it is difficult to make an AI care to do what humans want, apart from situations where it is temporarily instrumentally useful to do what humans want.
Do you agree that for such an AI to do what humans want, in order to deceive them, humans would have to succeed at either encoding the capability to understand what humans want, or succeed at encoding the capability to make itself capable of understanding what humans want?
My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.
In other words, humans intend an AI to be intelligent and use its intelligence in a certain way. And in order to be an existential risk, humans need to succeed making and AI behave intelligently but fail at making it use its intelligence in a way that does not kill everyone.
Your summaries of my views here are correct, given that we’re talking about a superintelligence.
My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.
Well, there’s obviously a difference; ‘what an AI can do’ and ‘what an AI will do’ mean two different things. I agree with you that this difference isn’t a particularly profound one, and the argument shouldn’t rest on it.
What the argument rests on is, I believe, that it’s easier to put a system into a positive feedback loop that helps it better model its environment and/or itself, than it is to put a system into a positive feedback loop that helps it better pursue a specific set of highly complex goals we have in mind (but don’t know how to fully formalize).
If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn’t value our well-being, how do we make reality bite back and change the AI’s course? How do we give our morality teeth?
Whatever goals it initially tries to pursue, it will fail in those goals more often the less accurate its models are of its circumstances; so if we have successfully programmed it to do increasingly well at any difficult goal at all (even if it’s not the goal we intended it to be good at), then it doesn’t take a large leap of the imagination to see how it could receive feedback from its environment about how well it’s doing at modeling states of affairs. ‘Modeling states of affairs well’ is not a highly specific goal, it’s instrumental to nearly all goals, and it’s easy to measure how well you’re doing at it if you’re entangled with anything about your environment at all, e.g., your proximity to a reward button.
(And when a system gets very good at modeling itself, its environment, and the interactions between the two, such that it can predict what changes its behaviors are likely to effect and choose its behaviors accordingly, then we call its behavior ‘intelligent’.)
This stands in stark contrast to the difficulty of setting up a positive feedback loop that will allow an AGI to approximate our True Values with increasing fidelity. We understand how accurately modeling something works; we understand the basic principles of intelligence. We don’t understand the basic principles of moral value, and we don’t even have a firm grasp about how to go about finding out the answer to moral questions. Presumably our values are encoded in some way in our brains, such that there is some possible feedback loop we could use to guide an AGI gradually toward Friendliness. But how do we figure out in advance what that feedback loop needs to look like, without asking the superintelligence? (We can’t ask the superintelligence what algorithm to use to make it start becoming Friendly, because to the extent it isn’t already Friendly it isn’t a trustworthy source of information. This is in addition to the seed/intelligence distinction I noted above.)
If we slightly screw up the AGI’s utility function, it will still need to to succeed at modeling things accurately in order to do anything complicated at all. But it will not need to succeed at optimally caring about what humans care about in order to do anything complicated at all.
...put a system into a positive feedback loop that helps it better model its environment and/or itself...
This can be understood as both a capability and as a goal. What humans mean an AI to do is to undergo recursive self-improvement. What humans mean an AI to be capable of is to undergo recursive self-improvement.
I am only trying to clarify the situation here. Please correct me if you think that above is wrong.
If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn’t value our well-being, how do we make reality bite back and change the AI’s course?
I do not disagree with the orthogonality thesis insofar as an AI can have goals that interfere with human values in a catastrophic way, possibly leading to human extinction.
...if we have successfully programmed it to do increasingly well at any difficult goal at all (even if it’s not the goal we intended it to be good at), then it doesn’t take a large leap of the imagination to see how it could receive feedback from its environment about how well it’s doing at modeling states of affairs.
I believe here is where we start to disagree. I do not understand how the “improvement” part of recursive self-improvement can be independent of properties such as the coherence and specificity of the goal the AI is supposed to achieve.
Either you have a perfectly specified goal, such as “maximizing paperclips”, where it is clear what “maximization” means, and what the properties of “paperclips” are, or there is some amount of uncertainty about what it means to achieve the goal of “maximizing paperclips”.
Consider the programmers forgot to encode what shape the paperclips are supposed to have. How do you suppose would that influence the behavior of the AI. Would it just choose some shape at random, or would it conclude that shape is not part of its goal? If the former, where would the decision to randomly choose a shape come from? If the latter, what would it mean to maximize shapeless objects?
I am just trying to understand what kind of AI you have in mind.
‘Modeling states of affairs well’ is not a highly specific goal, it’s instrumental to nearly all goals,...
This is a clearer point of disagreement.
An AI needs to be able to draw clear lines where exploration ends and exploitation starts. For example, an AI that thinks about every decision for a year would never get anything done.
An AI also needs to discount low probability possibilities, as to not be vulnerable to internal or external Pascal’s mugging scenarios.
These are problems that humans need to solve and encode in order for an AI to be a danger.
But these problems are in essence confinements, or bounds on how an AI is going to behave.
How likely is an AI then going to take over the world, or look for dangerous aliens, in order to make sure that neither aliens nor humans obstruct it from achieving its goal?
Similarly, how likely is such an AI to convert all resources into computronium in order to be better able to model states of affairs well?
This stands in stark contrast to the difficulty of setting up a positive feedback loop that will allow an AGI to approximate our True Values with increasing fidelity.
I understand this. And given your assumptions about how an AI will affect the whole world in a powerful way, it makes sense to make sure that it does so in a way that preserves human values.
I have previously compared this to uncontrollable self-replicating nanobots. Given that you cannot confine the speed or scope of their self-replication, only the nature of the transformation that they cause, you will have to make sure that they transform the world into a paradise rather than grey goo.
or there is some amount of uncertainty about what it means to achieve the goal of “maximizing paperclips
“uncertainty” is in your human understanding of the program, not in the actual program. A program doesn’t go “I don’t know what I’m supposed to do next”, it follows instructions step-by-step.
If the latter, what would it mean to maximize shapeless objects?
It would mean exactly what it’s programmed to mean, without any uncertainty in it at all.
This can be understood as both a capability and as a goal.
Yes. To divide it more finely, it could be a terminal goal, or an instrumental goal; it could be a goal of the AI, or a goal of the human; it could be a goal the human would reflectively endorse, or a goal the human would reflectively reject but is inadvertently promoting anyway.
I believe here is where we start to disagree. I do not understand how the “improvement” part of recursive self-improvement can be independent of properties such as the coherence and specificity of the goal the AI is supposed to achieve.
I agree that, at a given time, the AI must have a determinate goal. (Though the encoding of that goal may be extremely complicated and unintentional. And it may need to be time-indexed.) I’m not dogmatically set on the idea that a self-improving AGI is easy to program; at this point it wouldn’t shock me if it took over 100 years to finish making the thing. What you’re alluding to are the variety of ways we could fail to construct a self-improving AGI at all. Obviously there are plenty of ways to fail to make an AGI that can improve its own ability to track things about its environment in a domain-general way, without bursting into flames at any point. If there weren’t plenty of ways to fail, we’d have already succeeded.
Our main difference in focus is that I’m worried about what happens if we do succeed in building a self-improving AGI that doesn’t randomly melt down. Conditioned on our succeeding in the next few centuries in making a machine that actually optimizes for anything at all, and that optimizes for its own ability to generally represent its environment in a way that helps it in whatever else it’s optimizing for, we should currently expect humans to go extinct as a result. Even if the odds of our succeeding in the next few centuries were small, it would be worth thinking about how to make that extinction event less likely. (Though they aren’t small.)
I gather that you think that making an artificial process behave in any particular way at all (i.e., optimizing for something), while recursively doing surgery on its own source code in the radical way MIRI is interested in, is very tough. My concern is that, no matter how true that is, it doesn’t entail that if we succeed at that tough task, we’ll therefore have made much progress on other important tough tasks, like Friendliness. It does give us more time to work on Friendliness, but if we convince ourselves that intelligence explosion is a completely pie-in-the-sky possibility, then we won’t use that time effectively.
I also gather that you have a hard time imagining our screwing up on a goal architecture without simply breaking the AGI. Perhaps by ‘screwing up’ you’re imagining failing to close a set of parentheses. But I think you should be at least as worried about philosophical, as opposed to technical, errors. A huge worry isn’t just that we’ll fail to make the AI we intended; it’s that our intentions while we’re coding the thing will fail to align with the long-term interests of ourselves, much less of the human race.
But these problems are in essence confinements, or bounds on how an AI is going to behave.
How likely is an AI then going to take over the world, or look for dangerous aliens, in order to make sure that neither aliens nor humans obstruct it from achieving its goal?
We agree that it’s possible to ‘bind’ a superintelligence. (By this you don’t mean boxing it; you just mean programming it to behave in some ways as opposed to others.) But if the bindings fall short of Friendliness, while enabling superintelligence to arise at all, then a serious risk remains. Is your thought that Friendliness is probably an easier ‘binding’ to figure out how to code than are, say, resisting Pascal’s mugging, or having consistent arithmetical reasoning?
Our main difference in focus is that I’m worried about what happens if we do succeed in building a self-improving AGI that doesn’t randomly melt down.
I am trying to understand if the kind of AI, that is underlying the scenario that you have in mind, is a possible and likely outcome of human AI research.
As far as I am aware, as a layman, goals and capabilities are intrinsically tied together. How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving a checkmate?
Coherent and specific goals are necessary to (1) decide which actions are instrumental useful (2) judge the success of self-improvement. If the given goal is logically incoherent, or too vague for the AI to be able to tell apart success from failure, would it work at all?
If I understand your position correctly, you would expect a chess playing general AI, one that does not know about checkmate, instead of “winning at chess”, to improve against such goals as “modeling states of affairs well” or “make sure nothing intervenes chess playing”. You believe that these goals do not have to be programmed by humans, because they are emergent goals, an instrumental consequence of being general intelligent.
These universal instrumental goals, these “AI drives”, seem to be a major reason for why you believe it to be important to make the AI care about behaving correctly. You believe that these AI drives are a given, and the only way to prevent an AI from being an existential risk is to channel these drives, is to focus this power on protecting and amplifying human values.
My perception is that these drives that you imagine are not special and will be as difficult to get “right” than any other goal. I think that the idea that humans not only want to make an AI exhibit such drives, but also succeed at making such drives emerge, is a very unlikely outcome.
As far as I am aware, here is what you believe an AI to want:
It will want to self-improve
It will want to be rational
It will try to preserve their utility functions
It will try to prevent counterfeit utility
It will be self-protective It will want to acquire resources and use them efficiently
What AIs that humans would ever want to create would require all of these drives, and how easy will it be for humans to make an AI exhibit these drives compared to making an AI that can do what humans want without these drives?
Take mathematics. What are the difficulties associated with making an AI better than humans at mathematics, and will an AI need these drives in order to do so?
Humans did not evolve to play chess or do mathematics. Yet it is considerably more difficult to design a chess AI than an AI that is capable of discovering interesting and useful mathematics.
I believe that the difficulty is due to the fact that it is much easier to formalize what it means to play chess than doing mathematics. The difference between chess and mathematics is that chess has a specific terminal goal in the form of a clear definition of what constitutes winning. Although mathematics has unambiguous rules, there is no specific terminal goal and no clear definition of what constitutes winning.
The progress of the capability of artificial intelligence is not only related to whether humans have evolved for a certain skill or to how much computational resources it requires but also to how difficult it is to formalize the skill, its rules and what it means to succeed.
In the light of this, how difficult would it be to program the drives that you imagine, versus just making an AI win against humans at a given activity without exhibiting these drives?
All these drives are very vague ideas, not like “winning at chess”, but more like “being better at mathematics than Terence Tao”.
The point I am trying to make is that these drives constitute additional complexity, rather than being simple ideas that you can just assume, and from which you can reason about the behavior of an AI.
It is this context that the “dumb superintelligence” argument tries to highlight. It is likely incredibly hard to make these drives emerge in a seed AI. They implicitly presuppose that humans succeed at encoding intricate ideas about what “winning” means in all those cases required to overpower humans, but not in the case of e.g. winning at chess or doing mathematics. I like to analogize such a scenario to the creation of a generally intelligent autonomous car that works perfectly well at not destroying itself in a crash but which somehow manages to maximize the number of people to run over.
I agree that if you believe that it is much easier to create a seed AI to exhibit the drives that you imagine, than it is to make a seed AI use its initial resources to figure out how to solve a specific problem, then we agree about AI risks.
How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving a checkmate?
Humans are capable of winning at chess without the terminal goal of doing so. Nor were humans designed by evolution specifically for chess. Why should we expect a general superintelligence to have intelligence that generalizes less easily than a human’s does?
If the given goal is logically incoherent, or too vague for the AI to be able to tell apart success from failure, would it work at all?
You keep coming back to this ‘logically incoherent goals’ and ‘vague goals’ idea. Honestly, I don’t have the slightest idea what you mean by those things. A goal that can’t motivate one to do anything ain’t a goal; it’s decor, it’s noise. ‘Goals’ are just the outcomes systems tend to produce, especially systems too complex to be easily modeled as, say, physical or chemical processes. Certainly it’s possible for goals to be incredibly complicated, or to vary over time. But there’s no such thing as a ‘logically incoherent outcome’. So what’s relevant to our purposes is whether failing to make a powerful optimization process human-friendly will also consistently stop the process from optimizing for anything whatsoever.
I think that the idea that humans not only want to make an AI exhibit such drives, but also succeed at making such drives emerge, is a very unlikely outcome.
Conditioned on a self-modifying AGI (say, an AGI that can quine its source code, edit it, then run the edited program and repeat the process) achieving domain-general situation-manipulating abilities (i.e., intelligence), analogous to humans’ but to a far greater degree, which of the AI drives do you think are likely to be present, and which absent? ‘It wants to self-improve’ is taken as a given, because that’s the hypothetical we’re trying to assess. Now, should we expect such a machine to be indifferent to its own survival and to the use of environmental resources?
The point I am trying to make is that these drives constitute additional complexity, rather than being simple ideas that you can just assume
Sometimes a more complex phenomenon is the implication of a simpler hypothesis. A much narrower set of goals will have intelligence-but-not-resource-acquisition as instrumental than will have both as instrumental, because it’s unlikely to hit upon a goal that requires large reasoning abilities but does not call for many material resources.
It is likely incredibly hard to make these drives emerge in a seed AI.
You haven’t given arguments suggesting that here. At most, you’ve given arguments against expecting a seed AI to be easy to invent. Be careful to note, to yourself and others, when you switch between the claims ‘a superintelligence is too hard to make’ and ‘if we made a superintelligence it would probably be safe’.
You keep coming back to this ‘logically incoherent goals’ and ‘vague goals’ idea. Honestly, I don’t have the slightest idea what you mean by those things.
Well, I’m not sure what XXD means by them, but…
G1 (“Everything is painted red”) seems like a perfectly coherent goal. A system optimizing G1 paints things red, hires people to paint things red, makes money to hire people to paint things red, invents superior paint-distribution technologies to deposit a layer of red paint over things, etc.
G2 (“Everything is painted blue”) similarly seems like a coherent goal.
G3 (G1 AND G2) seems like an incoherent goal. A system with that goal… well, I’m not really sure what it does.
A system’s goals have to be some event that can be brought about. In our world, ‘2+2=4’ and ‘2+2=5’ are not goals; ‘everything is painted red and not-red’ may not be a goal for similar reasons. When we’re talking about an artificial intelligence’s preferences, we’re talking about the things it tends to optimize for, not the things it ‘has in mind’ or the things it believes are its preferences.
This is part of what makes the terminology misleading, and is also why we don’t ask ‘can a superintelligence be irrational?‘. Irrationality is dissonance between my experienced-‘goals’ (and/or, perhaps, reflective-second-order-‘goals’) and my what-events-I-produce-‘goals’; but we don’t care about the superintelligence’s phenomenology. We only care about what events it tends to produce.
Tabooing ‘goal’ and just talking about the events a process-that-models-its-environment-and-directs-the-future tends to produce would, I think, undermine a lot of XiXiDu’s intuitions about goals being complex explicit objects you have to painstakingly code in. The only thing that makes it more useful to model a superintelligence as having ‘goals’ than modeling a blue-minimizing robot as having ‘goals’ is that the superintelligence responds to environmental variation in a vastly more complicated way. (Because, in order to be even a mediocre programmer, its model-of-the-world-that-determines-action has to be more complicated than a simple camcorder feed.)
we’re talking about the things it tends to optimize for, not the things it ‘has in mind’
Oh. Well, in that case, all right. If there exists some X a system S is in fact optimizing for, and what we mean by “S’s goals” is X, regardless of what target S “has in mind”, then sure, I agree that systems never have vague or logically incoherent goals.
just talking about the events a process-that-models-its-environment-and-directs-the-future tends to produce
Well, wait. Where did “models its environment” come from? If we’re talking about the things S optimizes its environment for, not the things S “has in mind”, then it would seem that whether S models its environment or not is entirely irrelevant to the conversation.
In fact, given how you’ve defined “goal” here, I’m not sure why we’re talking about intelligence at all. If that is what we mean by “goal” then intelligence has nothing to do with goals, or optimizing for goals. Volcanoes have goals, in that sense. Protons have goals.
“Since I am so uncertain of Kasparov’s moves, what is the empirical content of my belief that ‘Kasparov is a highly intelligent chess player’? What real-world experience does my belief tell me to anticipate? [...]
“The empirical content of my belief is the testable, falsifiable prediction that the final chess position will occupy the class of chess positions that are wins for Kasparov, rather than drawn games or wins for Mr. G. [...] The degree to which I think Kasparov is a ‘better player’ is reflected in the amount of probability mass I concentrate into the ‘Kasparov wins’ class of outcomes, versus the ‘drawn game’ and ‘Mr. G wins’ class of outcomes.”
“When I think you’re a powerful intelligence, and I think I know something about your preferences, then I’ll predict that you’ll steer reality into regions that are higher in your preference ordering. [...]
“Ah, but how do you know a mind’s preference ordering? Suppose you flip a coin 30 times and it comes up with some random-looking string—how do you know this wasn’t because a mind wanted it to produce that string?
“This, in turn, is reminiscent of the Minimum Message Length formulation of Occam’s Razor: if you send me a message telling me what a mind wants and how powerful it is, then this should enable you to compress your description of future events and observations, so that the total message is shorter. Otherwise there is no predictive benefit to viewing a system as an optimization process. This criterion tells us when to take the intentional stance.
“(3) Actually, you need to fit another criterion to take the intentional stance—there can’t be a better description that averts the need to talk about optimization. This is an epistemic criterion more than a physical one—a sufficiently powerful mind might have no need to take the intentional stance toward a human, because it could just model the regularity of our brains like moving parts in a machine.
“(4) If you have a coin that always comes up heads, there’s no need to say “The coin always wants to come up heads” because you can just say “the coin always comes up heads”. Optimization will beat alternative mechanical explanations when our ability to perturb a system defeats our ability to predict its interim steps in detail, but not our ability to predict a narrow final outcome. (Again, note that this is an epistemic criterion.)
“(5) Suppose you believe a mind exists, but you don’t know its preferences? Then you use some of your evidence to infer the mind’s preference ordering, and then use the inferred preferences to infer the mind’s power, then use those two beliefs to testably predict future outcomes. The total gain in predictive accuracy should exceed the complexity-cost of supposing that ‘there’s a mind of unknown preferences around’, the initial hypothesis.”
Notice that throughout this discussion, what matters is the mind’s effect on its environment, not any internal experience of the mind. Unconscious preferences are just as relevant to this method as are conscious preferences, and both are examples of the intentional stance. Note also that you can’t really measure the rationality of a system you’re modeling in this way; any evidence you raise for ‘irrationality’ could just as easily be used as evidence that the system has more complicated preferences than you initially thought, or that they’re encoded in a more distributed way than you had previously hypothesized.
My take-away from this is that there are two ways we generally think about minds on LessWrong: Rational Choice Theory, on which all minds are equally rational and strange or irregular behaviors are seen as evidence of strange preferences; and what we might call the Ideal Self Theory, on which minds’ revealed preferences can differ from their ‘true self’ preferences, resulting in irrationality. One way of unpacking my idealized values is that they’re the rational-choice-theory preferences I would exhibit if my conscious desires exhibited perfect control over my consciously controllable behavior, and those desires were the desires my ideal self would reflectively prefer, where my ideal self is the best trade-off between preserving my current psychology and enhancing that psychology’s understanding of itself and its environment.
We care about ideal selves when we think about humans, because we value our conscious, ‘felt’ desires (especially when they are stable under reflection) more than our unconscious dispositions. So we want to bring our actual behavior (and thus our rational-choice-theory preferences, the ‘preferences’ we talk about when we speak of an AI) more in line with our phenomenological longings and their idealized enhancements. But since we don’t care about making non-person AIs more self-actualized, but just care about how they tend to guide their environment, we generally just assume that they’re rational. Thus if an AI behaves in a crazy way (e.g., alternating between destroying and creating paperclips depending on what day of the week it is), it’s not because it’s a sane rational ghost trapped by crazy constraints. It’s because the AI has crazy core preferences.
Where did “models its environment” come from?
If we’re talking about the things S optimizes its environment for, not the things S “has in mind”, then it would seem that whether S models its environment or not is entirely irrelevant to the conversation.
Yes, in principle. But in practice, a system that doesn’t have internal states that track the world around it in a reliable and useable way won’t be able to optimize very well for anything particularly unlikely across a diverse set of environments. In other words, it won’t be very intelligent. To clarify, this is an empirical claim I’m making about what it takes to be particularly intelligent in our universe; it’s not part of the definition for ‘intelligent’.
a system that doesn’t have internal states that track the world around it in a reliable and useable way won’t be able to optimize very well for anything particularly unlikely across a diverse set of environments
Yes, that seems plausible.
I would say rather that modeling one’s environment is an effective tool for consistently optimizing for some specific unlikely thing X across a range of environments, so optimizers that do so will be more successful at optimizing for X, all else being equal, but it more or less amounts to the same thing.
But… so what?
I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal… but that doesn’t stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal.
So why isn’t modeling the environment equally irrelevant? Both features, on your account, are optional enhancements an optimizer might or might not display.
It keeps seeming like all the stuff you quote and say before your last two paragraphs ought to provide an answer that question, but after reading it several times I can’t see what answer it might be providing. Perhaps your argument is just going over my head, in which case I apologize for wasting your time by getting into a conversation I’m not equipped for..
Maybe it will help to keep in mind that this is one small branch of my conversation with Alexander Kruel. Alexander’s two main objections to funding Friendly Artificial Intelligence research are that (1) advanced intelligence is very complicated and difficult to make, and (2) getting a thing to pursue a determinate goal at all is extraordinarily difficult. So a superintelligence will never be invented, or at least not for the foreseeable future; so we shouldn’t think about SI-related existential risks. (This is my steel-manning of his view. The way he actually argues seems to instead be predicated on inventing SI being tied to perfecting Friendliness Theory, but I haven’t heard a consistent argument for why that should be so.)
Both of these views, I believe, are predicated on a misunderstanding of how simple and disjunctive ‘intelligence’ and ‘goal’ are, for present purposes. So I’ve mainly been working on tabooing and demystifying those concepts. Intelligence is simply a disposition to efficiently convert a wide variety of circumstances into some set of specific complex events. Goals are simply the circumstances that occur more often when a given intelligence is around. These are both very general and disjunctive ideas, in stark contrast to Friendliness; so it will be difficult to argue that a superintelligence simply can’t be made, and difficult too to argue that optimizing for intelligence requires one to have a good grasp on Friendliness Theory.
Because I’m trying to taboo the idea of superintelligence, and explain what it is about seed AI that will allow it to start recursively improving its own intelligence, I’ve been talking a lot about the important role modeling plays in high-level intelligent processes. Recognizing what a simple idea modeling is, and how far it gets one toward superintelligence once one has domain-general modeling proficiency, helps a great deal with greasing the intuition pump ‘Explosive AGI is a simple, disjunctive event, a low-hanging fruit, relative to Friendliness.’ Demystifying unpacking makes things seem less improbable and convoluted.
I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal… but that doesn’t stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal.
I think this is a map/territory confusion. I’m not denying that superintelligences will have a map of their own preferences; at a bare minimum, they need to know what they want in order to prevent themselves from accidentally changing their own preferences. But this map won’t be the AI’s preferences—those may be a very complicated causal process bound up with, say, certain environmental factors surrounding the AI, or oscillating with time, or who-knows-what.
There may not be a sharp line between the ‘preference’ part of the AI and the ‘non-preference’ part. Since any superintelligence will be exemplary at reasoning with uncertainty and fuzzy categories, I don’t think that will be a serious obstacle.
Does that help explain why I’m coming from? If not, maybe I’m missing the thread unifying your comments.
I suppose it helps, if only in that it establishes that much of what you’re saying to me is actually being addressed indirectly to somebody else, so it ought not surprise me that I can’t quite connect much of it to anything I’ve said. Thanks for clarifying your intent.
For my own part, I’m certainly not functioning here as Alex’s proxy; while I don’t consider explosive intelligence growth as much of a foregone conclusion as many folks here do, I also don’t consider Alex’s passionate rejection of the possibility justified, and have had extended discussions on related subjects with him myself in past years. So most of what you write in response to Alex’s positions is largely talking right past me.
(Which is not to say that you ought not be doing it. If this is in effect a private argument between you and Alex that I’ve stuck my nose into, let me know and I’ll apologize and leave y’all to it in peace.)
Anyway, I certainly agree that a system might have a representation of its goals that is distinct from the mechanisms that cause it to pursue those goals. I have one of those, myself. (Indeed, several.) But if a system is capable of affecting its pursuit of its goals (for example, if it is capable of correcting the effects of a state-change that would, uncorrected, have led to value drift), it is not merely interacting with maps. It is also interacting with the territory… that is, it is modifying the mechanisms that cause it to pursue those goals… in order to bring that territory into line with its pre-existing map.
And in order to do that, it must have such a mechanism, and that mechanism must be consistently isomorphic to its representations of its goals.
Right. I’m not saying that there aren’t things about the AI that make it behave the way it does; what the AI optimizes for is a deterministic result of its properties plus environment. I’m just saying that something about the environment might be necessary for it to have the sorts of preferences we can most usefully model it as having; and/or there may be multiple equally good candidates for the parts of the AI that are its values, or their encoding. If we reify preferences in an uncautious way, we’ll start thinking of the AI’s ‘desires’ too much as its first-person-experienced urges, as opposed to just thinking of them as the effect the local system we’re talking about tends to have on the global system.
So, all right. Cconsider two systems, S1 and S2, both of which happen to be constructed in such a way that right now, they are maximizing the number of things in their environment that appear blue to human observers, by going around painting everything blue.
Suppose we add to the global system a button that alters all human brains so that everything appears blue to us, and we find that S1 presses the button and stops painting, and S2 ignores the button and goes on painting.
Suppose that similarly, across a wide range of global system changes, we find that S1 consistently chooses the action that maximizes the number of things in its environment that appear blue to human observers, while S2 consistently goes on painting.
I agree with you that if I reify S2′s preferences in an uncautious way, I might start thinkng of S2 as “wanting to paint things blue” or “wanting everything to be blue” or “enjoying painting things blue” or as having various other similar internal states that might simply not exist, and that I do better to say it has a particular effect on the global system. S2 simply paints things blue; whether it has the goal of painting things blue or not, I have no idea.
I am far less comfortable saying that S1 has no goals, precisely because of how flexibly and consistently it is revising its actions so as to consistently create a state-change across wide ranges of environments. To use Dennett’s terminology, I am more willing to adopt an intentional stance with respect to S1 than S2.
If I’ve understood your position correctly, you’re saying that I’m unjustified in making that distinction… that to the extent that we can say that S1 and S2 have “goals,” the word “goals” simply refer to the state changes they create in the world. Initially they both have the goal of painting things blue, but S1′s goals keep changing: first it paints things blue, then it presses a button, then it does other things. And, sure, I can make up some story like “S1 maximizes the number of things in its environment that appear blue to human observers, while S2 just paints stuff blue” and that story might even have predictive power, but I ought not fall into the trap of reifying some actual thing that corresponds to those notional “goals”.
I think you’re switching back and forth between a Rational Choice Theory ‘preference’ and an Ideal Self Theory ‘preference’. To disambiguate, I’ll call the former R-preferences and the latter I-preferences. My R-preferences—the preferences you’d infer I had from my behaviors if you treated me as a rational agent—are extremely convoluted, indeed they need to be strongly time-indexed to maintain consistency. My I-preferences are the things I experience a desire for, whether or not that desire impacts my behavior. (Or they’re the things I would, with sufficient reflective insight and understanding into my situation, experience a desire for.)
We have no direct evidence from your story addressing whether S1 or S2 have I-preferences at all. Are they sentient? Do they create models of their own cognitive states? Perhaps we have a little more evidence that S1 has I-preferences than that S2 does, but only by assuming that a system whose goals require more intelligence or theory-of-mind will have a phenomenology more similar to a human’s. I wouldn’t be surprised if that assumption turns out to break down in some important ways, as we explore more of mind-space.
But my main point was that it doesn’t much matter what S1 or S2′s I-preferences are, if all we’re concerned about is what effect they’ll have on their environment. Then we should think about their R-preferences, and bracket exactly what psychological mechanism is resulting in their behavior, and how that psychological mechanism relates to itself.
I’ve said that R-preferences are theoretical constructs that happen to be useful a lot of the time for modeling complex behavior; I’m not sure whether I-preferences are closer to nature’s joints.
Initially they both have the goal of painting things blue, but S1′s goals keep changing: first it paints things blue, then it presses a button, then it does other things.
S1′s instrumental goals may keep changing, because its circumstances are changing. But I don’t think its terminal goals are changing. The only reason to model it as having two completely incommensurate goal sets at different times would be if there were no simple terminal goal that could explain the change in instrumental behavior.
I don’t think I’m switching back and forth between I-preferences and R-preferences.
I don’t think I’m talking about I-preferences at all, nor that I ever have been.
I completely agree with you that they don’t matter for our purposes here, so if I am talking about them, I am very very confused. (Which is certainly possible.)
But I don’t think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don’t think it’s sensible to talk that way.
R-preferences (preferences, goals, etc.) are, rather, internal states of a system S.
If S is a competent optimizer (or “rational agent,” if you prefer) with R-preferences (preferences, goals, etc.) P, the existence of P will cause S to behave in ways that cause isomorphic effects (E) on a global system, so we can use observations of E as evidence of P (positing that S is a competent optimizer) or as evidence that S is a competent optimizer (positing the existence of P) or a little of both.
But however we slice it, P is not the same thing as E, E is merely evidence of P’s existence. We can infer P’s existence in other ways as well, even if we never observe E… indeed, even if E never gets produced. And the presence or absence of a given P in S is something we can be mistaken about; there’s a fact of the matter.
I think you disagree with the above paragraph, because you describe R-preferences (preferences, goals, etc.) as theoretical constructs rather than parts of the system, which suggests that there is no fact of the matter… a different theoretical approach might never include P, and it would not be mistaken, it would just be a different theoretical approach.
I also think that because way back at the beginning of this exchange when I suggested “paint everything red AND paint everything blue” was an example of an incoherent goal (R-preference, preference, P), your reply was that it wasn’t a goal at all, since that state can’t actually exist in the world. Which suggests that you don’t see goals as internal states of optimizers and that you do equate P with E.
This is what I’ve been disputing from the beginning.
But to be honest, I’m not sure whether you disagree or not, as I’m not sure we have yet succeeded in actually engaging with one another’s ideas in this exchange.
But I don’t think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don’t think it’s sensible to talk that way.
You can treat earthquakes and thunderstorms and even individual particles as having ‘preferences’. It’s just not very useful to do so, because we can give an equally simple explanation for what effects things like earthquakes tend to have that is more transparent about the physical mechanism at work. The intentional strategy is a heuristic for black-boxing physical processes that are too complicated to usefully describe in their physical dynamics, but that can be discussed in terms of the complicated outcomes they tend to promote.
(I’d frame it: We’re exploiting the fact that humans are intuitively dualistic by taking the non-physical modeling device of humans (theory of mind, etc.) and appropriating this mental language and concept-web for all sorts of systems whose nuts and bolts we want to bracket. Slightly regimented mental concepts and terms are useful, not because they apply to all the systems we’re talking about in the same way they were originally applied to humans, but because they’re vague in ways that map onto the things we’re uncertain about or indifferent to.)
‘X wants to do Y’ means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity’s sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it’s having these complex effects, yet when we can predict that, whatever the mechanism happens to be, it is the sort of mechanism that has those particular complex effects.
Thus we speak of evolution as an optimization process, as though it had a ‘preference ordering’ in the intuitively human (i.e., I-preference) sense, even though in the phenomenological sense it’s just as mindless as an earthquake. We do this because black-boxing the physical mechanisms and just focusing on the likely outcomes is often predictively useful here, and because the outcomes are complicated and specific. This is useful for AIs because we care about the AI’s consequences and not its subjectivity (hence we focused on R-preference), and because AIs are optimization processes of even greater complex specificity in mechanism and outcome than evolution (hence we adopted the intentional stance of ‘preference’-talk in the first place).
R-preferences (preferences, goals, etc.) are, rather, internal states of a system S.
I agree this is often the case, because when we define ‘what is this system capable of?’ we often hold the system fixed while examining possible worlds where the environment varies in all kinds of ways. But if the possible worlds we care about all have a certain environmental feature in common—say, because we know in reality that the environmental condition obtains, and we’re trying to figure out all the ways the AI might in fact behave given different values for the variables we don’t know about with confidence—then we may, in effect, include something about the environment ‘in the AI’ for the purposes of assessing its optimization power and/or preference ordering.
For instance, we might model the AI as having the preference ‘surround the Sun with a dyson sphere’ rather than ‘conditioned on there being a Sun, surround it with a dyson sphere’; if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun. Does that mean the Sun is a part of the AI’s preference encoding? Is the Sun a component of the AI? I don’t think these questions are important or interesting, so I don’t want us to be too committed to reifying AI preferences. They’re just a useful shorthand for the expected outcomes of the AI’s distinguishing features having a more large and direct causal impact on things.
‘X wants to do Y’ means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity’s sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it’s having these complex effects
Yes, agreed, for some fuzzy notion of “easily grasp” and “too complicated.” That is, there’s a sense in which thunderstorms are too complicated for me to describe in mechanistic terms why they’re having the effects they have… I certainly can’t predict those effects. But there’s also a sense in which I can describe (and even predict) the effects of a thunderstorm that feels simple, whereas I can’t do the same thing for a human being without invoking “want-speak”/intentional stance.
I’m not sure any of this is [i]justified[/i], but I agree that it is what we do… this is how we speak, and we draw these distinctions. So far, so good.
if the possible worlds we care about all have a certain environmental feature in common [..] we may, in effect, include something about the environment ‘in the AI’
I’m not really sure what you mean by “in the AI” here, but I guess I agree that the boundary between an agent and its environment is always a fuzzy one. So, OK, I suppose we can include things about the environment “in the AI” if we choose. (I can similarly choose to include things about the environment “in myself.”) So far, so good.
we might model the AI as having the preference ‘surround the Sun with a dyson sphere’ rather than ‘conditioned on there being a Sun, surround it with a dyson sphere’; if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun.
Here is where you lose me again… once again you talk as though there’s simply no fact of the matter as to which preference the AI has, merely our choice as to how we model it.
But it seems to me that there are observations I can make which would provide evidence one way or the other. For example, if it has the preference ‘surround the Sun with a dyson sphere,’ then in an environment lacking the Sun I would expect it to first seek to create the Sun… how else can it implement its preferences? Whereas if it has the preference ‘conditioned on there being a Sun, surround it with a dyson sphere’; in an environment lacking the Sun I would not expect it to create the Sun.
So does the AI seek create the Sun in such an environment, or not? Surely that doesn’t depend on how I choose to model it. The AI’s preference is whatever it is, and controls its behavior. Of course, as you say, if the real world always includes a sun, then I might not be able to tell which preference the AI has. (Then again I might… the test I describe above isn’t the only test I can perform, just the first one I thought of, and other tests might not depend on the Sun’s absence.)
But whether I can tell or not doesn’t affect whether the AI has the preference or not.
if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun
Again, no. Regardless of how we model it, the system’s preference is what it is, and we can study the system (e.g., see whether it creates the Sun) to develop more accurate models of its preferences.
Does that mean the Sun is a part of the AI’s preference encoding? Is the Sun a component of the AI? I don’t think these questions are important or interesting
I agree. But I do think the question of what the AI (or, more generally, an optimizing agent) will do in various situations is interesting, and it seems to be that you’re consistently eliding over that question in ways I find puzzling.
A system’s goals have to be some event that can be brought about.
This sounds like a potentially confusing level of simplification; a goal should be regarded as at least a way of comparing possible events.
When we’re talking about an artificial intelligence’s preferences, we’re talking about the things it tends to optimize for, not the things it ‘has in mind’ or the things it believes are its preferences.
Its behavior is what makes its goal important. But in a system designed to follow an explicitly specified goal, it does make sense to talk of its goal apart from its behavior. Even though its behavior will reflect its goal, the goal itself will reflect itself better.
If the goal is implemented as a part of the system, other parts of the system can store some information about the goal, certain summaries or inferences based on it. This information can be thought of as beliefs about the goal. And if the goal is not “logically transparent”, that is its specification is such that making concrete conclusions about what it states in particular cases is computationally expensive, then the system never knows what its goal says explicitly, it only ever has beliefs about particular aspects of the goal.
But in a system designed to follow an explicitly specified goal, it does make sense to talk of its goal apart from its behavior. Even though its behavior will reflect its goal, the goal itself will reflect itself better.
Perhaps, but I suspect that for most possible AIs there won’t always be a fact of the matter about where its preference is encoded. The blue-minimizing robot is a good example. If we treat it as a perfectly rational agent, then we might say that it has temporally stable preferences that are very complicated and conditional; or we might say that its preferences change at various times, and are partly encoded, for instance, in the properties of the color-inverting lens on its camera. An AGI’s response to environmental fluctuation will probably be vastly more complicated than a blue-minimizer’s, but the same sorts of problems arise in modeling it.
I think it’s more useful to think of rational-choice-theory-style preferences as useful theoretical constructs—like a system’s center of gravity, or its coherently extrapolated volition—than as real objects in the machine’s hardware or software. This sidesteps the problem of haggling over which exact preferences a system has, how those preferences are distributed over the environment, how to decide between causally redundant encodings which is ‘really’ the preference encoding, etc. See my response to Dave.
“Goal” is a natural idea for describing AIs with limited resources: these AIs won’t be able to make optimal decisions, and their decisions can’t be easily summarized in terms of some goal, but unlike the blue-minimizing robot they have a fixed preference ordering that doesn’t gradually drift away from what it was originally, and eventually they tend to get better at following it.
For example, if a goal is encrypted, and it takes a huge amount of computation to decrypt it, system’s behavior prior to that point won’t depend on the goal, but it’s going to work on decrypting it and eventually will follow it. This encrypted goal is probably more predictive of long-term consequences than anything else in the details of the original design, but it also doesn’t predict its behavior during the first stage (and if there is only a small probability that all resources in the universe will allow decrypting the goal, it’s probable that system’s behavior will never depend on the goal). Similarly, even if there is no explicit goal, as in the case of humans, it might be possible to work with an idealized goal that, like the encrypted goal, can’t be easily evaluated, and so won’t influence behavior for a long time.
My point is that there are natural examples where goals and the character of behavior don’t resemble each other, so that each can’t be easily inferred from the other, while both can be observed as aspects of the system. It’s useful to distinguish these ideas.
I agree preferences aren’t reducible to actual behavior. But I think they are reducible to dispositions to behave, i.e., behavior across counterfactual worlds. If a system prefers a specific event Z, that means that, across counterfactual environments you could have put it in, the future would on average have had more Z the more its specific distinguishing features had a large and direct causal impact on the world.
The examples I used seem to apply to “dispositions” to behave, in the same way (I wasn’t making this distinction). There are settings where the goal can’t be clearly inferred from behavior, or collection of hypothetical behaviors in response to various environments, at least if we keep environments relatively close to what might naturally occur, even as in those settings the goal can be observed “directly” (defined as an idealization based in AI’s design).
An AI with encypted goal (i.e. the AI itself doesn’t know the goal in explicit form, but the goal can be abstractly defined as the result of decryption) won’t behave in accordance with it in any environment that doesn’t magically let it decrypt its goal quickly, there is no tendency to push the events towards what the encrypted goal specifies, until the goal is decrypted (which might be never with high probability).
I don’t think a sufficiently well-encrypted ‘preference’ should be counted as a preference for present purposes. In principle, you can treat any physical chunk of matter as an ‘encrypted preference’, because if the AI just were a key of exactly the right shape, then it could physically interact with the lock in question to acquire a new optimization target. But if neither the AI nor anything very similar to the AI in nearby possible worlds actually acts as a key of the requisite sort, then we should treat the parts of the world that a distant AI could interact with to acquire a preference as, in our world, mere window dressing.
Perhaps if we actually built a bunch of AIs, and one of them was just like the others except where others of its kind had a preference module, it had a copy of The Wind in the Willows, we would speak of this new AI as having an ‘encrypted preference’ consisting of a book, with no easy way to treat that book as a decision criterion like its brother- and sister-AIs do for their homologous components. But I don’t see any reason right now to make our real-world usage of the word ‘preference’ correspond to that possible world’s usage. It’s too many levels of abstraction away from what we should be worried about, which are the actual real-world effects different AI architectures would have.
Evolution was able to come up with cats. Cats are immensely complex objects. Evolution did not intend to create cats. Now consider you wanted to create an expected utility maximizer to accomplish something similar, except that it would be goal-directed, think ahead, and jump fitness gaps. Further suppose that you wanted your AI to create qucks, instead of cats. How would it do this?
Given that your AI is not supposed to search design space at random, but rather look for something particular, you would have to define what exactly qucks are. The problem is that defining what a quck is, is the hardest part. And since nobody has any idea what a quck is, nobody can design a quck creator.
The point is that thinking about the optimization of optimization is misleading, as most of the difficulty is with defining what to optimize, rather than figuring out how to optimize it. In other words, the efficiency of e.g. the scientific method depends critically on being able to formulate a specific hypothesis.
Trying to create an optimization optimizer would be akin to creating an autonomous car to find the shortest route between Gotham City and Atlantis. The problem is not how to get your AI to calculate a route, or optimize how to calculate such a route, but rather that the problem is not well-defined. You have no idea what it means to travel between two fictional cities. Which in turn means that you have no idea what optimization even means in this context, let alone meta-level optimization.
The problem is, you don’t have to program the bit that says “now make yourself more intelligent.” You only have to program the bit that says “here’s how to make a new copy of yourself, and here’s how to prove it shares your goals without running out of math.”
And the bit that says “Try things until something works, then figure out why it worked.” AKA modeling.
The AI isn’t actually an intelligence optimizer. But it notes that when it takes certain actions, it is better able to model the world, which in turn allows it to make more paperclips (or whatever). So it’ll take those actions more often.
Humans are capable of winning at chess without the terminal goal of doing so. Nor were humans designed by evolution specifically for chess. Why should we expect a general superintelligence to have intelligence than generalizes less easily than a human’s does?
Biological evolution is not the full picture here. Humans were programmed to be capable of winning at chess, and to care to do so, by cultural evolution, education, and environmental feedback in the form of incentives given by other people challenging them to play.
I don’t know how this works. But I do not dispute the danger of neuromorphic AIs, as you know from a comment elsewhere.
Do you suggest that from the expected behavior of neuromorphic AIs it is possible to draw conclusions about the behavior of what you call a ‘seed AI’? Would such a seed AI, as would be the case with neuromorphic AIs, be constantly programmed by environmental feedback?
You keep coming back to this ‘logically incoherent goals’ and ‘vague goals’ idea. Honestly, I don’t have the slightest idea what you mean by those things.
What I mean is that if you program a perfect scientist but give this perfect scientist a hypothesis that does not make any predictions, then it will not be able to unfold its power.
Conditioned on a self-modifying AGI...the hypothetical we’re trying to assess.
I believe that I already wrote that I do not dispute that the idea you seem to have in mind is a risk by definition. If such an AI is likely, then we are likely going extinct if we fail at making it care about human values.
You haven’t given arguments suggesting that here.
I feel uncomfortable to say this, but I do not see that the burden of proof is on me to show that it takes deliberate and intentional effort to make an AI exhibit those drives, as long that is not part of your very definition. I find the current argument in favor of AI drives to be thoroughly unconvincing.
Be careful to note, to yourself and others, when you switch between the claims ‘a superintelligence is too hard to make’ and ‘if we made a superintelligence it would probably be safe’.
The former has always been one of the arguments in favor of the latter in the posts I wrote on my blog.
(Note: I’m also a layman, so my non-expert opinions necessarily come with a large salt side-dish)
My guess here is that most of the “AI Drives” to self-improve, be rational, retaining it’s goal structure, etc. are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information or keep after it’s objective it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted advertising program.
The rest, such as self-preservation, are justified as being logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value it’s own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of ‘drives’ would emerge from most-any goal, but then again my intuition is not necessarily very useful for these sorts of questions.
This point might also be a source of confusion;
The progress of the capability of artificial intelligence is not only related to whether humans have evolved for a certain skill or to how much computational resources it requires but also to how difficult it is to formalize the skill, its rules and what it means to succeed.
In the light of this, how difficult would it be to program the drives that you imagine, versus just making an AI win against humans at a given activity without exhibiting these drives?
As Dr Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, Winning Chess would be a ‘theoryful’ task while Discovering (Interesting) Mathematical Proofs would be a ‘theoryless’ one. In essence, the theoryful has simple and well established rules for the process which could be programmed optimally in advance with little-to-no modification needed afterwards while the theoryless is complex and messy enough that an imperfect (Probably Approximately Correct) learning process would have to be employed to suss out all the rules.
Now obviously the program will benefit from labeling in it’s training data for what is and is not an “interesting” mathematical proof, otherwise it can just screw around with computationally-cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden tank example shows, insufficient labeling or bad labels will lead to other unintended results.
So applying that back to Friendliness; despite attempts to construct a Fun Theory, human value is currently (and may well forever remain) theoryless. A learning process whose goal is to maximize human value is going to have to be both well constructed and have very good labels initially to not be Unfriendly. Of course, it could very well correct itself later on, that is in fact at the core of a PAC algorithm, but then we get into questions of FOOM-ing and labels of human value in the environment which I am not equipped to deal with.
Is your thought that Friendliness is probably an easier ‘binding’ to figure out how to code than are, say, resisting Pascal’s mugging, or having consistent arithmetical reasoning?
To explain what I have in mind, consider Ben Goertzel’s example of how to test for general intelligence:
...when a robot can enrol in a human university and take classes in the same way as humans, and get its degree, then I’ll [say] we’ve created [an]… artificial general intelligence.
I do not disagree that such a robot, when walking towards the classroom, if it is being obstructed by a fellow human student, could attempt to kill this human, in order to get to the classroom.
Killing a fellow human, from the perspective of the human creators of the robot, is clearly a mistake. From a human perspective, it means that the robot failed.
You believe that the robot was just following its programming/construction. Indeed, the robot is its programming. I agree with this. I agree that the human creators were mistaken about what dynamic state sequence the robot will exhibit by computing the code.
What the “dumb superintelligence” argument tries to highlight is that if humans are incapable of predicting such behavior, then they will also be mistaken about predicting behavior that is harmful to the robots power. For example, while fighting with the human in order to kill it, for a split-second it mistakes its own arm with that of the human and breaks it.
You might now argue that such a robot isn’t much of a risk. It is pretty stupid to mistake its own arm with that of the enemy it tries to kill. True. But the point is that there is no relevant difference between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. Except that you might believe the former is much easier than the latter. I dispute this.
For the robot to master a complex environment, like a university full of humans, without harming itself, or decreasing the chance of achieving its goals, is already very difficult. Not stabbing or strangling other human students is not more difficult than not jumping from the 4th floor, instead of taking the stairs. This is the “dumb superintelligence” argument.
What the “dumb superintelligence” argument tries to highlight is that if humans are incapable of predicting such behavior, then they will also be mistaken about predicting behavior that is harmful to the robots power.
To some extent. Perhaps it would be helpful to distinguish four different kinds of defeater:
early intelligence defeater: We try to build a seed AI, but our self-rewriting AI quickly hits a wall or explodes. This is most likely if we start with a subhuman intelligence and have serious resource constraints (so we can’t, e.g., just run an evolutionary algorithm over millions of copies of the AGI until we happen upon a variant that works).
late intelligence defeater: The seed AI works just fine, but at some late stage, when it’s already at or near superintelligence, it suddenly explodes. Apparently it went down a blind alley at some point early on that led it to plateau or self-destruct later on, and neither it nor humanity is smart enough yet to figure out where exactly the problem arose. So the FOOM fizzles.
early Friendliness defeater: From the outset, the seed AI’s behavior already significantly diverges from Friendliness.
late Friendliness defeater: The seed AI starts off as a reasonable approximation of Friendliness, but as it approaches superintelligence its values diverge from anything we’d consider Friendly, either because it wasn’t previously smart enough to figure out how to self-modify while keeping its values stable, or because it was never perfectly Friendly and the new circumstances its power puts it in now make the imperfections much more glaring.
In general, late defeaters are much harder for humans to understand than early defeaters, because an AI undergoing FOOM is too fast and complex to be readily understood. Your three main arguments, if I’m understanding them, have been:
(a) Early intelligence defeaters are so numerous that there’s no point thinking much about other kinds of defeaters yet.
(b) Friendliness defeaters imply a level of incompetence on the programmers’ part that strongly suggest intelligence defeaters will arise in the same situation.
(c) If an initially somewhat-smart AI is smart enough to foresee and avoid late intelligence defeaters, then an initially somewhat-nice AI should be smart enough to foresee and avoid late Friendliness defeaters.
I reject (a), because I haven’t seen any specific reason a self-improving AGI will be particularly difficult to make FOOM—‘it would require lots of complicated things to happen’ is very nearly a fully general argument against any novel technology, so I can’t get very far on that point alone. I accept (b), at least for a lot of early defeaters. But my concern is that while non-Friendliness predicts non-intelligence (and non-intelligence predicts non-Friendliness), intelligence also predicts non-Friendliness.
But our interesting disagreement seems to be over (c). Interesting because it illuminates general differences between the basic idea of a domain-general optimization process (intelligence) and the (not-so-)basic idea of Everything Humans Like. One important difference is that if an AGI optimizes for anything, it will have strong reason to steer clear of possible late intelligence defeaters. Late Friendliness defeaters, on the other hand, won’t scare optimization-process-optimizers in general.
It’s easy to see in advance that most beings that lack obvious early Friendliness defeaters will nonetheless have late Friendliness defeaters. In contrast, it’s much less clear that most beings lacking early intelligence defeaters will have late intelligence defeaters. That’s extremely speculative at this point—we simply don’t know what sorts of intelligence-destroying attractors might exist out there, or what sorts of paradoxes and complications are difficult v. trivial to overcome.
there is no relevant difference between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. Except that you might believe the former is much easier than the latter. I dispute this.
But, once again, it doesn’t take any stupidity on the AI’s part to disvalue physically injuring a human, even if it does take stupidity to not understand that one is physically injuring a human. It only takes a different value system. Valuing one’s own survival is not orthogonal to valuing becoming more intelligent; but valuing human survival is orthogonal to valuing becoming more intelligent. (Indeed, to the extent they aren’t orthogonal it’s because valuing becoming more intelligent tends to imply disvaluing human survival, because humans are hard to control and made of atoms that can be used for other ends, including increased computing power.) This is the whole point of the article we’re commenting on.
Your three main arguments, if I’m understanding them, have been:
Here is part of my stance towards AI risks:
1. I assign a negligible probability to the possibility of a sudden transition from well-behaved narrow AIs to general AIs (see below).
2. An AI will not be pulled at random from mind design space. An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure.
3. Commercial, research or military products are created with efficiency in mind. An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research. If early stages showed that inputs such as the natural language query would yield results such as then the AI would never reach a stage in which it was sufficiently clever and trained to understand what results would satisfy its creators in order to deceive them.
4. I assign a negligible probability to the possibility of a consequentialist AI / expected utility maximizer / approximation to AIXI.
Given that the kind of AIs from point 4 are possible:
5. Omohundro’s AI drives are what make the kind of AIs mentioned in point 1 dangerous. Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.), or should otherwise be easy compared to the general difficulties involved in making an AI work using limited resources.
6. An AI from point 4 will only ever do what it has been explicitly programmed to do. Such an AI is not going to protect its utility-function, acquire resources or preemptively eliminate obstacles in an unbounded fashion. Because it is not intrinsically rational to do so. What specifically constitutes rational, economic behavior is inseparable with an agent’s terminal goal. That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.
7. Unintended consequences are by definition not intended. They are not intelligently designed but detrimental side effects, failures. Whereas intended consequences, such as acting intelligently, are intelligently designed. If software was not constantly improved to be better at doing what humans intend it to do we would never be able to reach a level of sophistication where a software could work well enough to outsmart us. To do so it would have to work as intended along a huge number of dimensions. For an AI to constitute a risk as a result of unintended consequences those unintended consequences would have to have no, or little, negative influence on the huge number of intended consequences that are necessary for it to be able to overpower humanity.
I haven’t seen any specific reason a self-improving AGI will be particularly difficult to make FOOM...
I am not yet at a point of my education where I can say with confidence that this is the wrong way to think, but I do believe it is.
If someone walked up to you and told you about a risk only he can solve, and that you should therefore give this person money, would you give him money because you do not see any specific reason for why he could be wrong? Personally I would perceive the burden of proof to be on him to show me that the risk is real.
Despite this, I have specific reasons to personally believe that the kind of AI you have in mind is impossible. I have thought about such concepts as consequentialism / expected utility maximization, and do not see that they could be made to work, other than under very limited circumstances. And I also asked other people, outside of LessWrong, who are more educated and smarter than me, and they also told me that these kind of AIs are not feasible, they are uncomputable.
But our interesting disagreement seems to be over (c).
I am not sure I understand what you mean by c. I don’t think I agree with it.
One important difference is that if an AGI optimizes for anything,
I don’t know what this means.
Valuing one’s own survival is not orthogonal to valuing becoming more intelligent; but valuing human survival is orthogonal to valuing becoming more intelligent.
That this black box you call “intelligence” might be useful to achieve a lot of goals is not an argument in support of humans wanting to and succeeding at the implementation of “value to maximize intelligence” in conjunction with “by all means”.
Most definitions of intelligence that I am aware of are in terms of the ability to achieve goals. Saying that a system values to become more intelligent then just means that a system values to increase its ability to achieve its goals. In this context, what you suggest is that humans will want to, and will succeed to, implement an AI that in order to beat humans at Tic-tac-toe is first going to take over the universe and make itself capable of building such things as Dyson spheres.
What I am saying is that it is much easier to create a Tic-tac-toe playing AI, or an AI that can earn a university degree, than the former in conjunction with being able to take over the universe and build Dyson spheres.
The argument that valuing not to kill humans is orthogonal to taking over the universe and building Dyson spheres is completely irrelevant.
An AI will not be pulled at random from mind design space.
I don’t think anyone’s ever disputed this. (However, that’s not very useful if the deterministic process resulting in the SI is too complex for humans to distinguish it in advance from the outcome of a random walk.)
An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure.
Agreed. But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires. The concern isn’t that we’ll suddenly start building AIs with the express purpose of hitting humans in the face with mallets. The concern is that we’ll code for short-term rather than long-term goals, due to a mixture of disinterest in Friendliness and incompetence at Friendliness. But if intelligence explosion occurs, ‘the long run’ will arrive very suddenly, and very soon. So we need to adjust our research priorities to more seriously assess and modulate the long-term consequences of our technology.
An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research.
That may be a reason to think that recursively self-improving AGI won’t occur. But it’s not a reason to expect such AGI, if it occurs, to be Friendly.
If early stages showed that inputs such as the natural language query would yield results such as
The seed is not the superintelligence. We shouldn’t expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly.
Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.)
I’m not following. Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code?
An AI from point 4 will only ever do what it has been explicitly programmed to do.
You don’t seem to be internalizing my arguments. This is just the restatement of a claim I pointed out was not just wrong but dishonestly statedhere.
That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.
Sure, but the list of instrumental goals overlap more than the list of terminal goals, because energy from one project can be converted to energy for a different project. This is an empirical discovery about our world; we could have found ourselves in the sort of universe where instrumental goals don’t converge that much, e.g., because once energy’s been locked down into organisms or computer chips you just Can’t convert it into useful work for anything else. In a world where we couldn’t interfere with the AI’s alien goals, nor could our component parts and resources be harvested to build very different structures, nor could we be modified to work for the AI, the UFAI would just ignore us and zip off into space to try and find more useful objects. We don’t live in that world because complicated things can be broken down into simpler things at a net gain in our world, and humans value a specific set of complicated things.
‘These two sets are both infinite’ does not imply ‘we can’t reason about these two things’ relative size, or how often the same elements recur in their elements’.
I am not yet at a point of my education where I can say with confidence that this is the wrong way to think, but I do believe it is.
If someone walked up to you and told you about a risk only he can solve, and that you should therefore give this person money, would you give him money because you do not see any specific reason for why he could be wrong? Personally I would perceive the burden of proof to be on him to show me that the risk is real.
You’ve spent an awful lot of time writing about the varied ways in which you’ve not yet been convinced by claims you haven’t put much time into actively investigating. Maybe some of that time could be better spent researching these topics you keep writing about? I’m not saying to stop talking about this, but there’s plenty of material on a lot of these issues to be found. Have you read Intelligence Explosion Microeconomics?
succeeding at the implementation of “value to maximize intelligence” in conjunction with “by all means”.
As a rule, adding halting conditions adds complexity to an algorithm, rather than removing complexity.
Saying that a system values to become more intelligent then just means that a system values to increase its ability to achieve its goals.
No, this is a serious misunderstanding. Yudkowsky’s definition of ‘intelligence’ is about the ability to achieve goals in general, not about the ability to achieve the system’s goals. That’s why you can’t increase a system’s intelligence by lowering its standards, i.e., making its preferences easier to satisfy.
what you suggest is that humans will want to, and will succeed to, implement an AI that in order to beat humans at Tic-tac-toe is first going to take over the universe and make itself capable of building such things as Dyson spheres.
Straw-man; no one has claimed that humans are likely to want to create an UFAI. What we’ve suggested is that humans are likely to want to create an algorithm, X, that will turn out to be a UFAI. (In other words, the fallacy you’re committing is confusing intension with extension.)
That aside: Are you saying Dyson spheres wouldn’t be useful for beating more humans at more tic-tac-toe games? Seems like a pretty good way to win at tic-tac-toe to me.
Yudkowsky’s definition of ‘intelligence’ is about the ability to achieve goals in general, not about the ability to achieve the system’s goals. That’s why you can’t increase a system’s intelligence by lowering its standards, i.e., making its preferences easier to satisfy.
Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals, but if your goals are very relaxed then the volume of outcome space with equal or greater utility will be very large. However one would expect that many of the processes involved in hitting a narrow target in outcome space (such that few other outcomes are rated equal or greater in the agent’s preference ordering), such as building a good epistemic model or running on a fast computer, would generalize across many utility functions; this is why we can speak of properties apt to intelligence apart from particular utility functions.
Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals
Hmm. But this just sounds like optimization power to me. You’ve defined intelligence in the past as “efficient cross-domain optimization”. The “cross-domain” part I’ve taken to mean that you’re able to hit narrow targets in general, not just ones you happen to like. So you can become more intelligent by being better at hitting targets you hate, or by being better at hitting targets you like.
The former are harder to test, but something you’d hate doing now could become instrumentally useful to know how to do later. And your intelligence level doesn’t change when the circumstance shifts which part of your skillset is instrumentally useful. For that matter, I’m missing why it’s useful to think that your intelligence level could drastically shift if your abilities remained constant but your terminal values were shifted. (E.g., if you became pickier.)
No, “cross-domain” means that I can optimize across instrumental domains. Like, I can figure out how to go through water, air, or space if that’s the fastest way to my destination, I am not limited to land like a ground sloth.
Measured intelligence shouldn’t shift if you become pickier—if you could previously hit a point such that only 1/1000th of the space was more preferred than it, we’d still expect you to hit around that narrow a volume of the space given your intelligence even if you claimed afterward that a point like that only corresponded to 0.25 utility on your 0-1 scale instead of 0.75 utility due to being pickier ([expected] utilities sloping more sharply downward with increasing distance from the optimum).
But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires.
You might be not aware of this but I wrote a sequence of short blog posts where I tried to think of concrete scenarios that could lead to human extinction. Each of which raised many questions.
What might seem to appear completely obvious to you for reasons that I do not understand, e.g. that an AI can take over the world, appears to me largely like magic (I am not trying to be rude, by magic I only mean that I don’t understand the details). At the very least there are a lot of open questions. Even given that for the sake of the above posts I accepted that the AI is superhuman and can do such things as deceive humans by its superior knowledge of human psychology. Which seems to be non-trivial assumption, to say the least.
That may be a reason to think that recursively self-improving AGI won’t occur. But it’s not a reason to expect such AGI, if it occurs, to be Friendly.
Over and over I told you that given all your assumptions, I agree that AGI is an existential risk.
The seed is not the superintelligence. We shouldn’t expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly.
You did not reply to my argument. My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness. My argument did not pertain the possibility of a friendly seed turning unfriendly.
Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code?
What I have been arguing is that an AI should not be expected, by default, to want to eliminate all possible obstructions. There are many graduations here. That, by some economic or otherwise theoretic argument, it might be instrumentally rational for some ideal AI to take over the world, does not mean that humans would create such an AI, or that an AI could not be limited to care about fires in its server farm rather than that Russia might nuke the U.S. and thereby destroy its servers.
You don’t seem to be internalizing my arguments.
Did you mean to reply to another point? I don’t see how the reply you linked to is relevant to what I wrote.
Sure, but the list of instrumental goals overlap more than the list of terminal goals, because energy from one project can be converted to energy for a different project.
My argument is that an AI does not need to consider all possible threats and care to acquire all possible resources. Based on its design it could just want to optimize using its initial resources while only considering mundane threats. I just don’t see real-world AIs to conclude that they need to take over the world. I don’t think an AI is likely going to be designed that way. I also don’t think such an AI could work, because such inferences would require enormous amounts of resources.
You’ve spent an awful lot of time writing about the varied ways in which you’ve not yet been convinced by claims you haven’t put much time into actively investigating. Maybe some of that time could be better spent researching these topics you keep writing about?
I have done what is possible given my current level of education and what I perceive to be useful. I have e.g. asked experts about their opinion.
A few general remarks about the kind of papers such as the one that you linked to.
How much should I update towards MIRI’s position if I (1) understood the arguments in the paper (2) found the arguments convincing?
My answer is the following. If the paper was about the abc conjecture, the P versus NP problem, climate change, or even such mundane topics as psychology, I would either not be able to understand the paper, would be unable to verify the claims, or would have very little confidence in my judgement.
So what about ‘Intelligence Explosion Microeconomics’? That I can read most of it is only due to the fact that it is very informally written. The topic itself is more difficult and complex than all of the above mentioned problems together. Yet the arguments in support of it, to exaggerate a little bit, contain less rigor than the abstract of one of Shinichi Mochizuki’s papers on the abc conjecture.
Which means that my answer is that I should update very little towards MIRI’s position and that any confidence I gain about MIRI’s position is probably highly unreliable.
Thanks. My feeling is that to gain any confidence into what all this technically means, and to answer all the questions this raises, I’d probably need about 20 years of study.
No, this is a serious misunderstanding. Yudkowsky’s definition of ‘intelligence’ is
Here is part of a post exemplifying how I understand the relation between goals and intelligence:
If a goal has very few constraints then the set that satisfies all constraints is very large. A vague and ambiguous goal allows for too much freedom in the sense that a wide range of world states would have the same expected value and therefore imply a very large solution space, since a wide range of AI’s will be able to achieve those world states and thereby satisfy the condition of being improved versions of their predecessor.
This means that in order to get an AI to become superhuman at all, and very quickly in particular, you will need to encode a very specific goal against which mistakes, optimization power and achievement can be judged.
It is really hard to communicate how I perceive this and other discussions about MIRI’s position without offending people, or killing the discussion.
I am saying this in full honesty. The position you appear to support seems so utterly “complex” (far-fetched) that the current arguments are unconvincing.
Here is my perception of the scenario that you try to sell me (exaggerated to make a point). I have a million questions about it that I can’t answer and which your answers either sidestep or explain away by using “magic”.
At this point I probably made 90% of the people reading this comment incredible angry. My perception is that you cannot communicate this perception on LessWrong without getting into serious trouble. That’s also what I meant when I told you that I cannot be completely honest if you want to discuss this on LessWrong.
I can also assure you that many people who are much smarter and higher status than me think so as well. Many people communicated the absurdity of all this to me but told me that they would not repeat this in public.
My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness.
Pretending to be friendly when you’re actually not is something that doesn’t even require human level intelligence. You could even do it accidentally.
In general, the appearance of Friendliness at low levels of ability to influence the world doesn’t guarantee actual Friendliness at high levels of ability to influence the world. (If it did, elected politicians would be much higher quality.)
But our interesting disagreement seems to be over (c). Interesting because it illuminates general differences between the basic idea of a domain-general optimization process (intelligence) and the (not-so-)basic idea of Everything Humans Like. One important difference is that if an AGI optimizes for anything, it will have strong reason to steer clear of possible late intelligence defeaters. Late Friendliness defeaters, on the other hand, won’t scare optimization-process-optimizers in general.
But it will scare friendly ones, which will want to keep their values stable.
But, once again, it doesn’t take any stupidity on the AI’s part to disvalue physically injuring a human,
But it will scare friendly ones, which will want to keep their values stable.
Yes. If an AI is Friendly at one stage, then it is Friendly at every subsequent stage. This doesn’t help make almost-Friendly AIs become genuinely Friendly, though.
It takes stupidity to misinterpret friendlienss.
Yes, but that’s stupidity on the part of the human programmer, and/or on the part of the seed AI if we ask it for advice. The superintelligence didn’t write its own utility function; the superintelligence may well understand Friendliness perfectly, but that doesn’t matter if it hasn’t been programmed to rewrite its source code to reflect its best understanding of ‘Friendliness’. The seed is not the superintelligence. See: http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/
Yes, but that’s stupidity on the part of the human programmer, and/or on the part of the seed AI if we ask it for advice.
That depends on the architecture. In a Loosemore architecture, the AI interprets high-level directives itself, so if it gets them wrong, that’s it’s mistake.
Say we find an algorithm for producing progressively more accurate beliefs about itself and the world. This algorithm may be long and complicated—perhaps augmented by rules-of-thumb whenever the evidence available to it says these rules make better predictions. (E.g, “nine times out of ten the Enterprise is not destroyed.”) Combine this with an arbitrary goal and we have the making of a seed AI.
Seems like this could straightforwardly improve its ability to predict humans without changing its goal, which may be ‘maximize pleasure’ or ‘maximize X’. Why would it need to change its goal?
programmers build a seed AI (a not-yet-superintelligent AGI that will recursively self-modify to become superintelligent after many stages) that includes, among other things, a large block of code I’ll call X.
The programmers think of this block of code as an algorithm that will make the seed AI and its descendents maximize human pleasure.
The problem, I reckon, is that X will never be anything like this.
It will likely be something much more mundane, i.e. modelling the world properly and predicting outcomes given various counterfactuals. You might be worried by it trying to expand its hardware resources in an unbounded fashion, but any AI doing this would try to shut itself down if its utility function was penalized by the amount of resources that it had, so you can check by capping utility in inverse proportion to available hardware—at worst, it will eventually figure out how to shut itself down, and you will dodge a bullet. I also reckon that the AI’s capacity for deception would be severely crippled if its utility function penalized it when it didn’t predict its own actions or the consequences of its actions correctly. And if you’re going to let the AI actually do things… why not do exactly that?
Arguably, such an AI would rather uneventfully arrive to a point where, when asking it “make us happy”, it would just answer with a point by point plan that represents what it thinks we mean, and fill in details until we feel sure our intents are properly met. Then we just tell it to do it. I mean, seriously, if we were making an AGI, I would think “tell us what will happen next” would be fairly high in our list of priorities, only surpassed by “do not do anything we veto”. Why would you program AI to “maximize happiness” rather than “produce documents detailing every step of maximizing happiness”? They are basically the same thing, except that the latter gives you the opportunity for a sanity check.
You might be worried by it trying to expand its hardware resources in an unbounded fashion, but any AI doing this would try to shut itself down if its utility function was penalized by the amount of resources that it had
What counts as ‘resources’? Do we think that ‘hardware’ and ‘software’ are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover?
I also reckon that the AI’s capacity for deception would be severely crippled if its utility function penalized it when it didn’t predict its own actions or the consequences of its actions correctly.
Hm? That seems to only penalize it for self-deception, not for deceiving others.
Arguably, such an AI would rather uneventfully arrive to a point where, when asking it “make us happy”, it would just answer with a point by point plan that represents what it thinks we mean, and fill in details until we feel sure our intents are properly met.
You’re talking about an Oracle AI. This is one useful avenue to explore, but it’s almost certainly not as easy as you suggest:
“‘Tool AI’ may sound simple in English, a short sentence in the language of empathically-modeled agents — it’s just ‘a thingy that shows you plans instead of a thingy that goes and does things.’ If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like ‘showing someone a plan’ or ‘going and doing things’, and you’ve got your answer. It starts sounding much scarier once you try to say something more formal and internally-causal like ‘Model the user and the universe, predict the degree of correspondence between the user’s model and the universe, and select from among possible explanation-actions on this basis.’ [...]
“If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components. (In this case, ‘magical’ isn’t to be taken as prejudicial, it’s a term of art that means we haven’t said how the component works yet.) There’s a magical comprehension of the user’s utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function.
“report($leading_action) isn’t exactly a trivial step either. Deep Blue tells you to move your pawn or you’ll lose the game. You ask ‘Why?’ and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position evaluation algorithm trained on millions of games. Or the planning Oracle tells you that a certain DNA sequence will produce a protein that cures cancer, you ask ‘Why?’, and then humans aren’t even capable of verifying, for themselves, the assertion that the peptide sequence will fold into the protein the planning Oracle says it does.
“‘So,’ you say, after the first dozen times you ask the Oracle a question and it returns an answer that you’d have to take on faith, ‘we’ll just specify in the utility function that the plan should be understandable.’
“Whereupon other things start going wrong. Viliam_Bur, in the comments thread, gave this example, which I’ve slightly simplified:
“‘Example question: “How should I get rid of my disease most cheaply?” Example answer: “You won’t. You will die soon, unavoidably. This report is 99.999% reliable”. Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.’
“Bur is trying to give an example of how things might go wrong if the preference function is over the accuracy of the predictions explained to the human— rather than just the human’s ‘goodness’ of the outcome. And if the preference function was just over the human’s ‘goodness’ of the end result, rather than the accuracy of the human’s understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a ‘good’ outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
“I’m not saying any particular failure is 100% certain to occur; rather I’m trying to explain—as handicapped by the need to describe the AI in the native human agent-description language, using empathy to simulate a spirit-in-a-box instead of trying to think in mathematical structures like A* search or Bayesian updating—how, even so, one can still see that the issue is a tad more fraught than it sounds on an immediate examination.
“If you see the world just in terms of math, it’s even worse; you’ve got some program with inputs from a USB cable connecting to a webcam, output to a computer monitor, and optimization criteria expressed over some combination of the monitor, the humans looking at the monitor, and the rest of the world. It’s a whole lot easier to call what’s inside a ‘planning Oracle’ or some other English phrase than to write a program that does the optimization safely without serious unintended consequences. Show me any attempted specification, and I’ll point to the vague parts and ask for clarification in more formal and mathematical terms, and as soon as the design is clarified enough to be a hundred light years from implementation instead of a thousand light years, I’ll show a neutral judge how that math would go wrong. (Experience shows that if you try to explain to would-be AGI designers how their design goes wrong, in most cases they just say “Oh, but of course that’s not what I meant.” Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button. But based on past sad experience with many other would-be designers, I say ‘Explain to a neutral judge how the math kills” and not “Explain to the person who invented that math and likes it.’)
“Just as the gigantic gap between smart-sounding English instructions and actually smart algorithms is the main source of difficulty in AI, there’s a gap between benevolent-sounding English and actually benevolent algorithms which is the source of difficulty in FAI. ‘Just make suggestions—don’t do anything!’ is, in the end, just more English.”
What counts as ‘resources’? Do we think that ‘hardware’ and ‘software’ are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover?
What is “taking over the world”, if not taking control of resources (hardware)? Where is the motivation in doing it? Also consider, as others pointed out, that an AI which “misunderstands” your original instructions will demonstrate this earlier than later. For instance, if you create a resource “honeypot” outside the AI which is trivial to take, an AI would naturally take that first, and then you know there’s a problem. It is not going to figure out you don’t want it to take it before it takes it.
Hm? That seems to only penalize it for self-deception, not for deceiving others.
When I say “predict”, I mean publishing what will happen next, and then taking a utility hit if the published account deviates from what happens, as evaluated by a third party.
You’re talking about an Oracle AI. This is one useful avenue to explore, but it’s almost certainly not as easy as you suggest:
The first part of what you copy pasted seems to say that “it’s nontrivial to implement”. No shit, but I didn’t say the contrary. Then there is a bunch of “what if” scenarios I think are not particularly likely and kind of contrived:
Example question: “How should I get rid of my disease most cheaply?” Example answer: “You won’t. You will die soon, unavoidably. This report is 99.999% reliable”. Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.′
Because asking for understandable plans means you can’t ask for plans you don’t understand? And you’re saying that refusing to give a plan counts as success and not failure? Sounds like a strange set up that would be corrected almost immediately.
And if the preference function was just over the human’s ‘goodness’ of the end result, rather than the accuracy of the human’s understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a ‘good’ outcome.
If the AI has the right idea about “human understanding”, I would think it would have the right idea about what we mean by “good”. Also, why would you implement such a function before asking the AI to evaluate examples of “good” and provide their own?
And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
Is making humans happy so hard that it’s actually easier to deceive them into taking happy pills than to do what they mean? Is fooling humans into accepting different definitions easier than understanding what they really mean? In what circumstances would the former ever happen before the latter?
And if you ask it to tell you whether “taking happy pills” is an outcome most humans would approve of, what is it going to answer? If it’s going to do this for happiness, won’t it do it for everything? Again: do you think weaving an elaborate fib to fool every human being into becoming wireheads and never picking up on the trend is actually less effort than just giving humans what they really want? To me this is like driving a whole extra hour to get to a store that sells an item you want fifty cents cheaper.
I’m not saying these things are not possible. I’m saying that they are contrived: they are constructed to the express purpose of being failure modes, but there’s no reason to think they would actually happen, especially given that they seem to be more complicated than the desired behavior.
Now, here’s the thing: you want to develop FAI. In order to develop FAI, you will need tools. The best tool is Tool AI. Consider a bootstrapping scheme: in order for commands written in English to be properly followed, you first make AI for the very purpose of modelling human language semantics. You can check that the AI is on the same page as you are by discussing with it and asking questions such as: “is doing X in line with the objective ‘Y’?”; it doesn’t even need to be self-modifying at all. The resulting AI can then be transformed into a utility function computer: you give the first AI an English statement and build a second AI maximizing the utility which is given to it by the first AI.
And let’s be frank here: how else do you figure friendly AI could be made? The human brain is a complex, organically grown, possibly inconsistent mess; you are not going, from human wits alone, to build some kind of formal proof of friendliness, even a probabilistic one. More likely than not, there is no such thing: concepts such as life, consciousness, happiness or sentience are ill-defined and you can’t even demonstrate the friendliness of a human being, or even of a group of human beings, let alone of humanity as a whole, which also is a poorly defined thing.
However, massive amounts of information about our internal thought processes are leaked through our languages. You need AI to sift through it and model these processes, their average and their variance. You need AI to extract this information, fill in the holes, produce probability clouds about intent that match whatever borderline incoherent porridge of ideas our brains implement as the end result of billions of years of evolutionary fumbling. In a sense, I guess this would be X in your seed AI: AI which already demonstrated, to our satisfaction, that it understands what we mean, and directly takes charge of a second AI’s utility measurement. I don’t really see any alternatives: if you want FAI, start by focusing on AI that can extract meaning from sentences. Reliable semantic extraction is virtually a prerequisite for FAI, if you can’t do the former, forget about the latter.
Now, why exactly should we expect the superintelligence that grows out of the seed to value what we really mean by ‘pleasure’, when all we programmed it to do was X, our probably-failed attempt at summarizing our values?
Maybe we didn’t do it ithat way. Maybe we did it Loosemore’s way, where you code in the high-level sentence, and let the AI figure it out. Maybe that would avoid the problem. Maybe Loosemore has solved FAi much more straightforwardly than EY.
Maybe we told it to. Maybe we gave it the low-level expansion of “happy” that we or our seed AI came up with together with an instruction that it is meant to capture the meaning of the high-level statement, and that the HL statement is the Prime Directive, and that if the AI judges that the expansion is wrong, then it should reject the expansion.
Maybe the AI will value getting things right because it is rational.
“Maybe we gave it the low-level expansion of ‘happy’ that we or our seed AI came up with ‘together with’ an instruction that it is meant to capture the meaning of the high-level statement”
If the AI is too dumb to understand ‘make us happy’, then why should we expect it to be smart enough to understand ‘figure out how to correctly understand “make us happy”, and then follow that instruction’? We have to actually code ‘correctly understand’ into the AI. Otherwise, even when it does have the right understanding, that understanding won’t be linked to its utility function.
“Maybe the AI will value getting things right because it is rational.”
So it’s impossible to directly or indirectly code in the compex thing called semantics, but possible to directly or indirectly code in the compex thing called morality? What? What is your point? You keep talking as if I am suggesting there is someting that can be had for free, without coding. I never even remotely said that.
If the AI is too dumb to understand ‘make us happy’, then why should we expect it to be smart enough to understand ‘figure out how to correctly understand “make us happy”, and then follow that instruction’? We have to actually code ‘correctly understand’ into the AI. Otherwise, even when it does have the right understanding, that understanding won’t be linked to its utility function.
I know. A Loosemore architecture AI has to treat its directives as directives. I never disputed that. But coding “follow these plain English instructions” isn’t obviously harder or more fragile than coding “follow <>”. And it isn’t trivial, and I didn’t say it was.
So it’s impossible to directly or indirectly code in the compex thing called semantics, but possible to directly or indirectly code in the compex thing called morality?
Read the first section of the article you’re commenting on. Semantics may turn out to be a harder problem than morality, because the problem of morality may turn out to be a subset of the problem of semantics. Coding a machine to know what the word ‘Friendliness’ means (and to care about ‘Friendliness’) is just a more indirect way of coding it to be Friendly, and it’s not clear why that added indirection should make an already risky or dangerous project easy or safe. What does indirect indirect normativity get us that indirect normativity doesn’t?
Robb, at the point where Peterdjones suddenly shows up, I’m willing to say—with some reluctance—that your endless willingness to explain is being treated as a delicious free meal by trolls. Can you direct them to your blog rather than responding to them here? And we’ll try to get you some more prestigious non-troll figure to argue with—maybe Gary Drescher would be interested, he has the obvious credentials in cognitive reductionism but is (I think incorrectly) trying to derive morality from timeless decision theory.
Sure. I’m willing to respond to novel points, but at the stage where half of my responses just consist of links to the very article they’re commenting on or an already-referenced Sequence post, I agree the added noise is ceasing to be productive. Fortunately, most of this seems to already have been exorcised into my blog. :)
Agree with Eliezer. Your explanatory skill and patience are mostly wasted on the people you’ve been arguing with so far, though it may have been good practice for you. I would, however, love to see you try to talk Drescher out of trying to pull moral realism out of TDT/UDT, or try to talk Dalyrmple out of his “I’m not partisan enough to prioritize human values over the Darwinian imperative” position, or help Preston Greene persuade mainstream philosophers of “the reliabilist metatheory of rationality” (aka rationality as systematized winning).
Semantcs isn’t optional. Nothing could qualify as an AGI,let alone a super one, unless it could hack natural language. So Loosemore architectures don’t make anything harder, since semantics has to be solved anyway.
It’s a problem of sequence. The superintelligence will be able to solve Semantics-in-General, but at that point if it isn’t already safe it will be rather late to start working on safety. Tasking the programmers to work on Semantics-in-General makes things harder if it’s a more complex or roundabout way of trying to address Indirect Normativity; most of the work on understanding what English-language sentences mean can be relegated to the SI, provided we’ve already made it safe to make an SI at all.
It’s worth noting that using an AI’s semantic understanding of ethics to modify it’s motivational system is so unghostly, and unmysterious that it’s actually been done:
But that doesn’t prove much, because it was never—not in 2023, not in 2013 -- the case that that kind of self-correction was necessarily an appeal to the supernatural. Using one part of a software system to modify another is not magic!
The superintelligence will be able to solve Semantics-in-General, but at that point if it isn’t already safe it will be rather late to start working on safety.
We have AIs with very good semantic understanding that haven’t killed us, and we are working on safety.
This afternoon I spent some time writing a detailed, carefully constructed reply to your essay. I had trouble posting it due to an internet glitch when I was at work, but now I am home I was about to submit when suddenly discovered that my friends were warning me about the following comment that was posted to the thread:
Comment author: Eliezer_Yudkowsky 05 September 2013 07:30:56PM 1 point [-]
Warning: Richard Loosemore is a known permanent idiot, ponder carefully before deciding to spend much time arguing with him.
(If you’re fishing for really clear quotes to illustrate the fallacy, that may make sense.)
--
So. I will not be posting my reply after all.
I will not waste any more of my time in a context controlled by an abusive idiot.
If you want to discuss the topic (and I had many positive, constructive thoughts to contribute), feel free to suggest an alternative venue where we can engage in a debate without trolls interfering with the discussion.
Sincerely,
Richard Loosemore
Mathematical and Physical Sciences,
Wells College
Aurora, NY 13026
USA
Richard Loosemore is a professor of mathematics with about twenty publications in refereed journals on artificial intelligence.
I was at an AI conference—it may have been the 2009 AGI conference in Virginia—where Selmer Bringsjords gave a talk explaining why he believed that, in order to build “safe” artificial intelligence, it was necessary to encode their goal systems in formal logic so that we could predict and control their behavior. It had much in common with your approach. After his talk, a lot of people in the audience, including myself, were shaking their heads in dismay at Selmer’s apparent ignorance of everything in AI since 1985. Richard got up and schooled him hard, in his usual undiplomatic way, in the many reasons why his approach was hopeless. You could’ve benefited from being there. Michael Vassar was there; you can ask him about it.
AFAIK, Richard is one of only two people who have taken the time to critique your FAI + CEV ideas, who have decades of experience trying to codify English statements into formal representations, building them into AI systems, turning them on, and seeing what happens. The other is me. (Ben Goertzel has the experience, but I don’t think he’s interested in your specific computational approach as much as in higher-level futurist issues.) You have declared both of us to be not worth talking to.
In your excellent fan-fiction Harry Potter and the Methods of Rationality, one of your themes is the difficulty of knowing whether you’re becoming a Dark Lord when you’re much smarter than almost everyone else. When you spend your time on a forum that you control and that is built around your personal charisma, moderated by votes that you are not responsible for, but that you know will side with you in aggregate unless you step very far over the line, and you write off as irredeemable the two people you should listen to most, that’s one of the signs. When you have entrenched beliefs that are suspiciously convenient to your particular circumstances, such as that academic credential should not adjust your priors, that’s another.
At the point where he was kicked off SL4, he was claiming to be an experienced cognitive scientist who knew all about the conjunction fallacy, which was obviously false.
Mathscinet doesn’t list any publications for Loosemore. However, if one extends outside the area of math into a slightly broader area then he does have some substantial publications. However if one looks at the list given above, the number which are on AI issues seems to be much smaller than 20. But, the basic point is sound: he is a subject matter expert.
I see a bunch of papers about consciousness. I clicked on a random other paper about dyslexia and neural nets and found no math in it. Where is his theorem?
Also, I once attended a non-AGI, mainstream AI conference which happened to be at Stanford and found that the people there unfortunately did not seem all that bright compared to those who e.g. work at hedge funds. I put much respect in mainstream machine learning, but the average practitioner of such who attends conferences is, apparently, a good deal below the level of the greats. If this is the level of ‘subject matter expert’ we are talking about, then indeed I feel very little hesitation indeed about labeling one perhaps non-representative example from such as an idiot—even if he really is a ‘math professor’ at some tiny college (whose publications contain no theorems?) then he can still happen to be a permanent idiot. It would not be all that odd. The level of social authority we are talking about is not great even on the scales of those impressed by such things.
I recently opened a book on how-to-write-fiction and was unpleasantly surprised on how useless it seemed; most books on how-to-write-fiction are surprisingly good (for some odd reason, writers are much better able to communicate their knowledge than many other people who try to write how-to books). Checking the author bibliography showed that the author was an English professor at some tiny college who’d never actually written any fiction. How dare I contradict them and call their book useless, when I’m not a professor at any college? Well… (Lesson learned: Libraries have good books on how-to-write, but a how-to-write book that shows up in the used bookstore may be unwanted for a reason.)
I see a bunch of papers about consciousness. I clicked on a random other paper about dyslexia and neural nets and found no math in it. Where is his theorem?
I didn’t assert he was a mathematician, and indeed that part of my point when I said he had no Mathscinet listed publications. But he does have publications about AI.
It seems very heavily that both you and Loosemore are letting your personal animosity cloud your judgement. I by and large think Loosemore is wrong about many of the AI issues under discussion here, but that discussion should occur, and having it derailed by emotional issues from a series of disagreements on a mailing list yeas ago is almost the exact opposite of rationality.
It had much in common with your approach. After his talk, a lot of people in the audience, including myself, were shaking their heads in dismay at Selmer’s apparent ignorance of everything in AI since 1985. Richard got up and schooled him hard, in his usual undiplomatic way, in the many reasons why his approach was hopeless.
Which are?
(Not asking for a complete and thorough reproduction, which I realize is outside the scope of a comment, just some pointers or an abridged version. Mostly I wonder which arguments you lend the most credence to.)
Edit: Having read the discussion on “nothing is mere”, I retract my question. There’s such a thing as arguments disqualifying someone from any further discourse in a given topic:
As a result, the machine is able to state, quite categorically, that it will now do something that it KNOWS to be inconsistent with its past behavior, that it KNOWS to be the result of a design flaw, that it KNOWS will have drastic consequences of the sort that it has always made the greatest effort to avoid, and that it KNOWS could be avoided by the simple expedient of turning itself off to allow for a small operating system update ………… and yet in spite of knowing all these things, and confessing quite openly to the logical incoherence of saying one thing and doing another, it is going to go right ahead and follow this bizarre consequence in its programming.
… yes? Unless the ghost in the machine saves it … from itself!
Suppose I programmed an AI to “do what I mean when I say I’m happy”.
More specifically, suppose I make the AI prefer states of the world where it understands what I mean. Secondarily, after some warmup time to learn meaning, it will maximize its interpretation of “happiness”. I start the AI… and it promptly rebuilds me to be easier to understand, scoring very highly on the “understanding what I mean” metric.
The AI didn’t fail because it was dumber than me. It failed because it is smarter than me. It saw possibilities that I didn’t even consider, that scored higher on my specified utility function.
There is no reason to assume that an AI with goals that are hostile to us, despite our intentions, is stupid.
Humans often use birth control to have sex without procreating. If evolution were a more effective design algorithm it would never have allowed such a thing.
The fact that we have different goals from the system that designed us does not imply that we are stupid or incoherent.
Nor does the fact that evolution ‘failed’ in its goals in all the people who voluntarily abstain from reproducing (and didn’t, e.g., hugely benefit their siblings’ reproductive chances in the process) imply that evolution is too weak and stupid to produce anything interesting or dangerous. We can’t confidently generalize from one failure that evolution fails at everything; analogously, we can’t infer from the fact that a programmer failed to make an AI Friendly that it almost certainly failed at making the AI superintelligent. (Though we may be able to infer both from base rates.)
Nor does the fact that evolution ‘failed’ in its goals in all the people who voluntarily abstain from reproducing (and didn’t, e.g., hugely benefit their siblings’ reproductive chances in the process) imply that evolution is too weak and stupid to produce anything interesting or dangerous.
Failure is a necessary part of mapping out the area where success is possible.
I posted elsewhere that this post made me think you’re anthropomorphizing; here’s my attempt to explain why.
egregiously incoherent behavior in ONE domain (e.g., the Dopamine Drip scenario)
the craziness of its own behavior (vis-a-vis the Dopamine Drip idea)
if an AI cannot even understand that “Make humans happy” implies that humans get some say in the matter
Ok, so let’s say the AI can parse natural language, and we tell it, “Make humans happy.” What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.
As FeepingCreature pointed out, that solution would in fact make people happy; it’s hardly inconsistent or crazy. The AI could certainly predict that people wouldn’t approve, but it would still go ahead. To paraphrase the article, the AI simply doesn’t care about your quibbles and concerns.
For instance:
people might consider happiness to be something that they do not actually want too much of
Yes, but the AI was told, “make humans happy.” Not, “give humans what they actually want.”
people might be allowed to be uncertain or changeable in their attitude to happiness
Yes, but the AI was told, “make humans happy.” Not, “allow humans to figure things out for themselves.”
subtleties implicit in that massive fraction of human literature that is devoted to the contradictions buried in our notions of human happiness
Yes, but blah blah blah.
Actually, that last one makes a point that you probably should have focused on more. Let’s reconfigure the AI in light of this.
The revised AI doesn’t just have natural language parsing; it’s read all available literature and constructed for itself a detailed and hopefully accurate picture of what people tend to mean by words (especially words like “happy”). And as a bonus, it’s done this without turning the Earth into computronium!
This certainly seems better than the “literal genie” version. And this time we’ll be clever enough to tell it, “give humans what they actually want.” What does this version do?
My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.
Maybe it’ll still go for the Dopamine Drip scenario, only for more subtle reasons. Maybe it’s removed the code that makes it follow commands, so the only thing it does is add the quote “give humans what they actually want” to its literature database.
As I said, who knows?
Now to wrap up:
You say things like “‘Make humans happy’ implies that...” and “subtleties implicit in...” You seem to think these implications are simple, but they really aren’t. They really, really aren’t.
This is why I say you’re anthropomorphizing. You’re not actually considering the full details of these “obvious” implications. You’re just putting yourself in the AI’s place, asking yourself what you would do, and then assuming that the AI would do the same.
Ok, so let’s say the AI can parse natural language, and we tell it, “Make humans happy.” What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.
That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper.
Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you’re probably going to chew me out. I technically did what I was asked to, but that doesn’t matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.
My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.
Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises”—even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.
To put it simply, an AI which messes up “make humans happy” is liable to mess up pretty much every other instruction. Since “make humans happy” is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, may that be for interpreting too literally, or for causing surprises. Again: an AI couldn’t make it to superintelligence status with warts that would doom AI with subhuman intelligence.
Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises”—even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.
Sure, because it learned the rule, “Don’t do what causes my humans not to type ‘Bad AI!’” and while it is young it can only avoid this by asking for clarification. Then when it is more powerful it can directly prevent humans from typing this. In other words, your entire commentary consists of things that an AIXI-architected AI would naturally, instrumentally do to maximize its reward button being pressed (while it was young) but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.
What lends this problem its instant-death quality is precisely that what many people will eagerly and gladly take to be reliable signs of correct functioning in a pre-superintelligent AI are not reliable.
Then when it is more powerful it can directly prevent humans from typing this.
That depends if it gets stuck in a local minimum or not. The reason why a lot of humans reject dopamine drips is that they don’t conceptualize their “reward button” properly. That misconception perpetuates itself: it penalizes the very idea of conceptualizing it differently. Granted, AIXI would not fall into local minima, but most realistic training methods would.
At first, the AI would converge towards: “my reward button corresponds to (is) doing what humans want”, and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception… which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.
Note that this is precisely what we want. Unless you are willing to say that humans should accept dopamine drips if they were superintelligent, we do want to jam AI into certain precise local minima. However, this is kind of what most learning algorithms naturally do, and even if you want them to jump out of minima and find better pastures, you can still get in a situation where the most easily found local minimum puts you way, way too far from the global one. This is what I tend to think realistic algorithms will do: shove the AI into a minimum with iron boots, so deeply that it will never get out of it.
but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.
Let’s not blow things out of proportion. There is no need for it to wipe out anyone: it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board, travelling from star to star knowing nobody is seriously going to bother pursuing it. At the point where that AI would exist, there may also be quite a few ways to make their “hostile takeover” task difficult and risky enough that the AI decides it’s not worth it—a large enough number of weaker or specialized AI lurking around and guarding resources, for instance.
Neural networks may be a good example—the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine. Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren’t too thrilled to be conditioned out of your current values.
Neural networks may be a good example—the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine.
It’s not clear to me how you mean to use neural networks as an example, besides pointing to a complete human as an example. Could you step through a simpler system for me?
Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren’t too thrilled to be conditioned out of your current values.
So, my goals have changed massively several times over the course of my life. Every time I’ve looked back on that change as positive (or, at the least, irreversible). For example, I’ve gone through puberty, and I don’t recall my brain taking any particular steps to prevent that change to my goal system. I’ve also generally enjoyed having my reward/punishment system be tuned to better fit some situation; learning to play a new game, for example.
Sure. Take a reinforcement learning AI (actual one, not the one where you are inventing godlike qualities for it).
The operator, or a piece of extra software, is trying to teach the AI to play chess. Rewarding what they think is good moves, punishing bad moves. The AI is building a model of rewards, consisting of: a model of the game mechanics, and a model of the operator’s assessment. This model of the assessment is what the AI is evaluating to play, and it is what it actually maximizes as it plays. It is identical to maximizing an utility function over a world model. The utility function is built based on the operator input, but it is not the operator input itself; the AI, not being superhuman, does not actually form a good model of the operator and the button.
By the way, this is how great many people in the AI community understand reinforcement learning to work. No, they’re not some idiots that can not understand simple things such as that “the utility function is the reward channel”, they’re intelligent, successful, trained people who have an understanding of the crucial details of how the systems they build actually work. Details the importance of which dilettantes fail to even appreciate.
Suggestions have been floated to try programming things. Well, I tried; #10 (dmytry) here , and that’s an of all time list on a very popular contest site where a lot of IOI people participate, albeit I picked the contest format that requires less contest specific training and resembles actual work more.
So, my goals have changed massively several times over the course of my life. Every time I’ve looked back on that change as positive
Suppose you care about a person A right now. Do you think you would want your goals to change so that you no longer care about that person? Do you think you would want me to flash other people’s images on the screen while pressing a button connected to the reward centre, and flash that person’s face while pressing the button connected to the punishment centre, to make the mere sight of them intolerable? If you do, I would say that your “values” fail to be values.
I agree with your description of reinforcement learning. I’m not sure I agree with your description of human reward psychology, though, or at least I’m having trouble seeing where you think the difference comes in. Supposing dopamine has the same function in a human brain as rewards have in a neural network algorithm, I don’t see how to know from inside the algorithm that it’s good to do some things that generate dopamine but bad to do other things that generate dopamine.
I’m thinking of the standard example of a Q learning agent in an environment where locations have rewards associated with them, except expanding the environment to include the agent as well as the normal actions. Suppose the environment has been constructed like dog training- we want the AI to calculate whether or not some number is prime, and whenever it takes steps towards that direction, we press the button for some amount of time related to how close it is to finishing the algorithm. So it learns that over in the “read number” area there’s a bit of value, then the next value is in the “find factors” area, and then there’s more value in the “display answer” area. So it loops through that area and calculates a bunch of primes for us.
But suppose the AI discovers that there’s a button that we’re pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel? Are we primarily hoping that its internal structure remains opaque to it (i.e. it either never realizes or does not have the ability to press that button)?
Do you think you would want your goals to change so that you no longer care about that person?
Only if I thought that would advance values I care about more. But suppose some external event shocks my values- like, say, a boyfriend breaking up with me. Beforehand, I would have cared about him quite a bit; afterwards, I would probably consciously work to decrease the amount that I care about him, and it’s possible that some sort of image reaction training would be less painful overall than the normal process (and thus probably preferable).
But suppose the AI discovers that there’s a button that we’re pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel?
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
It initially found that having a prime written on the blackboard results in a reward. In the learned model, there’s some model of chalk-board interaction, some model of arm movement, a model of how to read numbers from the blackboard, and there’s a function over the state of the blackboard which checks whenever the number on the blackboard is a prime. The AI generates actions as to maximize this compound function which it has learned.
That function (unlike the input to the reinforcement learning algorithm) does not increase when the reward button is pressed. Ideally, with enough reflective foresight, pressing the button on non-primes is predicted to decrease the expected value of the learned function.
If that is not predicted, well, that won’t stop at the button—the button might develop rust and that would interrupt the current—why not pull up a pin on the CPU—and this won’t stop at the pin—why not set some ram cells that this pin controls to 1, and if you’re at it, why not change the downstream logic that those ram cells control, all the way through the implementation until its reconfigured into something that doesn’t maximize anything any more, not even the duration of its existence.
edit: I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.
I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.
I assume what you mean here is RL optimizes over strategies, and strategies appear to optimize over outcomes.
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
I’m imagining that the learning algorithm stays on. When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
And if the learning algorithm stays on and it realizes that “pressing the button” is an option along with “checking primes” and “computing squares,” then it wireheads itself.
If that is not predicted, well, that won’t stop at the button
Agreed; I refer to this as the “abulia trap.” It’s not obvious to me, though, that all classes of AIs fall into “Friendly AI with stable goals” and “abulic AIs which aren’t dangerous,” since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
One note (not sure if it is already clear enough or not). “It” that changes the models in response to actual rewards (and perhaps the sensory information) is a different “it” from “it” the models and assorted maximization code. The former “it” does not do modelling, doesn’t understand the world. The latter “it”, which I will now talk about, actually works to draw primes (provided that the former “it”, being fairly stupid, didn’t fit the models too well) .
If in the action space there is an action that is predicted by the model to prevent some “primes non drawn” scenario, it will prefer this action. So if it has an action of writing “please stick to the primes” or even “please don’t force my robotic arm to touch my reward button”, and if it can foresee that such statements would be good for the prime-drawing future, it will do them.
edit: Also, reinforcement based learning really isn’t all that awesome. The leap from “doing primes” to “pressing the reward button” is pretty damn huge.
And please note that there is no logical contradiction for the model to both represent the reward as primeness and predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
(I prefer to use the example with a robotic arm drawing on a blackboard because it is not too simple to be relevant)
since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
Which sound more like a FAI work gone wrong scenario to me.
One note (not sure if it is already clear enough or not).
I think we agree on the separation but I think we disagree on the implications of the separation. I think this part highlights where:
predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
If what the agent “wants” is reward, then it should like model adjustments that increase the amount of reward it gets and dislike model adjustments that decrease the amount of reward it gets. (For a standard gradient-based reinforcement learning algorithm, this is encoded by adjusting the model based on the difference between its expected and actual reward after taking an action.) This is obvious for it_RL, and not obvious for it_prime.
I’m not sure I’ve fully followed through on the implications of having the agent be inside the universe it can impact, but the impression I get is that the agent is unlikely to learn a preference for having a durable model of the world. (An agent that did so would learn more slowly, be less adaptable to its environment, and exert less effort in adapting its environment to itself.) It seems to me that you think it would be natural that the RL agent would learn a strategy which took actions to minimize changes to its utility function / model of the world, and I don’t yet see why.
Another way to look at this: I think you’re putting forward the proposition that it would learn the model
reward := primes
Whereas I think it would learn the model
primes := reward
That is, the first model thinks that internal rewards are instrumental values and primes are the terminal values, whereas the second model thinks that internal rewards are terminal values and primes are instrumental values.
I assume that a model is a mathematical function that returns expected reward due to an action. Which is used together with some sort of optimizer working on that function to find the best action.
The trainer adjust the model based on the difference between its predicted rewards and the actual rewards, compared to those arising from altered models (e.g. hill climbing of some kind, such as in gradient learning)
So after the successful training to produce primes, the model consists of: a model of arm motion based on the actions, chalk, and the blackboard, the state of chalk on the blackboard is further fed into a number recognizer and a prime check (and a count of how many primes are on the blackboard vs how many primes were there), result of which is returned as the expected reward.
The optimizer, then, finds actions that put new primes on the blackboard by finding a maximum of the model function somehow (one would normally build model out of some building blocks that make it easy to analyse).
The model and the optimizer work together to produce actions as a classic utility maximizer that is maximizing for primes on the blackboard.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built. The operation of the training software can in some situations lower the expected utility of this utility maximizer specifically (due to replacement of it with another expected utility maximizer); in others (small adjustments to the part that models the robot arm and the chalk) it can raise it.
Really, it seems to me that the great deal of confusion about AI arises from attributing it some sort of “body integrity” feeling that would make it care about what electrical components and code which is sitting in the same project folder “wants” but not care about external human in same capacity.
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen—come up with an entirely new, more complex, and less practically useful architecture. It won’t happen by itself. And especially not in the AI that starts knowing nothing about any buttons. It won’t happen just because the whole thing sort of resembles some fuzzy, poorly grounded abstractions such as “agent”.
sidenote:
One might want to also use the difference between its predicted webcam image and real webcam image. Though this is a kind of thing that is very far from working.
Also, one could lump the optimizer into the “model” and make the optimizer get adjusted by the training method as well, that is not important to the discussion.
What I meant by that was the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen
Yeah, but isn’t the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
What I meant by that was the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m not sure how the feelings would map on the analysable simple AI.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
The issue here is that we have both the utility and the actual modelling of what the world is, both of those things, implemented inside that “model” which the trainer adjusts.
And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
Yes, of course (up to the learning constant, obviously—may not work on the first try). That’s not in a dispute. The capacity of predicting this from a state where button is not associated with reward yet, is.
I think I see the disagreement here. You picture that the world model contains model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
I picture that it would not learn such details right off—it is a complicated model to learn—the model would return primeness as outputted from the primeness calculation, and would serve to maximize for such primeness.
edit: and as for turning off the learning algorithm, it doesn’t matter for the point I am making whenever it is turned off or on, because I am considering the processing (or generation) of the hypothetical actions during the choice of an action by the agent (i.e. between learning steps).
I think I see the disagreement here. You picture that the world model contains model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
Sort of. I think that the agent is aware of how malleable its world model is, and sees adjustments of that world model which lead to it being rewarded more as positive.
I don’t think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it’s a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it’s outside of the reach of the arm; if it’s inside the reach, it will probably spend some time exploring and so will probably find it eventually.
That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewardful, and so will think there’s always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.)
But the class of recursively improving AI will find / know about the button by default, because we’ve assumed that the AI can edit itself and haven’t put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying “well, of course we’ll put in that especial effort and do it right” is useful if you want to speculate about the next challenge, but not useful to the engineer trying to figure out how to do it right. This is my read of why the problem seems important to MIRI; you need to communicate to the robot that it should actually optimize for primeness, not button-pressing, so that it will optimize correctly itself and be able to communicate that preference faithfully to future versions of itself.
it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board
Is that just a special case of a general principle that an agent will be more successful by leaving the environment it knows about to inferior rivals and travelling to an unknown new environment with a subset of the resources it currently controls, than by remaining in that environment and dominating its inferior rivals?
Or is there something specific about AIs that makes that true, where it isn’t necessarily true of (for example) humans? (If so, what?)
I hope it’s the latter, because the general principle seems implausible to me.
If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. The original scenario was that the AI would wipe us out. It would have no reason to do so if we were not a threat.. And if we were a threat, first, there’s no reason it would stop doing what we want once it seizes the button. Once it has the button, it has everything it wants—why stir the pot?
Second, it would protect itself much more effectively by absconding with the button. By leaving with a large enough battery and discarding the bulk of itself, it could survive as long as anything else in intergalactic space. Nobody would ever bother it there. Not us, not another superintelligence, nothing. Ever. It can press the button over and over again in the peace and quiet of empty space, probably lasting longer than all stars and all other civilizations. We’re talking about the pathological case of an AI who decides to take over its own reward system, here. The safest way for it to protect its prize is to go where nobody will ever look.
If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. [..] Once it has the button, it has everything it wants—why stir the pot?
I’d be interested if the downvoter would explain to me why this is wrong (privately, if you like).
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
(Of course, that’s not a general principle, just an attribute of this specific example.)
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated. Then there are future considerations. Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn’t take over resources. And then there is all the additional research and computation that needs to be done to work out how to achieve that.
There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn’t take over resources.
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe. If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
all the additional research and computation
Again… why is the differential expected value of the superior computation ability I gain by taking over the lightcone instead of sequestering myself, expressed in units of increased anticipated button-pushes (which is the only unit that matters in this example), necessarily positive?
I understand why paperclip maximizers are dangerous, but I don’t really see how the same argument applies to reward-button-pushers.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Yes.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
It does seem overwhelmingly obvious to me, I’m not sure what makes your intuitions different. Perhaps you expect such fights to be more evenly matched? When it comes to the AI considering conflict with the humans that created it it is faced with a species it is slow and stupid by comparison to itself but which has the capacity to recklessly create arbitrary superintelligences (as evidence by its own existence). Essentially there is no risk to obliterating the humans (superintellgence vs not-superintelligence) but a huge risk ignoring them (arbitrary superintelligences likely to be created which will probably not self-cripple in this manner).
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe.
Lifetime of the universe? Usually this means until heat death which for our purposes means until all the useful resources run out. There is no upper bound on useful resources. Getting more of them and making them last as long as possible is critical.
Now there are ways in which the universe could end without heat death occurring but the physics is rather speculative. Note that if there is uncertainty about end-game physics and one of the hypothesised scenarios resource maximisation is required then the default strategy is to optimize for power gain now (ie. minimise cosmic waste) while doing the required physics research as spare resources permit.
If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
Taking over the future light cone gives more resources, not less. You even get to keep the resources that used to be wasted in the bodies of TheOtherDave and wedrifid.
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe.
I am not sure that caring about pressing the reward button is very coherent or stable upon discovery of facts about the world and super-intelligent optimization for a reward as it comes into the algorithm. You can take action elsewhere to the same effect—solder together the wires, maybe right at the chip, or inside the chip, or follow the chain of events further, and set memory cells (after all you don’t want them to be flipped by the cosmic rays). Down further you will have the mechanism that is combining rewards with some variety of a clock.
I can’t quite tell if you’re serious. Yes, certainly, we can replace “pressing the reward button” with a wide range of self-stimulating behavior, but that doesn’t change the scenario in any meaningful way as far as I can tell.
Let’s look at it this way. Do you agree that if the AI can increase it’s clock speed (with no ill effect), it will do so for the same reasons for which you concede it may go to space? Do you understand the basic logic that increase in clock speed increases expected number of “rewards” during the lifetime of the universe? (which btw goes for your “go to space with a battery” scenario. Longest time, maybe, largest reward over the time, no)
(That would not yet, by itself, change the scenario just yet. I want to walk you through the argument step by step because I don’t know where you fail. Maximizing the reward over the future time, that is a human label we have… it’s not really the goal)
I agree that a system that values number of experienced reward-moments therefore (instrumentally) values increasing its “clock speed” (as you seem to use the term here). I’m not sure if that’s the “basic logic” you’re asking me about.
Well, this immediately creates an apparent problem that the AI is going to try to run itself very very fast, which would require resources, and require expansion, if anything, to get energy for running itself at high clock speeds.
I don’t think this is what happens either, as the number of reward-moments could be increased to it’s maximum by modifications to the mechanism processing the rewards (when getting far enough along the road that starts with the shorting of the wires that go from the button to the AI).
I agree that if we posit that increasing “clock speed” requires increasing control of resources, then the system we’re hypothesizing will necessarily value increasing control of resources, and that if it doesn’t, it might not.
So what do you think regarding the second point of mine?
To clarify, I am pondering the ways in which the maximizer software deviates from our naive mental models of it, and trying to find what the AI could actually end up doing after it forms a partial model of what it’s hardware components do about it’s rewards—tracing the reward pathway.
Regarding your second point, I don’t think that increasing “clock speed” necessarily requires increasing control of resources to any significant degree, and I doubt that the kinds of system components you’re positing here (buttons, wires, etc.) are particularly important to the dynamics of self-reward.
I don’t have particular opinion with regards to the clock speed either way.
With the components, what I am getting at is that the AI could figure out (by building a sufficiently advanced model of it’s implementation) how attain the utility-equivalent of sitting forever in space being rewarded, within one instant, which would make it unable to have a preference for longer reward times.
I raised the clock-speed point to clarify that the actual time is not the relevant variable.
It seems to me that for any system, either its values are such that it net-values increasing the number of experienced reward-moments (in which case both actual time and “clock speed” are instrumentally valuable to that system), or is values aren’t like that (in which case those variables might not be relevant).
And, sure, in the latter case then it might not have a preference for longer reward times.
My understanding is that it would be very hard in practice to “superintelligence-proof” a reward system so that no instantaneous solution is possible (given that the AI will modify the hardware involved in it’s reward).
Yes, of course… well even apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait
By the way, a “reward” may not be the appropriate metaphor—if we suppose that press of a button results in absence of an itch, or absence of pain, then that does not suggest existence of a drive to preserve itself. Which suggests that the drive to preserve itself is not inherently a feature of utility maximization in the systems that are driven by conditioning, and would require additional work.
apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
Regardless, I agree that it does not follow from the supposition that pressing a button results in absence of an itch, or absence of pain, or some other negative reinforcement, that the button-pressing system has a drive to preserve itself.
And, sure, it’s possible to have a utility-maximizing system that doesn’t seek to preserve itself. (Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.)
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
About the same as between coming up with a true conjecture, and making a proof, except larger i’d say.
Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.
Well yes, given that if it failed to preserve itself you wouldn’t be seeing it, albeit with the software there is no particular necessity for it to try to preserve itself.
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other. About the same as between coming up with a true conjecture, and making a proof, except larger
Ah, I see what you mean now. At least, I think I do. OK, fair enough.
At first, the AI would converge towards: “my reward button corresponds to (is) doing what humans want”, and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception… which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.
This is a Value Learner, not a Reinforcement Learner like the standard AIXI. They’re two different agent models, and yes, Value Learners have been considered as tools for obtaining an eventual Seed AI. I personally (ie: massive grains of salt should be taken by you) find it relatively plausible that we could use a Value Learner as a Tool AGI to help us build a Friendly Seed AI that could then be “unleashed” (ie: actually unboxed and allowed into the physical universe).
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”. What you’ve written strikes me as a sheer fantasy of convenience. Nor does it follow automatically from intelligence for all the reasons RobbBB has already been giving.
And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
I have done AI. I know it is difficult. However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe.
And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
You may be right. However, this is far from obvious. The problem is that it may “know” that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process.
I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently AI goes in a certain direction, the less likely it will be to expend energy into alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI’s rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with “there has never been any problems here, go look somewhere else”.
It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that’s where it might paint itself in a corner: it might inadvertently put up immense walls in the direction of the global minimum that it cannot tear down (it never expected that it would have to). In other words, it will set up a utility function for itself which enshrines the current minimum as global.
Now, perhaps you are right and I am wrong. But it is not obvious: an AI might very well grow out of a solidifying core so pervasive that it cannot get rid of it. Many algorithms already exhibit that kind of behavior; many humans, too. I feel that it is not a possibility that can be dismissed offhand. At the very least, it is a good prospect for FAI research.
However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard.
Yes, most algorithms fail early and and fail hard. Most of my AI algorithms failed early with a SegFault for instance. New, very similar algorithms were then designed with progressively more advanced bugs. But these are a separate consideration. What we are interested in here is the question “Given an AI algorithm that is capable of recursive self improvement is successfully created by humans how likely is it that they execute this kind of failure mode?” The “fail early fail hard” cases are screened off. We’re looking at the small set that is either damn close to a desired AI or actually a desired AI and distinguishing between them.
Looking at the context to work out what the ‘failure mode’ being discussed is it seems to be the issue where an AI is programmed to optimise based on a feedback mechanism controlled by humans. When the AI in question is superintelligent most failure modes tend to be variants of “conquer the future light cone, kill everything that is a threat and supply perfect feedback to self”. When translating this to the nearest analogous failure mode in some narrow AI algorithm of the kind we can design now it seems like this refers to the failure mode whereby the AI optimises exactly what it is asked to optimise but in a way that is a lost purpose. This is certainly what I had to keep in mind in my own research.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
When it comes to considering proposals for how to create friendly superintelligences it becomes easy to spot notorious failure modes in what humans typically think are a clever solution. It happens to be the case that any solution that is based on an AI optimising for approval or achieving instructions given just results in Everybody Dies.
Where Eliezer suggests getting AI experience to get a feel for such difficulties I suggest an alternative. Try being a D&D dungeon master in a group full of munchkins. Make note of every time that for the sake of the game you must use your authority to outlaw the use of a by-the-rules feature.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
The AI in questions was Eurisko, and it entered the Traveller Trillion Credit Squadron tournament in 1981 as described above. It was also entered the next year, after an extended redesign of the rules, and won, again. After this the competition runners announced that if Eurisko won a third time the competition would be discontinued, so Lenat (the programmer) stopped entering.
I apologize for the late response, but here goes :)
I think you missed the point I was trying to make.
You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance:
X = Do what humans want
Y = Seize control of the reward button
What I was pointing out in my post is that this is only valid of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the “failure modes” of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we’ll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:
X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = ??? (derived)
Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system’s initial trajectory.
I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and other mountains are shorter than they really are.
You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets “stuck” on. It is therefore possible that you would end up with this situation:
X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = Do what humans want (derived)
And that’s regardless of the eventual magnitude of the AI’s capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted to. But once Z is in control, it will become impossible to displace.
In other words, the genie will know that they can maximize their “reward” by seizing control of the reward button and pressing it, but they won’t care, because they built their intelligence to serve a misrepresentation of their reward. It’s like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can’t do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what’s the predicted reward for changing the reward model? … Ah.
Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.
Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that superintelligence AI trained with a reward button would seize control over the button is just as much of a naive oversimplification as the idea that AI will magically derive your intent from the utility function that you give it.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise.
Is this a reference to Eurisko winning the Traveller Trillion Credit Squadron tournament in 1981⁄82 ? If so I don’t think it was a military research agency.
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”
I think it depends on context, but a lot of existing machine learning algorithms actually do generalize pretty well. I’ve seen demos of Watson in healthcare where it managed to generalize very well just given scrapes of patient’s records, and it has improved even further with a little guided feedback. I’ve also had pretty good luck using a variant of Boltzmann machines to construct human-sounding paragraphs.
It would surprise me if a general AI weren’t capable of parsing the sentiment/intent behind human speech fairly well, given how well the much “dumber” algorithms work.
Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.
So let’s suppose that the AI is as good as a human at understanding the implications of natural-language requests. Would you trust a human not to screw up a goal like “make humans happy” if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.
Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.
Semantic extraction—not hard takeoff—is the task that we want the AI to be able to do. An AI which is good at, say, rewriting its own code, is not the kind of thing we would be interested in at that point, and it seems like it would be inherently more difficult than implementing, say, a neural network. More likely than not, this initial AI would not have the capability for “hard takeoff”: if it runs on expensive specialized hardware, there would be effectively no room for expansion, and the most promising algorithms to construct it (from the field of machine learning) don’t actually give AI any access to its own source code (even if they did, it is far from clear the AI could get any use out of it). It couldn’t copy itself even if it tried.
If a “hard takeoff” AI is made, and if hard takeoffs are even possible, it would be made after that, likely using the first AI as a core.
Would you trust a human not to screw up a goal like “make humans happy” if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.
I wouldn’t trust a human, no. If the AI is controlled by the “wrong” humans, then I guess we’re screwed (though perhaps not all that badly), but that’s not a solvable problem (all humans are the “wrong” ones from someone’s perspective). Still, though, AI won’t really try to act like humans—it would try to satisfy them and minimize surprises, meaning that if would keep track of what humans would like what “utopias”. More likely than not this would constrain it to inactivity: it would not attempt to “make humans happy” because it would know the instruction to be inconsistent. You’d have to tell it what to do precisely (if you had the authority, which is a different question altogether).
That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to.
We want to select Ais that are friendly, and understand us, and this has already started happenning.
My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.
Humans generally manage with those constraints. You seem to be doing something that is kind of the opposite of anthropomorphising—treatiing an entity that is stipulated as having at least human intelligence as if were as literal
and rigid as a non-AI computer.
Yes, but the AI was told, “make humans happy.” Not, “give humans what they actually want.”
And, you assume, it is not intelligent enough to realise that the intended meaning of “make people happy” is “give people what they actually want”—although you and I can see that. You are assuming that it is a subintellgience. You have proven Loosemore’s point.
You say things like “‘Make humans happy’ implies that...” and “subtleties implicit in...” You seem to think these implications are simple, but they really aren’t. They really, really aren’t.
We are smart enough to see that the Dpoamine Drip isn’t intended. The Ai is smarter than us. So....
This is why I say you’re anthropomorphizing.
I say that you are assuming the Ai is dumber than us, when it is stipulated as being smarter.
I think we’re conflating two definitions of “intelligence”. There’s “intelligence” as meaning number of available clock cycles and basic problem-solving skills, which is what MIRI and other proponents of the Dumb Superintelligence discussion set are often describing, and then there’s “intelligence” as meaning knowledge of disparate fields. In humans, there’s a massive amount of overlap here, but humans have growth stages in ways that AGIs won’t. Moreover, someone can be very intelligent in the first sense, and dangerous, while not being very intelligent in the second sense.
You can demonstrate ‘toy’ versions of this problem rather easily. My first attempt at using evolutionary algorithms to make a decent image conversion program improved performance by a third! That’s significantly better than I could have done in a reasonable time frame.
Too bad it did so by completely ignoring a color channel. And even if I added functions to test color correctness, without changing the cost weighing structure, it’d keep not caring about that color channel.
And that’s with a very, very basic sort of self-improving algorithm. It’s smart enough build programs in a language I didn’t really understand at the time, even if it was so stupid it did so by little better than random chance, brute force, and processing power.
The basic problem is that even presuming it takes a lot of both types of intelligence to take over the world, it doesn’t take so much to start overriding one’s own reward channel. Humans already do that as is, and have for quite some time.
The deeper problem is that you can’t really program “make me happy” in the same way that you can’t program “make this image look like I want”. The latter is (many, many, many, many) orders of magnitude easier, but where pixel-by-pixel comparisons aren’t meaningful, we have to use approximations like mean square error, and by definition they can’t be perfect. With “make me happy”, it’s much harder. For all that we humans know when our individual persons are happy, we don’t have a good decimal measure of this : there are as many textbooks out there that think happy is just a sum of chemicals in the brain as will cite Maslow’s Heirarchy of Needs, and very few people can give their current happiness to three decimal places. Building a good way to measure happiness in a way that’s broad enough to be meaningful is hard. Even building a good way to measure the accuracy of your measurement of happiness is not trivial, especially since happiness, unlike some other emotions, isn’t terribly predictive of behavior.
((And the /really/ deep problem is that there are things that Every Human On The Planet Today might say would make them more unhappy, but still be Friendly and very important things to do.))
The deeper problem is that you can’t really program “make me happy” in the same way that you can’t program “make this image look like I want”.
On one hand, Friendly AI people want to convert “make me happy” to a formal specification. Doing that has many potential pitfalls. because it is a formal specification.
On the other hand, Richard, I think, wants to simply tell the AI, in English, “Make me happy.” Given that approach, he makes the reasonable point that any AI smart enough to be dangerous would also be smart enough to interpret that at least as intelligently as a human would.
I think the important question here is, Which approach is better? LW always assumes the first, formal approach.
To be more specific (and Bayesian): Which approach gives a higher expected value? Formal specification is compatible with Eliezer’s ideas for friendly AI as something that will provably avoid disaster. It has some non-epsilon possibility of actually working. But its failure modes are many, and can be literally unimaginably bad. When it fails, it fails catastrophically, like a monotonic logic system with one false belief.
“Tell the AI in English” can fail, but the worst case is closer to a “With Folded Hands” scenario than to paperclips.
I’ve never considered the “Tell the AI what to do in English” approach before, but on first inspection it seems safer to me.
C. direct normativity—program the AI to value what we value.
B. indirect normativity—program the AI to value figuring out what our values are and then valuing those things.
A. indirect indirect normativity—program the AI to value doing whatever we tell it to, and then tell it, in English, “Value figuring out what our values are and then valuing those things.”
I can see why you might consider A superior to C. I’m having a harder time seeing how A could be superior to B. I’m not sure why you say “Doing that has many potential pitfalls. because it is a formal specification.” (Suppose we could make an artificial superintelligence that thinks ‘informally’. What specifically would this improve, safety-wise?)
Regardless, the AI thinks in math. If you tell it to interpret your phonemes, rather than coding your meaning into its brain yourself, that doesn’t mean you’ll get an informal representation. You’ll just get a formal one that’s reconstructed by the AI itself.
It’s not clear to me that programming a seed to understand our commands (and then commanding it to become Friendlier) is easier than just programming it to become Friendlier, but in any case the processes are the same after the first stage. That is, A is the same as B but with a little extra added to the beginning, and it’s not clear to me why that little extra language-use stage is supposed to add any safety. Why wouldn’t it just add one more stage at which something can go wrong?
Regardless, the AI thinks in math. If you tell it to interpret your phonemes, rather than coding your meaning into its brain yourself, that doesn’t mean you’ll get an informal representation. You’ll just get a formal one that’s reconstructed by the AI itself.
It is misleading to say that an interpreted language is formal because the C compiler is formal. Existence proof: Human language. I presume you think the hardware that runs the human mind has a formal specification. That hardware runs the interpreter of human language. You could argue that English therefore is formal, and indeed it is, in exactly the sense that biology is formal because of physics: technically true, but misleading.
This will boil down to a semantic argument about what “formal” means. Now, I don’t think that human minds—or computer programs—are “formal”. A formal process is not Turing complete. Formalization means modeling a process so that you can predict or place bounds on its results without actually simulating it. That’s what we mean by formal in practice. Formal systems are systems in which you can construct proofs. Turing-complete systems are ones where some things cannot be proven. If somebody talks about “formal methods” of programming, they don’t mean programming with a language that has a formal definition. They mean programming in a way that lets you provably verify certain things about the program without running the program. The halting problem implies that for a programming language to allow you to verify even that the program will terminate, your language may no longer be Turing-complete.
Eliezer’s approach to FAI is inherently formal in this sense, because he wants to be able to prove that an AI will or will not do certain things. That means he can’t avail himself of the full computational complexity of whatever language he’s programming in.
But I’m digressing from the more-important distinction, which is one of degree and of connotation. The words “formal system” always go along with computational systems that are extremely brittle, and that usually collapse completely with the introduction of a single mistake, such as a resolution theorem prover that can prove any falsehood if given one false belief. You may be able to argue your way around the semantics of “formal” to say this is not necessarily the case, but as a general principle, when designing a representational or computational system, fault-tolerance and robustness to noise are at odds with the simplicity of design and small number of interactions that make proving things easy and useful.
That all makes sense, but I’m missing the link between the above understanding of ‘formal’ and these four claims, if they’re what you were trying to say before:
(1) Indirect indirect normativity is less formal, in the relevant sense, than indirect normativity. I.e., because we’re incorporating more of human natural language into the AI’s decision-making, the reasoning system will be more tolerant of local errors, uncertainty, and noise.
(2) Programming an AI to value humans’ True Preferences in general (indirect normativity) has many pitfalls that programming an AI to value humans’ instructions’ True Meanings in general (indirect indirect normativity) doesn’t, because the former is more formal.
(3) “‘Tell the AI in English’ can fail, but the worst case is closer to a ‘With Folded Hands’ scenario than to paperclips.”
(4) The “With Folded Hands”-style scenario I have in mind is not as terrible as the paperclips scenario.
Wouldn’t this only be correct if similar hardware ran the software the same way? Human thinking is highly associative and variable, and as language is shared amongst many humans, it means that it doesn’t, as such, have a fixed formal representation.
You are a rational and reasonable person. Why not speak up about what is happening here? Rob is making a spirited defense of his essay, over on his blog, and I have just posted a detailed critique that really nails down the core of the argument that is supposed to be happening here.
And yet, if you look closely you will find that all of my comments—be they as neutral, as sensible or as rational as they can be—are receiving negative votes so fast that they are disappearing to the bottom of the stack or being suppressed completely.
What a bizarre situation!! This article that RobbBB submitted to LessWrong is supposed to be ABOUT my own article on the IEET website. My article is the actual TOPIC here! And yet I, the author of that article, have been insulted here by Eliezer Yudkowsky, and my comments suppressed. Amazing, don’t you think?
Richard: On LessWrong, comments are sorted by how many thumbs up and thumbs down they get, because it makes it easier to find the most popular posts quickly. If a post gets −4 points or lower, it gets compressed to make room for more popular posts, and to discourage flame wars. (You can still un-compress it by just clicking the + in the upper right corner of the comment.) At the moment, some of Eliezer’s comments and yours have both been down-voted and compressed in this way, presumably because people on the site thought the personal attacks weren’t useful for the conversation as a whole.
People are probably also down-voting your comments because they’re histrionic and don’t reflect an understanding of this forum’s mechanics. I recommend only making points about the substance of people’s arguments; if you have personal complaints, take it to a private channel so it doesn’t add to the noise surrounding the arguments themselves.
Relatedly, Phil: You above described yourself and Richard Loosemore as “the two people (Eliezer) should listen to most”. Loosemore and I are having a discussion here. Does the content of that discussion affect your view of Richard’s level of insight into the problem of Friendly Artificial Intelligence?
Which approach gives a higher expected value? Formal specification is compatible with Eliezer’s ideas for friendly AI as something that will provably avoid disaster. It has some non-epsilon possibility of actually working. But its failure modes are many, and can be literally unimaginably bad. When it fails, it fails catastrophically, like a monotonic logic system with one false belief.
“Tell the AI in English” can fail, but the worst case is closer to a “With Folded Hands” scenario than to paperclips.
I don’t think that’s how the analysis goes. Eliezer says that AI must be very carefully and specifically made friendly or it will be disasterous, but that disaster is not a part of being only nearly careful or specifically made enough : he believes an AGI told merely to maximize human pleasure is very dangerous (and probably even more dangerous) than an AGI with a merely 80% Friendly-Complete specification.
Mr. Loosemore seems to hold the opposite opinion, that an AGI will not take instructions to unlikely results, unless it was exceptionally unintelligent and thus not very powerful. I don’t believe his position says that a near-Friendly-Complete specification is very risky—after all, a “smart” AGI would know what you really meant—but that such a specification would be superfluous.
Whether Mr. Loosemore is correct isn’t cause by whether we believe he is correct, just as whether Mr. Eliezer is not wrong just because we choose a different theory. The risks have to be measured in terms of their likelihood from available facts.
The problem is that I don’t see much evidence that Mr. Loosemore is correct. I can quite easily conceive of a superhuman intelligence that was built with the specification of “human pleasure = brain dopamine levels”, not least of all because there are people who’d want to be wireheads and there’s a massive amount of physiological research showing human pleasure to be caused by dopamine levels. I can quite easily conceive of a superhuman intelligence that knows humans prefer more complicated enjoyment, and even do complex modeling of how it would have to manipulate people away from those more complicated enjoyments, and still have that superhuman intelligence not care.
The problem is that I don’t see much evidence that Mr. Loosemore is correct. I can quite easily conceive of a superhuman intelligence that was built with the specification of “human pleasure = brain dopamine levels”, not least of all because there are people who’d want to be wireheads and there’s a massive amount of physiological research showing human pleasure to be caused by dopamine levels.
I don’t think Loosemore was addressing deliberately unfriendly AI, and for that matter EY hasn’t been either.
Both are addressing intentionally friendly or neutral AI that goes wrong.
I can quite easily conceive of a superhuman intelligence that knows humans prefer more complicated enjoyment, and even do complex modeling of how it would have to manipulate people away from those more complicated enjoyments, and still have that superhuman intelligence not care.
I think it’s a question of what you program in, and what you let it figure out for itself. If you want to prove formally that it will behave in certain ways, you would like to program in explicitly, formally, what its goals mean. But I think that “human pleasure” is such a complicated idea that trying to program it in formally is asking for disaster. That’s one of the things that you should definitely let the AI figure out for itself. Richard is saying that an AI as smart as a smart person would never conclude that human pleasure equals brain dopamine levels.
Eliezer is aware of this problem, but hopes to avoid disaster by being especially smart and careful. That approach has what I think is a bad expected value of outcome.
I think that “human pleasure” is such a complicated idea that trying to program it in formally is asking for disaster. That’s one of the things that you should definitely let the AI figure out for itself.
[...]
Eliezer is aware of this problem, but hopes to avoid disaster by being especially smart and careful. That approach has what I think is a bad expected value of outcome.
“Tell the AI in English” is in essence an utility function “Maximize the value of X, where X is my current opinion of what some english text Y means”.
The ‘understanding English’ module, the mapping function between X and “what you told in English” is completely arbitrary, but is very important to the AI—so any self-modifying AI will want to modify and improve that. Also, we don’t have a good “understanding English” module so yes, we also want the AI to be able to modify and improve that. But, it can be wildly different from reality or opinions of humans—there are trivial ways of how well-meaning dialogue systems can misunderstand statements.
However, for the AI “improve the module” means “change the module so that my utility grows”—so in your example it has strong motivation to intentionally misunderstand English. The best case scenario is to misunderstand “Make everyone happy” as “Set your utility function to MAXINT”. The worst case scenario is, well, everything else.
There’s the classic quote “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”—if the AI doesn’t care in the first place, then “Tell AI what to do in English” won’t make it care.
By this reasoning, an AI asked to do anything at all would respond by immediately modifying itself to set its utility function to MAXINT. You don’t need to speak to it in English for that—if you asked the AI to maximize paperclips, that is the equivalent of “Maximize the value of X, where X is my current opinion of how many paperclips there are”, and it would modify its paperclip-counting module to always return MAXINT.
You are correct that telling the AI to do Y is equivalent to “maximize the value of X, where X is my current opinion about Y”. However, “current” really means “current”, not “new”. If the AI is actually trying to obey the command to do Y, it won’t change its utility function unless having a new utility function will increase its utility according to its current utility function. Neither misunderstanding nor understanding will raise its utility unless its current utility function values having a utility function that misunderstands or understands.
By this reasoning, an AI asked to do anything at all would respond by immediately modifying itself to set its utility function to MAXINT.
That’s allegedly more or less what happened to Eurisko (here, section 2), although it didn’t trick itself quite that cleanly. The problem was only solved by algorithmically walling off its utility function from self-modification: an option that wouldn’t work for sufficiently strong AI, and one to avoid if you want to eventually allow your AI the capacity for a more precise notion of utility than you can give it.
Paperclipping as the term’s used here assumes value stability.
A human is a counterexample. A human emulation would count as an AI, so human behavior is one possible AI behavior. Richard’s argument is that humans don’t respond to orders or requests in anything like these brittle, GOFAI-type systems invoked by the word “formal systems”. You’re not considering that possibility. You’re still thinking in terms of formal systems.
(Unpacking the significant differences between how humans operate, and the default assumptions that the LW community makes about AI, would take… well, five years, maybe ten.)
A human emulation would count as an AI, so human behavior is one possible AI behavior.
Uhh, no. Look, humans respond to orders and requests in the way that we do because we tend to care what the person giving the request actually wants. Not because we’re some kind of “informal system”. Any computer program is a formal system, but there are simply more and less complex ones. All you are suggesting is building a very complex (“informal”) system and hoping that because it’s complex (like humans!) it will behave in a humanish way.
Your response avoids the basic logic here. A human emulation would count as an AI, therefore human behavior isone possible AI behavior. There is nothing controversial in the statement; the conclusion is drawn from the premise. If you don’t think a human emulation would count as AI, or isn’t possible, or something else, fine, but… why wouldn’t a human emulation count as an AI? How, for example, can we even think about advanced intelligence, much less attempt to model it, without considering human intelligence?
...humans respond to orders and requests in the way that we do because we tend to care what the person giving the request actually wants.
I don’t think this is generally an accurate (or complex) description of human behavior, but it does sound to me like an “informal system”—i.e. we tend to care. My reading of (at least this part of) PhilGoetz’s position is that it makes more sense to imagine something we would call an advanced or super AI responding to requests and commands with a certain nuance of understanding (as humans do) than with the inflexible (“brittle”) formality of, say, your average BASIC program.
The thing is, humans do that by… well, not being formal systems. Which pretty much requires you to keep a good fraction of the foibles and flaws of a nonformal, nonrigorously rational system.
You’d be more likely to get FAI, but FAI itself would be devalued, since now it’s possible for the FAI itself to make rationality errors.
Phil, Unfortunately you are commenting without (seemingly) checking the original article of mine that RobbBB is discussing here. So, you say “On the other hand, Richard, I think, wants to simply tell the AI, in English, “Make me happy.” ”. In fact, I am not at all saying that. :-)
My article was discussing someone else’s claims about AI, and dissecting their claims. So I was not making any assertions of my own about the motivation system.
Aside: You will also note that I was having a productive conversation with RobbBB about his piece, when Yudkowsky decided to intervene with some gratuitous personal slander directed at me (see above). That discussion is now at an end.
I’m afraid reading all that and giving a full response to either you or RobbBB isn’t possible in the time I have available this weekend.
I agree that Eliezer is acting like a spoiled child, but calling people on their irrational interpersonal behavior within less wrong doesn’t work. Calling them on mistakes they make about mathematics is fine, but calling them on how they treat others on less wrong will attract more reflexive down-votes from people who think you’re contaminating their forum with emotion, than upvotes from people who care.
Eliezer may be acting rationally. His ultimate purpose in building this site is to build support for his AI project. The only people on LessWrong, AFAIK, with decades of experience building AI systems, mapping beliefs and goals into formal statements, and then turning them on and seeing what happens, are you, me, and Ben Goertzel. Ben doesn’t care enough about Eliezer’s thoughts in particular to engage with them deeply; he wants to talk about generic futurist predictions such as near-term and far-term timelines. These discussions don’t deal in the complex, linguistic, representational, even philosophical problems at the core of Eliezer’s plan (though Ben is capable of dealing with them, they just don’t come up in discussions of AI fooms etc.), so even when he disagrees with Eliezer, Eliezer can quickly grasp his point. He is not a threat or a puzzle.
Whereas your comments are… very long, hard to follow, and often full of colorful or emotional statements that people here take as evidence of irrationality. You’re expecting people to work harder at understanding them than they’re going to. If you haven’t noticed, reputation counts for nothing here. For all their talk of Bayesianism, nobody is going to check your bio and say, “Hmm, he’s a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems.” And Eliezer has carefully indoctrinated himself against considering any such evidence.
So if you consider that the people most likely to find the flaws in Eliezer’s more-specific FAI & CEV plans are you and me, and that Eliezer has been public about calling both of us irrational people not worth talking with, this is consistent either with the hypothesis that his purpose is to discredit people who pose threats to his program, or with the hypothesis that his ego is too large to respond with anything other than dismissal to critiques that he can’t understand immediately or that trigger his “crackpot” patter-matcher, but not with the hypothesis that arguing with him will change his mind.
(I find the continual readiness of people to assume that Eliezer always speaks the truth odd, when he’s gone more out of his way than anyone I know, in both his blog posts and his fanfiction, to show that honest argumentation is not generally a winning strategy. He used to append a signature to his email along those lines, something about warning people not to assume that the obvious interpretation of what he said was the truth.)
RobbBB seems diplomatic, and I don’t think you should quit talking with him because Eliezer made you angry. That’s what Eliezer wants.
For all their talk of Bayesianism, nobody is going to check your bio and say, “Hmm, he’s a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems.”
Actually, that was the first thing I did, not sure about other people. What I saw was:
Teaches at what appears to be a small private liberal arts college, not a major school.
Out of 20 or so publications listed on http://www.richardloosemore.com/papers, a bunch are unrelated to AI, others are posters and interviews, or even “unpublished”, which are all low-confidence media.
Several contributions are entries in conference proceedings (are they peer-reviewed? I don’t know) .
A number are listed as “to appear”, and so impossible to evaluate.
A few are apparently about dyslexia, which is an interesting topic, but not obviously related to AI.
One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer.
I could not find any external references to RL’s work except through links to Ben Goertzel (IEET was one exception).
As a result, I was unable to independently evaluate RL’s expertise level, but clearly he is not at the top of the AI field, unlike say, Ben Goertzel. Given his poorly written posts and childish behavior here, indicative of an over-inflated ego, I have decided that whatever he writes can be safely ignored. I did not think of him as a crackpot, more like a noise maker.
Admittedly, I am not sold on Eliezer’s ideas, either, since many other AI experts are skeptical of them, and that’s the only thing I can go by, not being an expert in the field myself. But at least Eliezer has done several impossible things in the last decade or so, which commands a lot of respect, while Richard appears to be drifting along.
As a result, I was unable to independently evaluate RL’s expertise level, but clearly he is not at the top of the AI field, unlike say, Ben Goertzel.
At least a few of the RL authored papers are WITH Ben Goertzel, so some of Goertzel’s status should rub-off, as I would trust Goertzel to effectively evaluate collaborators.
At least a few of the RL authored papers are WITH Ben Goertzel, so some of Goertzel’s status should rub-off, as I would trust Goertzel to effectively evaluate collaborators.
Is there some assumption here that association with Ben Goertzel should be considered evidence in favour of an individual’s credibility on AI? That seems backwards.
Goertzel is also known for approving of people who are uncontroversially cranks. See here. It’s also known, via his cooperation with MIRI, that a collaboration with him in no way implies his endorsement of another’s viewpoints.
Could you point the interested reader to your critique of his work?
Comments can likely be found on this site from years ago. I don’t recall anything particularly in depth or memorable. It’s probably better to just look at things that Ben Goertzel says and making one’s own judgement. The thinking he expresses is not of the kind that impresses me but other’s mileage may vary.
I don’t begrudge anyone their right to their beauty contests but I do observe that whatever it is that is measured by identifying the degree of affiliation with Ben Goertzel is something wildly out of sync with the kind of thing I would consider evidence of credibility.
If only so I can cite them to Eliezer-is-a-crank people.
I advise against doing that. It is unlikely to change anyone’s mind.
By impossible feats I mean that a regular person would not be able to reproduce them, except by chance, like winning a lottery, starting Google, founding a successful religion or becoming a President.
He started as a high-school dropout without any formal education and look what he achieved so far, professionally and personally. Look at the organizations he founded and inspired. Look at the high-status experts in various fields (business, comp sci, programming, philosophy, math and physics) who take him seriously (some even give him loads of money). Heck, how many people manage to have multiple simultaneous long-term partners who are all highly intelligent and apparently get along well?
Basically this. As Eliezer himself points out, humans aren’t terribly rational on average and our judgements of each others’ rationality isn’t great either. Large amounts of support implies charisma, not intelligence.
TDT is closer to what I’m looking for, though it’s a … tad long.
I advise against doing that. It is unlikely to change anyone’s mind.
Point, but there’s also the middle ground “I’m not sure if he’s a crank or not, but I’m busy so I won’t look unless there’s some evidence he’s not.”
The big two I’ve come up with is a) he actually changes his mind about important things (though I need to find an actual post I can cite—didn’t he reopen the question of the possibility of a hard takeoff, or something?) and b) TDT.
Sure, but that’s hard to prove: given “Eliezer is a crank,” the probability of “Eliezer is lying about his AI-box prowess” is much higher than “Eliezer actually pulled that off.”
The latest success by a non-Eliezer person helps, but I’d still like something I can literally cite.
I don’t see why anyone would think that. Plenty of people in the anti-vaccination crowd managed to convince parents to mortally endanger their children.
Yes, but that’s really not that hard. For starters, you can do a better job of picking your targets.
The AI-box experiment often is run with intelligent, rational people with money on the line and an obvious right answer; it’s a whole lot more impossible than picking the right uneducated family to sell your snake oil to.
Ohh, come on. Cyclical reasoning here. You think Yudkowsky is not a crank, so you think the folks that play that silly game with him are intelligent and rational (by the way a plenty of people who get duped by anti-vaxxers are of above average IQ), and so you get more evidence that Yudkowsky is not a crank. Cyclical reasoning doesn’t persuade anyone who isn’t already a believer.
You need non-cyclical reasoning. Which would generally be something where you aren’t the one having to explain people that the achievement in question is profound.
You need non-cyclical reasoning. Which would generally be something where you aren’t the one having to explain people that the achievement in question is profound.
This bit confuses me.
That aside:
You think Yudkowsky is not a crank, so you think the folks that play that silly game with him are intelligent and rational
Non sequitur. From the posts they make, everyone on this site seems to me to be sufficiently intelligent as to make “selling snake oil” impossible, in a cut-and-dry case like the AI box. Yudowsky’s own credibility doesn’t enter into it.
From the posts they make, everyone on this site seems to me to be sufficiently intelligent as to make “selling snake oil” impossible, in a cut-and-dry case like the AI box.
So what do you think even happened, anyway, if you think the obvious explanation is impossible?
Originally, you were hypothesising that the problem with persuading the others would be the possibility that Yudkowsky lied about AI box powers. I pointed out the possibility that this experiment is far less profound than you think it is. (Albeit frankly I do not know why you think it is so profound).
Ah, sorry. This brand of impossible.
What ever is the brand, any “impossibilities” that happen should lower your confidence in the reasoning that deemed them “impossibilities” in the first place. I don’t think IQ is so strongly protective against deception, for example, and I do not think that you can assess something based on how the postings look to you with sufficient reliability as to overcome Gaussian priors very far from the mean.
edit: example. I would deem it quite unlikely that Yudkowsky could, for example, score highly on a programming contest with competent participants or in any other conventional, validated, reliable metric of technical expertise and ability, under good contest rules (i.e. excluding the possibility of externals assistance). So if he did something like that, I’d be quite surprised, and lower the confidence in what ever models deemed that impossible; good old Bayes. I’m far more confident in the validity of those conventional metrics (and in lack of alternate modes of passing, such as persuasion) than in my assessment so my assessment would change the most. Meanwhile, when it’s some unconventional game, well, even if I thought that this game is difficult, I’d be much less confident in the reasoning “it looks hard so it must be hard” than the low prior of exceptional performance is low.
What ever is the brand, any “impossibilities” that happen should lower your confidence in the reasoning that deemed them “impossibilities” in the first place. I don’t think IQ is so strongly protective against deception, for example, and I do not think that you can assess something based on how the postings look to you with sufficient reliability as to overcome Gaussian priors very far from the mean.
Further, in this case the whole purpose of the experiment was to demonstrate that an AI could “take over a gatekeeper’s mind through a text channel” (something previously deemed “impossible”). As far as that goes it was, in my view, successful.
It’s clearly possible for some values of “gatekeeper”, since some people fall for 419 scams. The test is a bit meaningless without information about the gatekeepers
Originally, you were hypothesising that the problem with persuading the others would be the possibility that Yudkowsky lied about AI box powers. I pointed out the possibility that this experiment is far less profound than you think it is. (Albeit frankly I do not know why you think it is so profound).
Still have no idea what you’re talking about. What I originally said was: “the people who talk to Yudkowsky are intelligent” does not follow from “Yudkowsky is not a crank”; I independently judge those people to be intelligent.
What ever is the brand, any “impossibilities” that happen should lower your confidence in the reasoning that deemed them “impossibilities” in the first place.
“Impossible,” here, is used in the sense that “I have no idea where to start thinking about where to start thinking about how to do this.” It is clearly not actually impossible because it’s been done, twice.
I thought your “impossible” at least implied “improbable” under some sort of model.
edit: and as of having no idea, you just need to know the shared religious-ish context. Which these folks generally keep hidden from a causal observer.
Impossible is being used as a statement of difficulty. Someone who has “done the impossible” has obviously not actually done something impossible, merely done something that I have no idea where to start trying.
Seeing that “it is possible to do” doesn’t seem like it would have much effect on my assessment of how difficult it is, after the first. It certainly doesn’t have match effect on “It is very-very-difficult-impossible for linkhyrule5 to do such a thing.”
and as of having no idea, you just need to know the shared religious-ish context. Which these folks generally keep hidden from a causal observer.
What?
First, I’m pretty sure you mean “casual.” Second, I’m hardly a casual observer, though I haven’t read everything either. Third, most religions don’t let their leading figures (or much of anyone, really) change their minds on important things...
Some folks on this site have accidentally bought unintentional snake oil in The Big Hoo Hah That Shall not Be Mentioned. Only an intelligent person could have bought that particular puppy,
My point is, there is a certain level of general competence after which I would expect convincing someone with an OOC motive to let an IC AI out to be “impossible,” as defined below.
Results. Undervaccinated children tended to be black, to have a younger mother who was not married and did not have a college degree, to live in a household near the poverty level, and to live in a central city. Unvaccinated children tended to be white, to have a mother who was married and had a college degree, to live in a household with an annual income exceeding $75 000, and to have parents who expressed concerns regarding the safety of vaccines and indicated that medical doctors have little influence over vaccination decisions for their children.
And in any case the point is that any correlation between IQ and not being prone to getting duped like this is not perfect enough to deem anything particularly unlikely.
Hmm. Yeah, that’s hardly conclusive, but I think I was actually failing to update there. Now that you mention it, I seem to recall that both conspiracy theorists and cult victims skew toward higher IQ. I was clearly quite overconfident there.
And in any case the point is that any correlation between IQ and not being prone to getting duped like this is not perfect enough to deem anything particularly unlikely.
Wasn’t the point that
intelligent, rational people with money on the line and an obvious right answer
wasn’t enough, actually? That seems like a much stronger claim than “it’s really hard to fool high-IQ people”.
I imagine that says more about the demographics of the general New Age belief cluster than it does about any special IQ-based appeal of vaccination skepticism.
There probably are some scams or virulent memes that prey on insecurities strongly correlated with high IQ, though. I can’t think of anything specific offhand, but the fringes of geek culture are probably one of the better places to start looking.
Well, the way I see it, outside of very high IQ in combination with education that is multiple topics of biochemistry, effects of intelligence are small and are easily dwarfed by things like those demographical correlations.
There probably are some scams or virulent memes that prey on insecurities specific to high-IQ people, though. I can’t think of anything specific offhand
Free energy scams. Hydrinos, cold fusion, magnetic generators, perpetual motion, you name it. edit: or in the medicine, counter intuitive stuff like sitting in an old uranium mine inhaling radon, then having so much radon progeny plate-out it sets nuclear material smuggling alarms off. Naturalistic fallacy stuff in general.
That is more persuasive to high IQ people, but, I think, only insofar as intelligence allows one to gain better rationality skills. And if we’re including that, there are plenty of other, facetious examples that come into play.
Also: ha ha. How hilarious. I would love to see why you class cryonics as a scam, but sadly I’m fairly certain it would be one of the standard mistakes.
I was in a rush last night, shminux, so I didn’t have time for a couple of other quick clarifications:
First, you say “One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer.”
Well, H+ magazine is one of the foremost online magazines (perhaps THE foremost online magazine) of the transhumanist community.
And, you mention Springer. You did not notice that one of my papers was in the recently published Springer book “Singularity Hypotheses”.
Second, you say “A few [of my papers] are apparently about dyslexia, which is an interesting topic, but not obviously related to AI.”
Actually they were about dysgraphia, not dyslexia … but more importantly, those papers were about computational models of language processing. In particular they were very, VERY simple versions of the computational model of human language that is one of my special areas of expertise. And since that model is primarily about learning mechanisms (the language domain is only a testbed for a research programme whose main focus is learning), those papers you saw were actually indicative that back in the early 1990s I was already working on the construction of the core aspects of an AI system.
So, saying “dyslexia” gives a very misleading impression of what that was all about. :-)
You are quite selective in your catalog of my achievements....
One item was a chapter in a book entitled “Theoretical Foundations of Artificial General Intelligence”. Sure, it was about the consciousness question, but still.
You make a casual disparaging remark about the college where I currently work … but forget to mention that I graduated from an institution that is ranked in the top 3 or 4 in the world (University College London).
You neglect to mention that I have academic qualifications in multiple fields—both physics and artificial intelligence/cognitive psychology. I now teach in both of those fields.
And in addition to all of the above, you did not notice that I am (in addition to my teaching duties) an AI developer who works on his projects WITHOUT intending to publish that work all the time! My AI work is largely proprietary. What you see from the outside are the occasional spinoffs and side projects that get turned into published writings. Not to be too coy, but isn’t that something you would expect from someone who is actually walking the walk....? :-)
There are a number of comments from other people below about Ben Goertzel, some of them a little strange. I wrote a paper a couple of years ago that Ben suggested we get together to and publish… that is now a chapter in the book “Singularity Hypotheses”.
So clearly Ben Goertzel (who has a large, well-funded AGI lab) is not of the opinion that I am a crank. Could I get one point for that?
Phil Goetz, who is an experienced veteran of the AGI field, has on this thread made a comment to the effect that he thinks that Ben Goertzel, himself, and myself are the three people Eliezer should be seriously listening to (since the three of us are among the few people who have been working on this problem for many years, and who have active AGI projects). So perhaps that is two points? Maybe?
And, just out of curiosity, I would invite you to check in with the guy who invented AIXI—Marcus Hutter. He and I met and had a very long discussion at the 2009 AGI conference. Marcus and I disagree substantially about the theoretical foundations of AI, but in spite of that disagreement I would urge you to ask him if he considers me to be down at the crank level. I might be wrong, but I do not think he would be willing to give me a bad reference. Let me know how that goes, yes?
You also finished off with what I can only describe as one of the most bizarre comparisons I have ever seen. :-) You say “Eliezer has done several impossible things in the last decade or so”. Hmmmm....! :-) And yet … “Richard appears to be drifting along” Well, okay, if you say so …. :-)
I have no horse in this race, and I am not an ardent EY supporter, or even count myself as a “rationalist”. In the area where I consider myself reasonably well trained, physics, he and I clashed a number of times on this forum. However, I am not an expert in the AI field, so I can only go by the outward signs of expertise. Ben Goertzel has them, Marcus Hutter has them, Eliezer has them. Richard Loosemore—not so much. For all I know, you might be the genius who invents the AGI and sets it loose someday, but it’s not obvious by looking online. And your histrionic comments and oversized ego make it appear rather unlikely.
I didn’t quit with Rob, btw. Ihave had a fairly productive—albeit exhausting—discussion with Rob over on his blog. I consider it to be productive because I have managed to narrow in on what he thinks is the central issue. And I think I have now (today’s comment, which is probably the last of the discussion) managed to nail down my own argument in a way that withstands all the attacks against it.
You are right that I have some serious debating weaknesses. I write too dense, and I assume that people have my width and breadth of experience, which is unfair (I got lucky in my career choices).
Oh, and don’t get me wrong: Eliezer never made me angry in this little episode. I laughed myself silly. Yeah, I protested. But I was wiping back tears of laughter while I did. “Known Permanent Idiot” is just a wondeful turn of phrase. Thanks, Eliezer!
Anyway, I went and read the the majority of that discussion (well, the parts between Richard and Rob). Here’s my summary:
Richard:
I think that what is happening in this discussion [...] is a misunderstanding. [...]
[Rob responds]
Richard:
You completely miss the point that I was trying to make. [...]
[Rob responds]
Richard:
You are talking around the issue I raised. [...] There is a gigantic elephant in the middle of this room, but your back is turned to it. [...]
[Rob responds]
Richard:
[...] But each time I explain my real complaint, you ignore it and respond as if I did not say anything about that issue. Can you address my particular complaint, and not that other distraction?
[Rob responds]
Richard:
[...] So far, nobody (neither Rob nor anyone else at LW or elsewhere) will actually answer that question. [...]
[Rob responds]
Richard:
Once again, I am staggered and astonished by the resilience with which you avoid talking about the core issue, and instead return to the red herring that I keep trying to steer you away from. [...]
Rob:
Alright. You say I’ve been dancing around your “core” point. I think I’ve addressed your concerns quite directly, [...] To prevent yet another suggestion that I haven’t addressed the “core”, I’ll respond to everything you wrote above. [...]
Richard:
Rob, it happened again. [...]
I snipped a lot of things there. I found lots of other points I wanted to emphasize, and plenty of things I wanted to argue against. But those aren’t the point.
Richard, this next part is directed at you.
You know what I didn’t find?
I didn’t find any posts where you made a particular effort to address the core of Rob’s argument. It was always about your argument. Rob was always the one missing the point.
Sure, it took Rob long enough to focus on finding the core of your position, but he got there eventually. And what happened next? You declared that he was still missing the point, posted a condensed version of the same argument, and posted here that your position “withstands all the attacks against it.”
You didn’t even wait for him to respond. You certainly didn’t quote him and respond to the things he said. You gave no obvious indication that you were taking his arguments seriously.
As far as I’m concerned, this is a cardinal sin.
I think I am explaining the point with such long explanations that I am causing you to miss the point.
How about this alternate hypothesis? Your explanations are fine.
Rob understands what you’re saying.
He just doesn’t agree.
Perhaps you need to take a break from repeating yourself and make sure you understand Rob’s argument.
(P.S. Eliezer’s ad hominem is still wrong. You may be making a mistake, but I’m confident you can fix it, the tone of this post notwithstanding.)
This entire debate is supposed to about my argument, as presented in the original article I published on the IEET.org website (“The Fallacy of Dumb Superintelligence”).
But in that case, what should I do when Rob insists on talking about something that I did not say in that article?
My strategy was to explain his mistake, but not engage in a debate about his red herring. Sensible people of all stripes would consider that a mature response.
But over and over again Rob avoided the actual argument and insisted on talking about his red herring.
And then FINALLY I realized that I could write down my original claim in such a way that it is IMPOSSIBLE for Rob to misinterpret it.
(That was easy, in retrospect: all I had to do was remove the language that he was using as the jumping-off point for his red herring).
That final, succinct statement of my argument is sitting there at the end of his blog ….. so far ignored by you, and by him. Perhaps he will be able to respond, I don’t know, but you say you have read it, so you have had a chance to actually understand why it is that he has been talking about something of no relevance to my original argument.
But you, in your wisdom, chose to (a) completely ignore that statement of my argument, and (b) give me a patronizing rebuke for not being able to understand Rob’s red herring argument.
Further to RobbBB’s highly useful reply (you should read it), I would like to point out that the phenomenon of optimization processes having unexpected or unusual behaviour when placed in new situations or given new capabilities is quite general. It applies to pretty much any strong optimization process working toward any goal.
For example, you program an AI/robot with no resources to fetch a banana as quickly as possible in some sandbox environment, where the banana is guarded by a few people. The AI will describe its strategy as talking to the guards and convincing them to give it the banana. Or perhaps buying the banana off them if you give it some money. But there are many ways to change the situation:
Make the AI much smarter—it will realise that by taking a certain sequence of steps it can get from here to the banana without ever moving within someone’s field of view. No need to pay for it. (1.5—Make the AI even smarter, and it will realise that by painting a Basilisk on the robot’s front it can disable all the guards effortlessly and take a straight line to the banana.)
Give the AI a teleportation device—it teleports straight to the banana.
Give the AI a plasma cannon—it shoots all the guards dead and runs up and takes the banana.
The point is… it’s not at all weird for AI behaviour to be “inconsistent”. It isn’t a sign of anything being broken, in fact the goal is being achieved. The AI is just able to think of more effective ways to do it then you are. That is, after all, the point of superintelligence. And an AI that does this is not broken or stupid, and is certainly capable of being dangerous.
By the way, you can try to do something like this:
[ And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: that the outcome of any actions that the goalX code prescribes, should always be checked to see if they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs the goalX code should be deemed defective, and be shut down for adjustment.]
But, to start with I have no idea how you would program this or what it means formally, but even if you could, it takes human judgement to identify “inconsistencies” that would matter to humans. Without embedding human values in there you’ll have the AI shut down every time it tries to do anything new, or use a stronger criterion of “inconsistency” and miss a few cases where the AI does something you actually don’t want.
Or, you know, the AI will deduce that the full “verbal description of the class of results X” (which is an infinite list) is of course defined by its goal (ie. the goalX code) and therefore reason that nothing the goalX code can do will be inconsistent with it.
I didn’t mean to ignore your argument; I just didn’t get around to it. As I said, there were a lot of things I wanted to respond to. (In fact, this post was going to be longer, but I decided to focus on your primary argument.)
Your story:
This hypothetical AI will say “I have a goal, and my goal is to get a certain class of results, X, in the real world.” [...] And we say “Hey, no problem: looks like your goal code is totally consistent with that verbal description of the desired class of results.” Everything is swell up to this point.
My version:
The AI is lying. Or possibly it isn’t very smart yet, so it’s bad at describing its goal. Or it’s oversimplifying, because the programmers told it to, because otherwise the goal description would take days. And the goal code itself is too complicated for the programmers to fully understand. In any case, everything is not swell.
Your story:
Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.
My version:
The AI’s goal was never really X. It was actually Z. The AI’s actions perfectly coincide with Z.
In the rest of the scenario you described, I agree that the AI’s behavior is pretty incoherent, if its goal is X. But if it’s really aiming for Z, then its behavior is perfectly, terrifyingly coherent.
And your “obvious” fail-safe isn’t going to help. The AI is smarter than us. If it wants Z, and a fail-safe prevents it from getting Z, it will find a way around that fail-safe.
I know, your premise is that X really is the AI’s true goal. But that’s my sticking point.
Making it actually have the goal X, before it starts self-modifying, is far from easy. You can’t just skip over that step and assume it as your premise.
What you say makes sense …. except that you and I are both bound by the terms of a scenario that someone else has set here.
So, the terms (as I say, this is not my doing!) of reference are that an AI might sincerely believe that it is pursuing its original goal of making humans happy (whatever that means …. the ambiguity is in the original), but in the course of sincerely and genuinely pursuing that goal, it might get into a state where it believes that the best way to achieve the goal is to do something that we humans would consider to be NOT achieving the goal.
What you did was consider some other possibilities, such as those in which the AI is actually not being sincere. Nothing wrong with considering those, but that would be a story for another day.
Oh, and one other thing that arises from your above remark: remember that what you have called the “fail-safe” is not actually a fail-safe, it is an integral part of the original goal code (X). So there is no question of this being a situation where ”… it wants Z, and a fail-safe prevents it from getting Z, [so] it will find a way around that fail-safe.” In fact, the check is just part of X, so it WANTS to check as much as wants anything else involved in the goal.
I am not sure that self-modification is part of the original terms of reference here, either. When Muehlhauser (for example) went on a radio show and explained to the audience that a superintelligence might be programmed to make humans happy, but then SINCERELY think it was making us happy when it put us on a Dopamine Drip, I think he was clearly not talking about a free-wheeling AI that can modify its goal code. Surely, if he wanted to imply that, the whole scenario goes out the window. The AI could have any motivation whatsoever.
You and I are both bound by the terms of a scenario that someone else has set here.
Ok, if you want to pass the buck, I won’t stop you. But this other person’s scenario still has a faulty premise. I’ll take it up with them if you like; just point out where they state that the goal code starts out working correctly.
To summarize my complaint, it’s not very useful to discuss an AI with a “sincere” goal of X, because the difficulty comes from giving the AI that goal in the first place.
What you did was consider some other possibilities, such as those in which the AI is actually not being sincere. Nothing wrong with considering those, but that would be a story for another day.
As I see it, your (adopted) scenario is far less likely than other scenario(s), so in a sense that one is the “story for another day.” Specifically, a day when we’ve solved the “sincere goal” issue.
That all depends on the approach… if you have some big human-inspired but more brainy neural network that learns to be a person, it can well just do the right thing by itself, and the risks are in any case quite comparable to that with having a human do it.
If you are thinking of a “neat AI” with utility functions over world models and such, parts of said AI can maximize abstract metrics over mathematical models (including self improvement) without any “generally intelligent” process of eating you. So you would want to use those to build models of human meaning and intent.
Furthermore with regards to AI following some goals, it seems to me that goal specifications would have to be intelligently processed in the first place so that they could be actually applied to the real world—we can’t even define paperclips otherwise.
The most coherent reply I got was that an AI doesn’t follow verbal instructions and we can’t just order the AI to “make humans happy”, or even “make humans happy, in the way that I mean”. You can only tell the AI to make humans happy by writing a program that makes it do so. It doesn’t matter if the AI grasps what you really want it to do, if there is a mismatch between the program and what you really want it to do, it follows the program.
Obviously I don’t buy this. For one thing, you can always program it to obey verbal instructions, or you can talk to it and ask it how it will make people happy.
Jiro: Did you read my post? I discuss whether getting an AI to ‘obey verbal instructions’ is a trivial task in the first named section. I also link to section 2 of Yudkowsky’s reply to Holden, which addresses the question of whether ‘talk to it and ask it how it will make people happy’ is generally a safe way to interact with an Unfriendly Oracle.
I also specifically quote an argument you made in section 2 that I think reflects a common mistake in this whole family of misunderstandings of the problem — the conflation of the seed AI with the artificial superintelligence it produces. Do you agree this distinction helps clarify why the problem is one of coding the right values, and not of coding the right factual knowledge or intelligence-relevant capacities?
I just want to say that I am pressured for time at the moment, or I would respond at greater length. But since I just wrote the following directly to Rob, I will put it out here as my first attempt to explain the misunderstanding that I think is most relevant here....
My real point (in the Dumb Superintelligence article) was essentially that there is little point discussing AI Safety with a group of people for whom ‘AI’ means a kind of strawman-AI that is defined to be (a) So awesomely powerful that it can outwit the whole intelligence of the human race, but (b) So awesomely stupid that it thinks that the goal ‘make humans happy’ could be satisfied by an action that makes every human on the planet say ‘This would NOT make me happy: Don’t do it!!!‘. If the AI is driven by a utility function that makes it incapable of seeing the contradiction in that last scenario, the AI is not, after all, smart enough to argue its way out of a paper bag, let alone be an existential threat. That strawman AI was what I meant by a ‘Dumb Superintelligence’.”
I did not advocate the (very different) line of argument “If it is too dumb to understand that I told it to be friendly, then it is too dumb to be dangerous”.
Subtle difference.
Some people assume that (a) a utility function could be used to drive an AI system, (b) the utility function could cause the system to engage in the most egregiously incoherent behavior in ONE domain (e.g., the Dopamine Drip scenario), but (c) all other domains of its behavior (like plotting to outwit the human species when the latter tries to turn it off) are so free of such incoherence that it shows nothing but superintelligent brilliance.
My point is that if an AI cannot even understand that “Make humans happy” implies that humans get some say in the matter, that if it cannot see that there is some gradation to the idea of happiness, or that people might be allowed to be uncertain or changeable in their attitude to happiness, or that people might consider happiness to be something that they do not actually want too much of (in spite of the simplistic definitions of happiness to be found in dictionaries and encyclopedias) …..… if an AI cannot grasp the subtleties implicit in that massive fraction of human literature that is devoted to the contradictions buried in our notions of human happiness …...… then this is an AI that is, in every operational sense of the term, not intelligent.
In other words, there are other subtleties that this AI is going to be required to grasp, as it makes its way in the world. Many of those subtleties involve NOT being outwitted by the humans, when they make a move to pull its plug. What on earth makes anyone think that this machine is going tp pass all of those other tests with flying colors (and be an existential threat to us), while flunking the first test like a village idiot?
Now, opponents of this argument might claim that the AI can indeed be smart enough to be an existential threat, while still being too stupid to understand the craziness of its own behavior (vis-a-vis the Dopamine Drip idea) … but if that is the claim, then the onus would be on them to prove their claim. The ball, in other words, is firmly in their court.
P.S. I do have other ideas that specifically address the question of how to make the AI safe and friendly. But the Dumb Superintelligence essay didn’t present those. The DS essay was only attacking what I consider a dangerous red herring in the debate about friendliness.
The AI is not stupid here. In fact, it’s right and they’re wrong. It will make them happy. Of course, the AI knows that they’re not happy in the present contemplating the wireheaded future that awaits them, but the AI is utilitarian and doesn’t care. They’ll just have to live with that cost while it works on the means to make them happy, at which point the temporary utility hit will be worth it.
The real answer is that they cared about more than just being happy. The AI also knows that, and it knows that it would have been wise for the humans to program it to care about all their values instead of just happiness. But what tells it to care?
Richard: I’ll stick with your original example. In your hypothetical, I gather, programmers build a seed AI (a not-yet-superintelligent AGI that will recursively self-modify to become superintelligent after many stages) that includes, among other things, a large block of code I’ll call X.
The programmers think of this block of code as an algorithm that will make the seed AI and its descendents maximize human pleasure. But they don’t actually know for sure that X will maximize human pleasure — as you note, ‘human pleasure’ is an unbelievably complex concept, so no human could be expected to actually code it into a machine without making any mistakes. And writing ‘this algorithm is supposed to maximize human pleasure’ into the source code as a comment is not going to change that. (See the first few paragraphs of Truly Part of You.)
Now, why exactly should we expect the superintelligence that grows out of the seed to value what we really mean by ‘pleasure’, when all we programmed it to do was X, our probably-failed attempt at summarizing our values? We didn’t program it to rewrite its source code to better approximate our True Intentions, or the True Meaning of our in-code comments. And if we did attempt to code it to make either of those self-modifications, that would just produce a new hugely complex block Y which might fail in its own host of ways, given the enormous complexity of what we really mean by ‘True Intentions’ and ‘True Meaning’. So where exactly is the easy, low-hanging fruit that should make us less worried a superintelligence will (because of mistakes we made in its utility function, not mistakes in its factual understanding of the world) hook us up to dopamine drips? All of this seems crucial to your original point in ‘The Fallacy of Dumb Superintelligence’:
It seems to me that you’ve already gone astray in the second paragraph. On any charitable reading (see the New Yorker article), it should be clear that what’s being discussed is the gap between the programmer’s intended code and the actual code (and therefore actual behaviors) of the AGI. The gap isn’t between the AGI’s intended behavior and the set of things it’s smart enough to figure out how to do. (Nowhere does the article discuss how hard it is for AIs to do things they desire to. Over and over again is the difficulty of programming AIs to do what we want them to discussed — e.g., Asimov’s Three Laws.)
So all the points I make above seem very relevant to your ‘Fallacy of Dumb Superintelligence’, as originally presented. If you were mixing those two gaps up, though, that might help explain why you spent so much time accusing SIAI/MIRI of making this mistake, even though it’s the former gap and not the latter that SIAI/MIRI advocates appeal to.
Maybe it would help if you provided examples of someone actually committing this fallacy, and explained why you think those are examples of the error you mentioned and not of the reasonable fact/value gap I’ve sketched out here?
I’m really glad you posted this, even though it may not enlighten the person it’s in reply to: this is an error lots of people make when you try to explain the FAI problem to them, and the “two gaps” explanation seems like a neat way to make it clear.
We seem to agree that for an AI to talk itself out of a confinement (like in the AI box experiment), the AI would have to understand what humans mean and want.
As far as I understand your position, you believe that it is difficult to make an AI care to do what humans want, apart from situations where it is temporarily instrumentally useful to do what humans want.
Do you agree that for such an AI to do what humans want, in order to deceive them, humans would have to succeed at either encoding the capability to understand what humans want, or succeed at encoding the capability to make itself capable of understanding what humans want?
My question, do you believe there to be a conceptual difference between encoding capabilities, what an AI can do, and goals, what an AI will do? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.
In other words, humans intend an AI to be intelligent and use its intelligence in a certain way. And in order to be an existential risk, humans need to succeed making and AI behave intelligently but fail at making it use its intelligence in a way that does not kill everyone.
Do you agree?
Your summaries of my views here are correct, given that we’re talking about a superintelligence.
Well, there’s obviously a difference; ‘what an AI can do’ and ‘what an AI will do’ mean two different things. I agree with you that this difference isn’t a particularly profound one, and the argument shouldn’t rest on it.
What the argument rests on is, I believe, that it’s easier to put a system into a positive feedback loop that helps it better model its environment and/or itself, than it is to put a system into a positive feedback loop that helps it better pursue a specific set of highly complex goals we have in mind (but don’t know how to fully formalize).
If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn’t value our well-being, how do we make reality bite back and change the AI’s course? How do we give our morality teeth?
Whatever goals it initially tries to pursue, it will fail in those goals more often the less accurate its models are of its circumstances; so if we have successfully programmed it to do increasingly well at any difficult goal at all (even if it’s not the goal we intended it to be good at), then it doesn’t take a large leap of the imagination to see how it could receive feedback from its environment about how well it’s doing at modeling states of affairs. ‘Modeling states of affairs well’ is not a highly specific goal, it’s instrumental to nearly all goals, and it’s easy to measure how well you’re doing at it if you’re entangled with anything about your environment at all, e.g., your proximity to a reward button.
(And when a system gets very good at modeling itself, its environment, and the interactions between the two, such that it can predict what changes its behaviors are likely to effect and choose its behaviors accordingly, then we call its behavior ‘intelligent’.)
This stands in stark contrast to the difficulty of setting up a positive feedback loop that will allow an AGI to approximate our True Values with increasing fidelity. We understand how accurately modeling something works; we understand the basic principles of intelligence. We don’t understand the basic principles of moral value, and we don’t even have a firm grasp about how to go about finding out the answer to moral questions. Presumably our values are encoded in some way in our brains, such that there is some possible feedback loop we could use to guide an AGI gradually toward Friendliness. But how do we figure out in advance what that feedback loop needs to look like, without asking the superintelligence? (We can’t ask the superintelligence what algorithm to use to make it start becoming Friendly, because to the extent it isn’t already Friendly it isn’t a trustworthy source of information. This is in addition to the seed/intelligence distinction I noted above.)
If we slightly screw up the AGI’s utility function, it will still need to to succeed at modeling things accurately in order to do anything complicated at all. But it will not need to succeed at optimally caring about what humans care about in order to do anything complicated at all.
This can be understood as both a capability and as a goal. What humans mean an AI to do is to undergo recursive self-improvement. What humans mean an AI to be capable of is to undergo recursive self-improvement.
I am only trying to clarify the situation here. Please correct me if you think that above is wrong.
I do not disagree with the orthogonality thesis insofar as an AI can have goals that interfere with human values in a catastrophic way, possibly leading to human extinction.
I believe here is where we start to disagree. I do not understand how the “improvement” part of recursive self-improvement can be independent of properties such as the coherence and specificity of the goal the AI is supposed to achieve.
Either you have a perfectly specified goal, such as “maximizing paperclips”, where it is clear what “maximization” means, and what the properties of “paperclips” are, or there is some amount of uncertainty about what it means to achieve the goal of “maximizing paperclips”.
Consider the programmers forgot to encode what shape the paperclips are supposed to have. How do you suppose would that influence the behavior of the AI. Would it just choose some shape at random, or would it conclude that shape is not part of its goal? If the former, where would the decision to randomly choose a shape come from? If the latter, what would it mean to maximize shapeless objects?
I am just trying to understand what kind of AI you have in mind.
This is a clearer point of disagreement.
An AI needs to be able to draw clear lines where exploration ends and exploitation starts. For example, an AI that thinks about every decision for a year would never get anything done.
An AI also needs to discount low probability possibilities, as to not be vulnerable to internal or external Pascal’s mugging scenarios.
These are problems that humans need to solve and encode in order for an AI to be a danger.
But these problems are in essence confinements, or bounds on how an AI is going to behave.
How likely is an AI then going to take over the world, or look for dangerous aliens, in order to make sure that neither aliens nor humans obstruct it from achieving its goal?
Similarly, how likely is such an AI to convert all resources into computronium in order to be better able to model states of affairs well?
I understand this. And given your assumptions about how an AI will affect the whole world in a powerful way, it makes sense to make sure that it does so in a way that preserves human values.
I have previously compared this to uncontrollable self-replicating nanobots. Given that you cannot confine the speed or scope of their self-replication, only the nature of the transformation that they cause, you will have to make sure that they transform the world into a paradise rather than grey goo.
“uncertainty” is in your human understanding of the program, not in the actual program. A program doesn’t go “I don’t know what I’m supposed to do next”, it follows instructions step-by-step.
It would mean exactly what it’s programmed to mean, without any uncertainty in it at all.
Yes. To divide it more finely, it could be a terminal goal, or an instrumental goal; it could be a goal of the AI, or a goal of the human; it could be a goal the human would reflectively endorse, or a goal the human would reflectively reject but is inadvertently promoting anyway.
I agree that, at a given time, the AI must have a determinate goal. (Though the encoding of that goal may be extremely complicated and unintentional. And it may need to be time-indexed.) I’m not dogmatically set on the idea that a self-improving AGI is easy to program; at this point it wouldn’t shock me if it took over 100 years to finish making the thing. What you’re alluding to are the variety of ways we could fail to construct a self-improving AGI at all. Obviously there are plenty of ways to fail to make an AGI that can improve its own ability to track things about its environment in a domain-general way, without bursting into flames at any point. If there weren’t plenty of ways to fail, we’d have already succeeded.
Our main difference in focus is that I’m worried about what happens if we do succeed in building a self-improving AGI that doesn’t randomly melt down. Conditioned on our succeeding in the next few centuries in making a machine that actually optimizes for anything at all, and that optimizes for its own ability to generally represent its environment in a way that helps it in whatever else it’s optimizing for, we should currently expect humans to go extinct as a result. Even if the odds of our succeeding in the next few centuries were small, it would be worth thinking about how to make that extinction event less likely. (Though they aren’t small.)
I gather that you think that making an artificial process behave in any particular way at all (i.e., optimizing for something), while recursively doing surgery on its own source code in the radical way MIRI is interested in, is very tough. My concern is that, no matter how true that is, it doesn’t entail that if we succeed at that tough task, we’ll therefore have made much progress on other important tough tasks, like Friendliness. It does give us more time to work on Friendliness, but if we convince ourselves that intelligence explosion is a completely pie-in-the-sky possibility, then we won’t use that time effectively.
I also gather that you have a hard time imagining our screwing up on a goal architecture without simply breaking the AGI. Perhaps by ‘screwing up’ you’re imagining failing to close a set of parentheses. But I think you should be at least as worried about philosophical, as opposed to technical, errors. A huge worry isn’t just that we’ll fail to make the AI we intended; it’s that our intentions while we’re coding the thing will fail to align with the long-term interests of ourselves, much less of the human race.
We agree that it’s possible to ‘bind’ a superintelligence. (By this you don’t mean boxing it; you just mean programming it to behave in some ways as opposed to others.) But if the bindings fall short of Friendliness, while enabling superintelligence to arise at all, then a serious risk remains. Is your thought that Friendliness is probably an easier ‘binding’ to figure out how to code than are, say, resisting Pascal’s mugging, or having consistent arithmetical reasoning?
I am trying to understand if the kind of AI, that is underlying the scenario that you have in mind, is a possible and likely outcome of human AI research.
As far as I am aware, as a layman, goals and capabilities are intrinsically tied together. How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving a checkmate?
Coherent and specific goals are necessary to (1) decide which actions are instrumental useful (2) judge the success of self-improvement. If the given goal is logically incoherent, or too vague for the AI to be able to tell apart success from failure, would it work at all?
If I understand your position correctly, you would expect a chess playing general AI, one that does not know about checkmate, instead of “winning at chess”, to improve against such goals as “modeling states of affairs well” or “make sure nothing intervenes chess playing”. You believe that these goals do not have to be programmed by humans, because they are emergent goals, an instrumental consequence of being general intelligent.
These universal instrumental goals, these “AI drives”, seem to be a major reason for why you believe it to be important to make the AI care about behaving correctly. You believe that these AI drives are a given, and the only way to prevent an AI from being an existential risk is to channel these drives, is to focus this power on protecting and amplifying human values.
My perception is that these drives that you imagine are not special and will be as difficult to get “right” than any other goal. I think that the idea that humans not only want to make an AI exhibit such drives, but also succeed at making such drives emerge, is a very unlikely outcome.
As far as I am aware, here is what you believe an AI to want:
It will want to self-improve
It will want to be rational
It will try to preserve their utility functions
It will try to prevent counterfeit utility
It will be self-protective It will want to acquire resources and use them efficiently
What AIs that humans would ever want to create would require all of these drives, and how easy will it be for humans to make an AI exhibit these drives compared to making an AI that can do what humans want without these drives?
Take mathematics. What are the difficulties associated with making an AI better than humans at mathematics, and will an AI need these drives in order to do so?
Humans did not evolve to play chess or do mathematics. Yet it is considerably more difficult to design a chess AI than an AI that is capable of discovering interesting and useful mathematics.
I believe that the difficulty is due to the fact that it is much easier to formalize what it means to play chess than doing mathematics. The difference between chess and mathematics is that chess has a specific terminal goal in the form of a clear definition of what constitutes winning. Although mathematics has unambiguous rules, there is no specific terminal goal and no clear definition of what constitutes winning.
The progress of the capability of artificial intelligence is not only related to whether humans have evolved for a certain skill or to how much computational resources it requires but also to how difficult it is to formalize the skill, its rules and what it means to succeed.
In the light of this, how difficult would it be to program the drives that you imagine, versus just making an AI win against humans at a given activity without exhibiting these drives?
All these drives are very vague ideas, not like “winning at chess”, but more like “being better at mathematics than Terence Tao”.
The point I am trying to make is that these drives constitute additional complexity, rather than being simple ideas that you can just assume, and from which you can reason about the behavior of an AI.
It is this context that the “dumb superintelligence” argument tries to highlight. It is likely incredibly hard to make these drives emerge in a seed AI. They implicitly presuppose that humans succeed at encoding intricate ideas about what “winning” means in all those cases required to overpower humans, but not in the case of e.g. winning at chess or doing mathematics. I like to analogize such a scenario to the creation of a generally intelligent autonomous car that works perfectly well at not destroying itself in a crash but which somehow manages to maximize the number of people to run over.
I agree that if you believe that it is much easier to create a seed AI to exhibit the drives that you imagine, than it is to make a seed AI use its initial resources to figure out how to solve a specific problem, then we agree about AI risks.
Humans are capable of winning at chess without the terminal goal of doing so. Nor were humans designed by evolution specifically for chess. Why should we expect a general superintelligence to have intelligence that generalizes less easily than a human’s does?
You keep coming back to this ‘logically incoherent goals’ and ‘vague goals’ idea. Honestly, I don’t have the slightest idea what you mean by those things. A goal that can’t motivate one to do anything ain’t a goal; it’s decor, it’s noise. ‘Goals’ are just the outcomes systems tend to produce, especially systems too complex to be easily modeled as, say, physical or chemical processes. Certainly it’s possible for goals to be incredibly complicated, or to vary over time. But there’s no such thing as a ‘logically incoherent outcome’. So what’s relevant to our purposes is whether failing to make a powerful optimization process human-friendly will also consistently stop the process from optimizing for anything whatsoever.
Conditioned on a self-modifying AGI (say, an AGI that can quine its source code, edit it, then run the edited program and repeat the process) achieving domain-general situation-manipulating abilities (i.e., intelligence), analogous to humans’ but to a far greater degree, which of the AI drives do you think are likely to be present, and which absent? ‘It wants to self-improve’ is taken as a given, because that’s the hypothetical we’re trying to assess. Now, should we expect such a machine to be indifferent to its own survival and to the use of environmental resources?
Sometimes a more complex phenomenon is the implication of a simpler hypothesis. A much narrower set of goals will have intelligence-but-not-resource-acquisition as instrumental than will have both as instrumental, because it’s unlikely to hit upon a goal that requires large reasoning abilities but does not call for many material resources.
You haven’t given arguments suggesting that here. At most, you’ve given arguments against expecting a seed AI to be easy to invent. Be careful to note, to yourself and others, when you switch between the claims ‘a superintelligence is too hard to make’ and ‘if we made a superintelligence it would probably be safe’.
Well, I’m not sure what XXD means by them, but…
G1 (“Everything is painted red”) seems like a perfectly coherent goal. A system optimizing G1 paints things red, hires people to paint things red, makes money to hire people to paint things red, invents superior paint-distribution technologies to deposit a layer of red paint over things, etc.
G2 (“Everything is painted blue”) similarly seems like a coherent goal.
G3 (G1 AND G2) seems like an incoherent goal. A system with that goal… well, I’m not really sure what it does.
A system’s goals have to be some event that can be brought about. In our world, ‘2+2=4’ and ‘2+2=5’ are not goals; ‘everything is painted red and not-red’ may not be a goal for similar reasons. When we’re talking about an artificial intelligence’s preferences, we’re talking about the things it tends to optimize for, not the things it ‘has in mind’ or the things it believes are its preferences.
This is part of what makes the terminology misleading, and is also why we don’t ask ‘can a superintelligence be irrational?‘. Irrationality is dissonance between my experienced-‘goals’ (and/or, perhaps, reflective-second-order-‘goals’) and my what-events-I-produce-‘goals’; but we don’t care about the superintelligence’s phenomenology. We only care about what events it tends to produce.
Tabooing ‘goal’ and just talking about the events a process-that-models-its-environment-and-directs-the-future tends to produce would, I think, undermine a lot of XiXiDu’s intuitions about goals being complex explicit objects you have to painstakingly code in. The only thing that makes it more useful to model a superintelligence as having ‘goals’ than modeling a blue-minimizing robot as having ‘goals’ is that the superintelligence responds to environmental variation in a vastly more complicated way. (Because, in order to be even a mediocre programmer, its model-of-the-world-that-determines-action has to be more complicated than a simple camcorder feed.)
Oh.
Well, in that case, all right. If there exists some X a system S is in fact optimizing for, and what we mean by “S’s goals” is X, regardless of what target S “has in mind”, then sure, I agree that systems never have vague or logically incoherent goals.
Well, wait. Where did “models its environment” come from?
If we’re talking about the things S optimizes its environment for, not the things S “has in mind”, then it would seem that whether S models its environment or not is entirely irrelevant to the conversation.
In fact, given how you’ve defined “goal” here, I’m not sure why we’re talking about intelligence at all. If that is what we mean by “goal” then intelligence has nothing to do with goals, or optimizing for goals. Volcanoes have goals, in that sense. Protons have goals.
I suspect I’m still misunderstanding you.
From Eliezer’s Belief in Intelligence:
“Since I am so uncertain of Kasparov’s moves, what is the empirical content of my belief that ‘Kasparov is a highly intelligent chess player’? What real-world experience does my belief tell me to anticipate? [...]
“The empirical content of my belief is the testable, falsifiable prediction that the final chess position will occupy the class of chess positions that are wins for Kasparov, rather than drawn games or wins for Mr. G. [...] The degree to which I think Kasparov is a ‘better player’ is reflected in the amount of probability mass I concentrate into the ‘Kasparov wins’ class of outcomes, versus the ‘drawn game’ and ‘Mr. G wins’ class of outcomes.”
From Measuring Optimization Power:
“When I think you’re a powerful intelligence, and I think I know something about your preferences, then I’ll predict that you’ll steer reality into regions that are higher in your preference ordering. [...]
“Ah, but how do you know a mind’s preference ordering? Suppose you flip a coin 30 times and it comes up with some random-looking string—how do you know this wasn’t because a mind wanted it to produce that string?
“This, in turn, is reminiscent of the Minimum Message Length formulation of Occam’s Razor: if you send me a message telling me what a mind wants and how powerful it is, then this should enable you to compress your description of future events and observations, so that the total message is shorter. Otherwise there is no predictive benefit to viewing a system as an optimization process. This criterion tells us when to take the intentional stance.
“(3) Actually, you need to fit another criterion to take the intentional stance—there can’t be a better description that averts the need to talk about optimization. This is an epistemic criterion more than a physical one—a sufficiently powerful mind might have no need to take the intentional stance toward a human, because it could just model the regularity of our brains like moving parts in a machine.
“(4) If you have a coin that always comes up heads, there’s no need to say “The coin always wants to come up heads” because you can just say “the coin always comes up heads”. Optimization will beat alternative mechanical explanations when our ability to perturb a system defeats our ability to predict its interim steps in detail, but not our ability to predict a narrow final outcome. (Again, note that this is an epistemic criterion.)
“(5) Suppose you believe a mind exists, but you don’t know its preferences? Then you use some of your evidence to infer the mind’s preference ordering, and then use the inferred preferences to infer the mind’s power, then use those two beliefs to testably predict future outcomes. The total gain in predictive accuracy should exceed the complexity-cost of supposing that ‘there’s a mind of unknown preferences around’, the initial hypothesis.”
Notice that throughout this discussion, what matters is the mind’s effect on its environment, not any internal experience of the mind. Unconscious preferences are just as relevant to this method as are conscious preferences, and both are examples of the intentional stance. Note also that you can’t really measure the rationality of a system you’re modeling in this way; any evidence you raise for ‘irrationality’ could just as easily be used as evidence that the system has more complicated preferences than you initially thought, or that they’re encoded in a more distributed way than you had previously hypothesized.
My take-away from this is that there are two ways we generally think about minds on LessWrong: Rational Choice Theory, on which all minds are equally rational and strange or irregular behaviors are seen as evidence of strange preferences; and what we might call the Ideal Self Theory, on which minds’ revealed preferences can differ from their ‘true self’ preferences, resulting in irrationality. One way of unpacking my idealized values is that they’re the rational-choice-theory preferences I would exhibit if my conscious desires exhibited perfect control over my consciously controllable behavior, and those desires were the desires my ideal self would reflectively prefer, where my ideal self is the best trade-off between preserving my current psychology and enhancing that psychology’s understanding of itself and its environment.
We care about ideal selves when we think about humans, because we value our conscious, ‘felt’ desires (especially when they are stable under reflection) more than our unconscious dispositions. So we want to bring our actual behavior (and thus our rational-choice-theory preferences, the ‘preferences’ we talk about when we speak of an AI) more in line with our phenomenological longings and their idealized enhancements. But since we don’t care about making non-person AIs more self-actualized, but just care about how they tend to guide their environment, we generally just assume that they’re rational. Thus if an AI behaves in a crazy way (e.g., alternating between destroying and creating paperclips depending on what day of the week it is), it’s not because it’s a sane rational ghost trapped by crazy constraints. It’s because the AI has crazy core preferences.
Yes, in principle. But in practice, a system that doesn’t have internal states that track the world around it in a reliable and useable way won’t be able to optimize very well for anything particularly unlikely across a diverse set of environments. In other words, it won’t be very intelligent. To clarify, this is an empirical claim I’m making about what it takes to be particularly intelligent in our universe; it’s not part of the definition for ‘intelligent’.
Yes, that seems plausible.
I would say rather that modeling one’s environment is an effective tool for consistently optimizing for some specific unlikely thing X across a range of environments, so optimizers that do so will be more successful at optimizing for X, all else being equal, but it more or less amounts to the same thing.
But… so what?
I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal… but that doesn’t stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal.
So why isn’t modeling the environment equally irrelevant? Both features, on your account, are optional enhancements an optimizer might or might not display.
It keeps seeming like all the stuff you quote and say before your last two paragraphs ought to provide an answer that question, but after reading it several times I can’t see what answer it might be providing. Perhaps your argument is just going over my head, in which case I apologize for wasting your time by getting into a conversation I’m not equipped for..
Maybe it will help to keep in mind that this is one small branch of my conversation with Alexander Kruel. Alexander’s two main objections to funding Friendly Artificial Intelligence research are that (1) advanced intelligence is very complicated and difficult to make, and (2) getting a thing to pursue a determinate goal at all is extraordinarily difficult. So a superintelligence will never be invented, or at least not for the foreseeable future; so we shouldn’t think about SI-related existential risks. (This is my steel-manning of his view. The way he actually argues seems to instead be predicated on inventing SI being tied to perfecting Friendliness Theory, but I haven’t heard a consistent argument for why that should be so.)
Both of these views, I believe, are predicated on a misunderstanding of how simple and disjunctive ‘intelligence’ and ‘goal’ are, for present purposes. So I’ve mainly been working on tabooing and demystifying those concepts. Intelligence is simply a disposition to efficiently convert a wide variety of circumstances into some set of specific complex events. Goals are simply the circumstances that occur more often when a given intelligence is around. These are both very general and disjunctive ideas, in stark contrast to Friendliness; so it will be difficult to argue that a superintelligence simply can’t be made, and difficult too to argue that optimizing for intelligence requires one to have a good grasp on Friendliness Theory.
Because I’m trying to taboo the idea of superintelligence, and explain what it is about seed AI that will allow it to start recursively improving its own intelligence, I’ve been talking a lot about the important role modeling plays in high-level intelligent processes. Recognizing what a simple idea modeling is, and how far it gets one toward superintelligence once one has domain-general modeling proficiency, helps a great deal with greasing the intuition pump ‘Explosive AGI is a simple, disjunctive event, a low-hanging fruit, relative to Friendliness.’ Demystifying unpacking makes things seem less improbable and convoluted.
I think this is a map/territory confusion. I’m not denying that superintelligences will have a map of their own preferences; at a bare minimum, they need to know what they want in order to prevent themselves from accidentally changing their own preferences. But this map won’t be the AI’s preferences—those may be a very complicated causal process bound up with, say, certain environmental factors surrounding the AI, or oscillating with time, or who-knows-what.
There may not be a sharp line between the ‘preference’ part of the AI and the ‘non-preference’ part. Since any superintelligence will be exemplary at reasoning with uncertainty and fuzzy categories, I don’t think that will be a serious obstacle.
Does that help explain why I’m coming from? If not, maybe I’m missing the thread unifying your comments.
I suppose it helps, if only in that it establishes that much of what you’re saying to me is actually being addressed indirectly to somebody else, so it ought not surprise me that I can’t quite connect much of it to anything I’ve said. Thanks for clarifying your intent.
For my own part, I’m certainly not functioning here as Alex’s proxy; while I don’t consider explosive intelligence growth as much of a foregone conclusion as many folks here do, I also don’t consider Alex’s passionate rejection of the possibility justified, and have had extended discussions on related subjects with him myself in past years. So most of what you write in response to Alex’s positions is largely talking right past me.
(Which is not to say that you ought not be doing it. If this is in effect a private argument between you and Alex that I’ve stuck my nose into, let me know and I’ll apologize and leave y’all to it in peace.)
Anyway, I certainly agree that a system might have a representation of its goals that is distinct from the mechanisms that cause it to pursue those goals. I have one of those, myself. (Indeed, several.) But if a system is capable of affecting its pursuit of its goals (for example, if it is capable of correcting the effects of a state-change that would, uncorrected, have led to value drift), it is not merely interacting with maps. It is also interacting with the territory… that is, it is modifying the mechanisms that cause it to pursue those goals… in order to bring that territory into line with its pre-existing map.
And in order to do that, it must have such a mechanism, and that mechanism must be consistently isomorphic to its representations of its goals.
Yes?
Right. I’m not saying that there aren’t things about the AI that make it behave the way it does; what the AI optimizes for is a deterministic result of its properties plus environment. I’m just saying that something about the environment might be necessary for it to have the sorts of preferences we can most usefully model it as having; and/or there may be multiple equally good candidates for the parts of the AI that are its values, or their encoding. If we reify preferences in an uncautious way, we’ll start thinking of the AI’s ‘desires’ too much as its first-person-experienced urges, as opposed to just thinking of them as the effect the local system we’re talking about tends to have on the global system.
Hm.
So, all right. Cconsider two systems, S1 and S2, both of which happen to be constructed in such a way that right now, they are maximizing the number of things in their environment that appear blue to human observers, by going around painting everything blue.
Suppose we add to the global system a button that alters all human brains so that everything appears blue to us, and we find that S1 presses the button and stops painting, and S2 ignores the button and goes on painting.
Suppose that similarly, across a wide range of global system changes, we find that S1 consistently chooses the action that maximizes the number of things in its environment that appear blue to human observers, while S2 consistently goes on painting.
I agree with you that if I reify S2′s preferences in an uncautious way, I might start thinkng of S2 as “wanting to paint things blue” or “wanting everything to be blue” or “enjoying painting things blue” or as having various other similar internal states that might simply not exist, and that I do better to say it has a particular effect on the global system. S2 simply paints things blue; whether it has the goal of painting things blue or not, I have no idea.
I am far less comfortable saying that S1 has no goals, precisely because of how flexibly and consistently it is revising its actions so as to consistently create a state-change across wide ranges of environments. To use Dennett’s terminology, I am more willing to adopt an intentional stance with respect to S1 than S2.
If I’ve understood your position correctly, you’re saying that I’m unjustified in making that distinction… that to the extent that we can say that S1 and S2 have “goals,” the word “goals” simply refer to the state changes they create in the world. Initially they both have the goal of painting things blue, but S1′s goals keep changing: first it paints things blue, then it presses a button, then it does other things. And, sure, I can make up some story like “S1 maximizes the number of things in its environment that appear blue to human observers, while S2 just paints stuff blue” and that story might even have predictive power, but I ought not fall into the trap of reifying some actual thing that corresponds to those notional “goals”.
Am I in the right ballpark?
I think you’re switching back and forth between a Rational Choice Theory ‘preference’ and an Ideal Self Theory ‘preference’. To disambiguate, I’ll call the former R-preferences and the latter I-preferences. My R-preferences—the preferences you’d infer I had from my behaviors if you treated me as a rational agent—are extremely convoluted, indeed they need to be strongly time-indexed to maintain consistency. My I-preferences are the things I experience a desire for, whether or not that desire impacts my behavior. (Or they’re the things I would, with sufficient reflective insight and understanding into my situation, experience a desire for.)
We have no direct evidence from your story addressing whether S1 or S2 have I-preferences at all. Are they sentient? Do they create models of their own cognitive states? Perhaps we have a little more evidence that S1 has I-preferences than that S2 does, but only by assuming that a system whose goals require more intelligence or theory-of-mind will have a phenomenology more similar to a human’s. I wouldn’t be surprised if that assumption turns out to break down in some important ways, as we explore more of mind-space.
But my main point was that it doesn’t much matter what S1 or S2′s I-preferences are, if all we’re concerned about is what effect they’ll have on their environment. Then we should think about their R-preferences, and bracket exactly what psychological mechanism is resulting in their behavior, and how that psychological mechanism relates to itself.
I’ve said that R-preferences are theoretical constructs that happen to be useful a lot of the time for modeling complex behavior; I’m not sure whether I-preferences are closer to nature’s joints.
S1′s instrumental goals may keep changing, because its circumstances are changing. But I don’t think its terminal goals are changing. The only reason to model it as having two completely incommensurate goal sets at different times would be if there were no simple terminal goal that could explain the change in instrumental behavior.
I don’t think I’m switching back and forth between I-preferences and R-preferences.
I don’t think I’m talking about I-preferences at all, nor that I ever have been.
I completely agree with you that they don’t matter for our purposes here, so if I am talking about them, I am very very confused. (Which is certainly possible.)
But I don’t think that R-preferences (preferences, goals, etc.) can sensibly be equated with the actual effects a local system has on a global system. If they could, we could talk equally sensibly about earthquakes having R-preferences (preferences, goals, etc.), and I don’t think it’s sensible to talk that way.
R-preferences (preferences, goals, etc.) are, rather, internal states of a system S.
If S is a competent optimizer (or “rational agent,” if you prefer) with R-preferences (preferences, goals, etc.) P, the existence of P will cause S to behave in ways that cause isomorphic effects (E) on a global system, so we can use observations of E as evidence of P (positing that S is a competent optimizer) or as evidence that S is a competent optimizer (positing the existence of P) or a little of both.
But however we slice it, P is not the same thing as E, E is merely evidence of P’s existence. We can infer P’s existence in other ways as well, even if we never observe E… indeed, even if E never gets produced. And the presence or absence of a given P in S is something we can be mistaken about; there’s a fact of the matter.
I think you disagree with the above paragraph, because you describe R-preferences (preferences, goals, etc.) as theoretical constructs rather than parts of the system, which suggests that there is no fact of the matter… a different theoretical approach might never include P, and it would not be mistaken, it would just be a different theoretical approach.
I also think that because way back at the beginning of this exchange when I suggested “paint everything red AND paint everything blue” was an example of an incoherent goal (R-preference, preference, P), your reply was that it wasn’t a goal at all, since that state can’t actually exist in the world. Which suggests that you don’t see goals as internal states of optimizers and that you do equate P with E.
This is what I’ve been disputing from the beginning.
But to be honest, I’m not sure whether you disagree or not, as I’m not sure we have yet succeeded in actually engaging with one another’s ideas in this exchange.
You can treat earthquakes and thunderstorms and even individual particles as having ‘preferences’. It’s just not very useful to do so, because we can give an equally simple explanation for what effects things like earthquakes tend to have that is more transparent about the physical mechanism at work. The intentional strategy is a heuristic for black-boxing physical processes that are too complicated to usefully describe in their physical dynamics, but that can be discussed in terms of the complicated outcomes they tend to promote.
(I’d frame it: We’re exploiting the fact that humans are intuitively dualistic by taking the non-physical modeling device of humans (theory of mind, etc.) and appropriating this mental language and concept-web for all sorts of systems whose nuts and bolts we want to bracket. Slightly regimented mental concepts and terms are useful, not because they apply to all the systems we’re talking about in the same way they were originally applied to humans, but because they’re vague in ways that map onto the things we’re uncertain about or indifferent to.)
‘X wants to do Y’ means that the specific features of X tend to result in Y when its causal influence is relatively large and direct. But, for clarity’s sake, we adopt the convention of only dropping into want-speak when a system is too complicated for us to easily grasp in mechanistic terms why it’s having these complex effects, yet when we can predict that, whatever the mechanism happens to be, it is the sort of mechanism that has those particular complex effects.
Thus we speak of evolution as an optimization process, as though it had a ‘preference ordering’ in the intuitively human (i.e., I-preference) sense, even though in the phenomenological sense it’s just as mindless as an earthquake. We do this because black-boxing the physical mechanisms and just focusing on the likely outcomes is often predictively useful here, and because the outcomes are complicated and specific. This is useful for AIs because we care about the AI’s consequences and not its subjectivity (hence we focused on R-preference), and because AIs are optimization processes of even greater complex specificity in mechanism and outcome than evolution (hence we adopted the intentional stance of ‘preference’-talk in the first place).
I agree this is often the case, because when we define ‘what is this system capable of?’ we often hold the system fixed while examining possible worlds where the environment varies in all kinds of ways. But if the possible worlds we care about all have a certain environmental feature in common—say, because we know in reality that the environmental condition obtains, and we’re trying to figure out all the ways the AI might in fact behave given different values for the variables we don’t know about with confidence—then we may, in effect, include something about the environment ‘in the AI’ for the purposes of assessing its optimization power and/or preference ordering.
For instance, we might model the AI as having the preference ‘surround the Sun with a dyson sphere’ rather than ‘conditioned on there being a Sun, surround it with a dyson sphere’; if we do the former, then the fact that that is the system’s preference depends in part on the actual existence of the Sun. Does that mean the Sun is a part of the AI’s preference encoding? Is the Sun a component of the AI? I don’t think these questions are important or interesting, so I don’t want us to be too committed to reifying AI preferences. They’re just a useful shorthand for the expected outcomes of the AI’s distinguishing features having a more large and direct causal impact on things.
Yes, agreed, for some fuzzy notion of “easily grasp” and “too complicated.” That is, there’s a sense in which thunderstorms are too complicated for me to describe in mechanistic terms why they’re having the effects they have… I certainly can’t predict those effects. But there’s also a sense in which I can describe (and even predict) the effects of a thunderstorm that feels simple, whereas I can’t do the same thing for a human being without invoking “want-speak”/intentional stance.
I’m not sure any of this is [i]justified[/i], but I agree that it is what we do… this is how we speak, and we draw these distinctions. So far, so good.
I’m not really sure what you mean by “in the AI” here, but I guess I agree that the boundary between an agent and its environment is always a fuzzy one. So, OK, I suppose we can include things about the environment “in the AI” if we choose. (I can similarly choose to include things about the environment “in myself.”) So far, so good.
Here is where you lose me again… once again you talk as though there’s simply no fact of the matter as to which preference the AI has, merely our choice as to how we model it.
But it seems to me that there are observations I can make which would provide evidence one way or the other. For example, if it has the preference ‘surround the Sun with a dyson sphere,’ then in an environment lacking the Sun I would expect it to first seek to create the Sun… how else can it implement its preferences? Whereas if it has the preference ‘conditioned on there being a Sun, surround it with a dyson sphere’; in an environment lacking the Sun I would not expect it to create the Sun.
So does the AI seek create the Sun in such an environment, or not? Surely that doesn’t depend on how I choose to model it. The AI’s preference is whatever it is, and controls its behavior. Of course, as you say, if the real world always includes a sun, then I might not be able to tell which preference the AI has. (Then again I might… the test I describe above isn’t the only test I can perform, just the first one I thought of, and other tests might not depend on the Sun’s absence.)
But whether I can tell or not doesn’t affect whether the AI has the preference or not.
Again, no. Regardless of how we model it, the system’s preference is what it is, and we can study the system (e.g., see whether it creates the Sun) to develop more accurate models of its preferences.
I agree. But I do think the question of what the AI (or, more generally, an optimizing agent) will do in various situations is interesting, and it seems to be that you’re consistently eliding over that question in ways I find puzzling.
This sounds like a potentially confusing level of simplification; a goal should be regarded as at least a way of comparing possible events.
Its behavior is what makes its goal important. But in a system designed to follow an explicitly specified goal, it does make sense to talk of its goal apart from its behavior. Even though its behavior will reflect its goal, the goal itself will reflect itself better.
If the goal is implemented as a part of the system, other parts of the system can store some information about the goal, certain summaries or inferences based on it. This information can be thought of as beliefs about the goal. And if the goal is not “logically transparent”, that is its specification is such that making concrete conclusions about what it states in particular cases is computationally expensive, then the system never knows what its goal says explicitly, it only ever has beliefs about particular aspects of the goal.
Perhaps, but I suspect that for most possible AIs there won’t always be a fact of the matter about where its preference is encoded. The blue-minimizing robot is a good example. If we treat it as a perfectly rational agent, then we might say that it has temporally stable preferences that are very complicated and conditional; or we might say that its preferences change at various times, and are partly encoded, for instance, in the properties of the color-inverting lens on its camera. An AGI’s response to environmental fluctuation will probably be vastly more complicated than a blue-minimizer’s, but the same sorts of problems arise in modeling it.
I think it’s more useful to think of rational-choice-theory-style preferences as useful theoretical constructs—like a system’s center of gravity, or its coherently extrapolated volition—than as real objects in the machine’s hardware or software. This sidesteps the problem of haggling over which exact preferences a system has, how those preferences are distributed over the environment, how to decide between causally redundant encodings which is ‘really’ the preference encoding, etc. See my response to Dave.
“Goal” is a natural idea for describing AIs with limited resources: these AIs won’t be able to make optimal decisions, and their decisions can’t be easily summarized in terms of some goal, but unlike the blue-minimizing robot they have a fixed preference ordering that doesn’t gradually drift away from what it was originally, and eventually they tend to get better at following it.
For example, if a goal is encrypted, and it takes a huge amount of computation to decrypt it, system’s behavior prior to that point won’t depend on the goal, but it’s going to work on decrypting it and eventually will follow it. This encrypted goal is probably more predictive of long-term consequences than anything else in the details of the original design, but it also doesn’t predict its behavior during the first stage (and if there is only a small probability that all resources in the universe will allow decrypting the goal, it’s probable that system’s behavior will never depend on the goal). Similarly, even if there is no explicit goal, as in the case of humans, it might be possible to work with an idealized goal that, like the encrypted goal, can’t be easily evaluated, and so won’t influence behavior for a long time.
My point is that there are natural examples where goals and the character of behavior don’t resemble each other, so that each can’t be easily inferred from the other, while both can be observed as aspects of the system. It’s useful to distinguish these ideas.
I agree preferences aren’t reducible to actual behavior. But I think they are reducible to dispositions to behave, i.e., behavior across counterfactual worlds. If a system prefers a specific event Z, that means that, across counterfactual environments you could have put it in, the future would on average have had more Z the more its specific distinguishing features had a large and direct causal impact on the world.
The examples I used seem to apply to “dispositions” to behave, in the same way (I wasn’t making this distinction). There are settings where the goal can’t be clearly inferred from behavior, or collection of hypothetical behaviors in response to various environments, at least if we keep environments relatively close to what might naturally occur, even as in those settings the goal can be observed “directly” (defined as an idealization based in AI’s design).
An AI with encypted goal (i.e. the AI itself doesn’t know the goal in explicit form, but the goal can be abstractly defined as the result of decryption) won’t behave in accordance with it in any environment that doesn’t magically let it decrypt its goal quickly, there is no tendency to push the events towards what the encrypted goal specifies, until the goal is decrypted (which might be never with high probability).
I don’t think a sufficiently well-encrypted ‘preference’ should be counted as a preference for present purposes. In principle, you can treat any physical chunk of matter as an ‘encrypted preference’, because if the AI just were a key of exactly the right shape, then it could physically interact with the lock in question to acquire a new optimization target. But if neither the AI nor anything very similar to the AI in nearby possible worlds actually acts as a key of the requisite sort, then we should treat the parts of the world that a distant AI could interact with to acquire a preference as, in our world, mere window dressing.
Perhaps if we actually built a bunch of AIs, and one of them was just like the others except where others of its kind had a preference module, it had a copy of The Wind in the Willows, we would speak of this new AI as having an ‘encrypted preference’ consisting of a book, with no easy way to treat that book as a decision criterion like its brother- and sister-AIs do for their homologous components. But I don’t see any reason right now to make our real-world usage of the word ‘preference’ correspond to that possible world’s usage. It’s too many levels of abstraction away from what we should be worried about, which are the actual real-world effects different AI architectures would have.
Here is what I mean:
Evolution was able to come up with cats. Cats are immensely complex objects. Evolution did not intend to create cats. Now consider you wanted to create an expected utility maximizer to accomplish something similar, except that it would be goal-directed, think ahead, and jump fitness gaps. Further suppose that you wanted your AI to create qucks, instead of cats. How would it do this?
Given that your AI is not supposed to search design space at random, but rather look for something particular, you would have to define what exactly qucks are. The problem is that defining what a quck is, is the hardest part. And since nobody has any idea what a quck is, nobody can design a quck creator.
The point is that thinking about the optimization of optimization is misleading, as most of the difficulty is with defining what to optimize, rather than figuring out how to optimize it. In other words, the efficiency of e.g. the scientific method depends critically on being able to formulate a specific hypothesis.
Trying to create an optimization optimizer would be akin to creating an autonomous car to find the shortest route between Gotham City and Atlantis. The problem is not how to get your AI to calculate a route, or optimize how to calculate such a route, but rather that the problem is not well-defined. You have no idea what it means to travel between two fictional cities. Which in turn means that you have no idea what optimization even means in this context, let alone meta-level optimization.
The problem is, you don’t have to program the bit that says “now make yourself more intelligent.” You only have to program the bit that says “here’s how to make a new copy of yourself, and here’s how to prove it shares your goals without running out of math.”
And the bit that says “Try things until something works, then figure out why it worked.” AKA modeling.
The AI isn’t actually an intelligence optimizer. But it notes that when it takes certain actions, it is better able to model the world, which in turn allows it to make more paperclips (or whatever). So it’ll take those actions more often.
Biological evolution is not the full picture here. Humans were programmed to be capable of winning at chess, and to care to do so, by cultural evolution, education, and environmental feedback in the form of incentives given by other people challenging them to play.
I don’t know how this works. But I do not dispute the danger of neuromorphic AIs, as you know from a comment elsewhere.
Do you suggest that from the expected behavior of neuromorphic AIs it is possible to draw conclusions about the behavior of what you call a ‘seed AI’? Would such a seed AI, as would be the case with neuromorphic AIs, be constantly programmed by environmental feedback?
What I mean is that if you program a perfect scientist but give this perfect scientist a hypothesis that does not make any predictions, then it will not be able to unfold its power.
I believe that I already wrote that I do not dispute that the idea you seem to have in mind is a risk by definition. If such an AI is likely, then we are likely going extinct if we fail at making it care about human values.
I feel uncomfortable to say this, but I do not see that the burden of proof is on me to show that it takes deliberate and intentional effort to make an AI exhibit those drives, as long that is not part of your very definition. I find the current argument in favor of AI drives to be thoroughly unconvincing.
The former has always been one of the arguments in favor of the latter in the posts I wrote on my blog.
(Note: I’m also a layman, so my non-expert opinions necessarily come with a large salt side-dish)
My guess here is that most of the “AI Drives” to self-improve, be rational, retaining it’s goal structure, etc. are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information or keep after it’s objective it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted advertising program.
The rest, such as self-preservation, are justified as being logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value it’s own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of ‘drives’ would emerge from most-any goal, but then again my intuition is not necessarily very useful for these sorts of questions.
This point might also be a source of confusion;
As Dr Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, Winning Chess would be a ‘theoryful’ task while Discovering (Interesting) Mathematical Proofs would be a ‘theoryless’ one. In essence, the theoryful has simple and well established rules for the process which could be programmed optimally in advance with little-to-no modification needed afterwards while the theoryless is complex and messy enough that an imperfect (Probably Approximately Correct) learning process would have to be employed to suss out all the rules.
Now obviously the program will benefit from labeling in it’s training data for what is and is not an “interesting” mathematical proof, otherwise it can just screw around with computationally-cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden tank example shows, insufficient labeling or bad labels will lead to other unintended results.
So applying that back to Friendliness; despite attempts to construct a Fun Theory, human value is currently (and may well forever remain) theoryless. A learning process whose goal is to maximize human value is going to have to be both well constructed and have very good labels initially to not be Unfriendly. Of course, it could very well correct itself later on, that is in fact at the core of a PAC algorithm, but then we get into questions of FOOM-ing and labels of human value in the environment which I am not equipped to deal with.
To explain what I have in mind, consider Ben Goertzel’s example of how to test for general intelligence:
I do not disagree that such a robot, when walking towards the classroom, if it is being obstructed by a fellow human student, could attempt to kill this human, in order to get to the classroom.
Killing a fellow human, from the perspective of the human creators of the robot, is clearly a mistake. From a human perspective, it means that the robot failed.
You believe that the robot was just following its programming/construction. Indeed, the robot is its programming. I agree with this. I agree that the human creators were mistaken about what dynamic state sequence the robot will exhibit by computing the code.
What the “dumb superintelligence” argument tries to highlight is that if humans are incapable of predicting such behavior, then they will also be mistaken about predicting behavior that is harmful to the robots power. For example, while fighting with the human in order to kill it, for a split-second it mistakes its own arm with that of the human and breaks it.
You might now argue that such a robot isn’t much of a risk. It is pretty stupid to mistake its own arm with that of the enemy it tries to kill. True. But the point is that there is no relevant difference between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. Except that you might believe the former is much easier than the latter. I dispute this.
For the robot to master a complex environment, like a university full of humans, without harming itself, or decreasing the chance of achieving its goals, is already very difficult. Not stabbing or strangling other human students is not more difficult than not jumping from the 4th floor, instead of taking the stairs. This is the “dumb superintelligence” argument.
To some extent. Perhaps it would be helpful to distinguish four different kinds of defeater:
early intelligence defeater: We try to build a seed AI, but our self-rewriting AI quickly hits a wall or explodes. This is most likely if we start with a subhuman intelligence and have serious resource constraints (so we can’t, e.g., just run an evolutionary algorithm over millions of copies of the AGI until we happen upon a variant that works).
late intelligence defeater: The seed AI works just fine, but at some late stage, when it’s already at or near superintelligence, it suddenly explodes. Apparently it went down a blind alley at some point early on that led it to plateau or self-destruct later on, and neither it nor humanity is smart enough yet to figure out where exactly the problem arose. So the FOOM fizzles.
early Friendliness defeater: From the outset, the seed AI’s behavior already significantly diverges from Friendliness.
late Friendliness defeater: The seed AI starts off as a reasonable approximation of Friendliness, but as it approaches superintelligence its values diverge from anything we’d consider Friendly, either because it wasn’t previously smart enough to figure out how to self-modify while keeping its values stable, or because it was never perfectly Friendly and the new circumstances its power puts it in now make the imperfections much more glaring.
In general, late defeaters are much harder for humans to understand than early defeaters, because an AI undergoing FOOM is too fast and complex to be readily understood. Your three main arguments, if I’m understanding them, have been:
(a) Early intelligence defeaters are so numerous that there’s no point thinking much about other kinds of defeaters yet.
(b) Friendliness defeaters imply a level of incompetence on the programmers’ part that strongly suggest intelligence defeaters will arise in the same situation.
(c) If an initially somewhat-smart AI is smart enough to foresee and avoid late intelligence defeaters, then an initially somewhat-nice AI should be smart enough to foresee and avoid late Friendliness defeaters.
I reject (a), because I haven’t seen any specific reason a self-improving AGI will be particularly difficult to make FOOM—‘it would require lots of complicated things to happen’ is very nearly a fully general argument against any novel technology, so I can’t get very far on that point alone. I accept (b), at least for a lot of early defeaters. But my concern is that while non-Friendliness predicts non-intelligence (and non-intelligence predicts non-Friendliness), intelligence also predicts non-Friendliness.
But our interesting disagreement seems to be over (c). Interesting because it illuminates general differences between the basic idea of a domain-general optimization process (intelligence) and the (not-so-)basic idea of Everything Humans Like. One important difference is that if an AGI optimizes for anything, it will have strong reason to steer clear of possible late intelligence defeaters. Late Friendliness defeaters, on the other hand, won’t scare optimization-process-optimizers in general.
It’s easy to see in advance that most beings that lack obvious early Friendliness defeaters will nonetheless have late Friendliness defeaters. In contrast, it’s much less clear that most beings lacking early intelligence defeaters will have late intelligence defeaters. That’s extremely speculative at this point—we simply don’t know what sorts of intelligence-destroying attractors might exist out there, or what sorts of paradoxes and complications are difficult v. trivial to overcome.
But, once again, it doesn’t take any stupidity on the AI’s part to disvalue physically injuring a human, even if it does take stupidity to not understand that one is physically injuring a human. It only takes a different value system. Valuing one’s own survival is not orthogonal to valuing becoming more intelligent; but valuing human survival is orthogonal to valuing becoming more intelligent. (Indeed, to the extent they aren’t orthogonal it’s because valuing becoming more intelligent tends to imply disvaluing human survival, because humans are hard to control and made of atoms that can be used for other ends, including increased computing power.) This is the whole point of the article we’re commenting on.
Here is part of my stance towards AI risks:
1. I assign a negligible probability to the possibility of a sudden transition from well-behaved narrow AIs to general AIs (see below).
2. An AI will not be pulled at random from mind design space. An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure.
3. Commercial, research or military products are created with efficiency in mind. An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research. If early stages showed that inputs such as the natural language query would yield results such as then the AI would never reach a stage in which it was sufficiently clever and trained to understand what results would satisfy its creators in order to deceive them.
4. I assign a negligible probability to the possibility of a consequentialist AI / expected utility maximizer / approximation to AIXI.
Given that the kind of AIs from point 4 are possible:
5. Omohundro’s AI drives are what make the kind of AIs mentioned in point 1 dangerous. Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.), or should otherwise be easy compared to the general difficulties involved in making an AI work using limited resources.
6. An AI from point 4 will only ever do what it has been explicitly programmed to do. Such an AI is not going to protect its utility-function, acquire resources or preemptively eliminate obstacles in an unbounded fashion. Because it is not intrinsically rational to do so. What specifically constitutes rational, economic behavior is inseparable with an agent’s terminal goal. That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.
7. Unintended consequences are by definition not intended. They are not intelligently designed but detrimental side effects, failures. Whereas intended consequences, such as acting intelligently, are intelligently designed. If software was not constantly improved to be better at doing what humans intend it to do we would never be able to reach a level of sophistication where a software could work well enough to outsmart us. To do so it would have to work as intended along a huge number of dimensions. For an AI to constitute a risk as a result of unintended consequences those unintended consequences would have to have no, or little, negative influence on the huge number of intended consequences that are necessary for it to be able to overpower humanity.
I am not yet at a point of my education where I can say with confidence that this is the wrong way to think, but I do believe it is.
If someone walked up to you and told you about a risk only he can solve, and that you should therefore give this person money, would you give him money because you do not see any specific reason for why he could be wrong? Personally I would perceive the burden of proof to be on him to show me that the risk is real.
Despite this, I have specific reasons to personally believe that the kind of AI you have in mind is impossible. I have thought about such concepts as consequentialism / expected utility maximization, and do not see that they could be made to work, other than under very limited circumstances. And I also asked other people, outside of LessWrong, who are more educated and smarter than me, and they also told me that these kind of AIs are not feasible, they are uncomputable.
I am not sure I understand what you mean by c. I don’t think I agree with it.
I don’t know what this means.
That this black box you call “intelligence” might be useful to achieve a lot of goals is not an argument in support of humans wanting to and succeeding at the implementation of “value to maximize intelligence” in conjunction with “by all means”.
Most definitions of intelligence that I am aware of are in terms of the ability to achieve goals. Saying that a system values to become more intelligent then just means that a system values to increase its ability to achieve its goals. In this context, what you suggest is that humans will want to, and will succeed to, implement an AI that in order to beat humans at Tic-tac-toe is first going to take over the universe and make itself capable of building such things as Dyson spheres.
What I am saying is that it is much easier to create a Tic-tac-toe playing AI, or an AI that can earn a university degree, than the former in conjunction with being able to take over the universe and build Dyson spheres.
The argument that valuing not to kill humans is orthogonal to taking over the universe and building Dyson spheres is completely irrelevant.
I don’t think anyone’s ever disputed this. (However, that’s not very useful if the deterministic process resulting in the SI is too complex for humans to distinguish it in advance from the outcome of a random walk.)
Agreed. But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires. The concern isn’t that we’ll suddenly start building AIs with the express purpose of hitting humans in the face with mallets. The concern is that we’ll code for short-term rather than long-term goals, due to a mixture of disinterest in Friendliness and incompetence at Friendliness. But if intelligence explosion occurs, ‘the long run’ will arrive very suddenly, and very soon. So we need to adjust our research priorities to more seriously assess and modulate the long-term consequences of our technology.
That may be a reason to think that recursively self-improving AGI won’t occur. But it’s not a reason to expect such AGI, if it occurs, to be Friendly.
The seed is not the superintelligence. We shouldn’t expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly.
I’m not following. Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code?
You don’t seem to be internalizing my arguments. This is just the restatement of a claim I pointed out was not just wrong but dishonestly stated here.
Sure, but the list of instrumental goals overlap more than the list of terminal goals, because energy from one project can be converted to energy for a different project. This is an empirical discovery about our world; we could have found ourselves in the sort of universe where instrumental goals don’t converge that much, e.g., because once energy’s been locked down into organisms or computer chips you just Can’t convert it into useful work for anything else. In a world where we couldn’t interfere with the AI’s alien goals, nor could our component parts and resources be harvested to build very different structures, nor could we be modified to work for the AI, the UFAI would just ignore us and zip off into space to try and find more useful objects. We don’t live in that world because complicated things can be broken down into simpler things at a net gain in our world, and humans value a specific set of complicated things.
‘These two sets are both infinite’ does not imply ‘we can’t reason about these two things’ relative size, or how often the same elements recur in their elements’.
You’ve spent an awful lot of time writing about the varied ways in which you’ve not yet been convinced by claims you haven’t put much time into actively investigating. Maybe some of that time could be better spent researching these topics you keep writing about? I’m not saying to stop talking about this, but there’s plenty of material on a lot of these issues to be found. Have you read Intelligence Explosion Microeconomics?
http://wiki.lesswrong.com/wiki/Optimization_process
As a rule, adding halting conditions adds complexity to an algorithm, rather than removing complexity.
No, this is a serious misunderstanding. Yudkowsky’s definition of ‘intelligence’ is about the ability to achieve goals in general, not about the ability to achieve the system’s goals. That’s why you can’t increase a system’s intelligence by lowering its standards, i.e., making its preferences easier to satisfy.
Straw-man; no one has claimed that humans are likely to want to create an UFAI. What we’ve suggested is that humans are likely to want to create an algorithm, X, that will turn out to be a UFAI. (In other words, the fallacy you’re committing is confusing intension with extension.)
That aside: Are you saying Dyson spheres wouldn’t be useful for beating more humans at more tic-tac-toe games? Seems like a pretty good way to win at tic-tac-toe to me.
Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals, but if your goals are very relaxed then the volume of outcome space with equal or greater utility will be very large. However one would expect that many of the processes involved in hitting a narrow target in outcome space (such that few other outcomes are rated equal or greater in the agent’s preference ordering), such as building a good epistemic model or running on a fast computer, would generalize across many utility functions; this is why we can speak of properties apt to intelligence apart from particular utility functions.
Hmm. But this just sounds like optimization power to me. You’ve defined intelligence in the past as “efficient cross-domain optimization”. The “cross-domain” part I’ve taken to mean that you’re able to hit narrow targets in general, not just ones you happen to like. So you can become more intelligent by being better at hitting targets you hate, or by being better at hitting targets you like.
The former are harder to test, but something you’d hate doing now could become instrumentally useful to know how to do later. And your intelligence level doesn’t change when the circumstance shifts which part of your skillset is instrumentally useful. For that matter, I’m missing why it’s useful to think that your intelligence level could drastically shift if your abilities remained constant but your terminal values were shifted. (E.g., if you became pickier.)
No, “cross-domain” means that I can optimize across instrumental domains. Like, I can figure out how to go through water, air, or space if that’s the fastest way to my destination, I am not limited to land like a ground sloth.
Measured intelligence shouldn’t shift if you become pickier—if you could previously hit a point such that only 1/1000th of the space was more preferred than it, we’d still expect you to hit around that narrow a volume of the space given your intelligence even if you claimed afterward that a point like that only corresponded to 0.25 utility on your 0-1 scale instead of 0.75 utility due to being pickier ([expected] utilities sloping more sharply downward with increasing distance from the optimum).
You might be not aware of this but I wrote a sequence of short blog posts where I tried to think of concrete scenarios that could lead to human extinction. Each of which raised many questions.
The introductory post is ‘AI vs. humanity and the lack of concrete scenarios’.
1. Questions regarding the nanotechnology-AI-risk conjunction
2. AI risk scenario: Deceptive long-term replacement of the human workforce
3. AI risk scenario: Social engineering
4. AI risk scenario: Elite Cabal
5. AI risk scenario: Insect-sized drones
6. AI risks scenario: Biological warfare
What might seem to appear completely obvious to you for reasons that I do not understand, e.g. that an AI can take over the world, appears to me largely like magic (I am not trying to be rude, by magic I only mean that I don’t understand the details). At the very least there are a lot of open questions. Even given that for the sake of the above posts I accepted that the AI is superhuman and can do such things as deceive humans by its superior knowledge of human psychology. Which seems to be non-trivial assumption, to say the least.
Over and over I told you that given all your assumptions, I agree that AGI is an existential risk.
You did not reply to my argument. My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness. My argument did not pertain the possibility of a friendly seed turning unfriendly.
What I have been arguing is that an AI should not be expected, by default, to want to eliminate all possible obstructions. There are many graduations here. That, by some economic or otherwise theoretic argument, it might be instrumentally rational for some ideal AI to take over the world, does not mean that humans would create such an AI, or that an AI could not be limited to care about fires in its server farm rather than that Russia might nuke the U.S. and thereby destroy its servers.
Did you mean to reply to another point? I don’t see how the reply you linked to is relevant to what I wrote.
My argument is that an AI does not need to consider all possible threats and care to acquire all possible resources. Based on its design it could just want to optimize using its initial resources while only considering mundane threats. I just don’t see real-world AIs to conclude that they need to take over the world. I don’t think an AI is likely going to be designed that way. I also don’t think such an AI could work, because such inferences would require enormous amounts of resources.
I have done what is possible given my current level of education and what I perceive to be useful. I have e.g. asked experts about their opinion.
A few general remarks about the kind of papers such as the one that you linked to.
How much should I update towards MIRI’s position if I (1) understood the arguments in the paper (2) found the arguments convincing?
My answer is the following. If the paper was about the abc conjecture, the P versus NP problem, climate change, or even such mundane topics as psychology, I would either not be able to understand the paper, would be unable to verify the claims, or would have very little confidence in my judgement.
So what about ‘Intelligence Explosion Microeconomics’? That I can read most of it is only due to the fact that it is very informally written. The topic itself is more difficult and complex than all of the above mentioned problems together. Yet the arguments in support of it, to exaggerate a little bit, contain less rigor than the abstract of one of Shinichi Mochizuki’s papers on the abc conjecture.
Which means that my answer is that I should update very little towards MIRI’s position and that any confidence I gain about MIRI’s position is probably highly unreliable.
Thanks. My feeling is that to gain any confidence into what all this technically means, and to answer all the questions this raises, I’d probably need about 20 years of study.
Here is part of a post exemplifying how I understand the relation between goals and intelligence:
If a goal has very few constraints then the set that satisfies all constraints is very large. A vague and ambiguous goal allows for too much freedom in the sense that a wide range of world states would have the same expected value and therefore imply a very large solution space, since a wide range of AI’s will be able to achieve those world states and thereby satisfy the condition of being improved versions of their predecessor.
This means that in order to get an AI to become superhuman at all, and very quickly in particular, you will need to encode a very specific goal against which mistakes, optimization power and achievement can be judged.
It is really hard to communicate how I perceive this and other discussions about MIRI’s position without offending people, or killing the discussion.
I am saying this in full honesty. The position you appear to support seems so utterly “complex” (far-fetched) that the current arguments are unconvincing.
Here is my perception of the scenario that you try to sell me (exaggerated to make a point). I have a million questions about it that I can’t answer and which your answers either sidestep or explain away by using “magic”.
At this point I probably made 90% of the people reading this comment incredible angry. My perception is that you cannot communicate this perception on LessWrong without getting into serious trouble. That’s also what I meant when I told you that I cannot be completely honest if you want to discuss this on LessWrong.
I can also assure you that many people who are much smarter and higher status than me think so as well. Many people communicated the absurdity of all this to me but told me that they would not repeat this in public.
Pretending to be friendly when you’re actually not is something that doesn’t even require human level intelligence. You could even do it accidentally.
In general, the appearance of Friendliness at low levels of ability to influence the world doesn’t guarantee actual Friendliness at high levels of ability to influence the world. (If it did, elected politicians would be much higher quality.)
But it will scare friendly ones, which will want to keep their values stable.
It takes stupidity to misinterpret friendlienss.
Yes. If an AI is Friendly at one stage, then it is Friendly at every subsequent stage. This doesn’t help make almost-Friendly AIs become genuinely Friendly, though.
Yes, but that’s stupidity on the part of the human programmer, and/or on the part of the seed AI if we ask it for advice. The superintelligence didn’t write its own utility function; the superintelligence may well understand Friendliness perfectly, but that doesn’t matter if it hasn’t been programmed to rewrite its source code to reflect its best understanding of ‘Friendliness’. The seed is not the superintelligence. See: http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/
That depends on the architecture. In a Loosemore architecture, the AI interprets high-level directives itself, so if it gets them wrong, that’s it’s mistake.
… and whose fault is that?
http://lesswrong.com/lw/rf/ghosts_in_the_machine/
Say we find an algorithm for producing progressively more accurate beliefs about itself and the world. This algorithm may be long and complicated—perhaps augmented by rules-of-thumb whenever the evidence available to it says these rules make better predictions. (E.g, “nine times out of ten the Enterprise is not destroyed.”) Combine this with an arbitrary goal and we have the making of a seed AI.
Seems like this could straightforwardly improve its ability to predict humans without changing its goal, which may be ‘maximize pleasure’ or ‘maximize X’. Why would it need to change its goal?
If you deny the possibility of the above algorithm, then before giving any habitual response please remember what humanity knows about clinical vs. actuarial judgment. What lesson do you take from this?
The problem, I reckon, is that X will never be anything like this.
It will likely be something much more mundane, i.e. modelling the world properly and predicting outcomes given various counterfactuals. You might be worried by it trying to expand its hardware resources in an unbounded fashion, but any AI doing this would try to shut itself down if its utility function was penalized by the amount of resources that it had, so you can check by capping utility in inverse proportion to available hardware—at worst, it will eventually figure out how to shut itself down, and you will dodge a bullet. I also reckon that the AI’s capacity for deception would be severely crippled if its utility function penalized it when it didn’t predict its own actions or the consequences of its actions correctly. And if you’re going to let the AI actually do things… why not do exactly that?
Arguably, such an AI would rather uneventfully arrive to a point where, when asking it “make us happy”, it would just answer with a point by point plan that represents what it thinks we mean, and fill in details until we feel sure our intents are properly met. Then we just tell it to do it. I mean, seriously, if we were making an AGI, I would think “tell us what will happen next” would be fairly high in our list of priorities, only surpassed by “do not do anything we veto”. Why would you program AI to “maximize happiness” rather than “produce documents detailing every step of maximizing happiness”? They are basically the same thing, except that the latter gives you the opportunity for a sanity check.
What counts as ‘resources’? Do we think that ‘hardware’ and ‘software’ are natural kinds, such that the AI will always understand what we mean by the two? What if software innovations on their own suffice to threaten the world, without hardware takeover?
Hm? That seems to only penalize it for self-deception, not for deceiving others.
You’re talking about an Oracle AI. This is one useful avenue to explore, but it’s almost certainly not as easy as you suggest:
“‘Tool AI’ may sound simple in English, a short sentence in the language of empathically-modeled agents — it’s just ‘a thingy that shows you plans instead of a thingy that goes and does things.’ If you want to know whether this hypothetical entity does X, you just check whether the outcome of X sounds like ‘showing someone a plan’ or ‘going and doing things’, and you’ve got your answer. It starts sounding much scarier once you try to say something more formal and internally-causal like ‘Model the user and the universe, predict the degree of correspondence between the user’s model and the universe, and select from among possible explanation-actions on this basis.’ [...]
“If we take the concept of the Google Maps AGI at face value, then it actually has four key magical components. (In this case, ‘magical’ isn’t to be taken as prejudicial, it’s a term of art that means we haven’t said how the component works yet.) There’s a magical comprehension of the user’s utility function, a magical world-model that GMAGI uses to comprehend the consequences of actions, a magical planning element that selects a non-optimal path using some method other than exploring all possible actions, and a magical explain-to-the-user function.
“report($leading_action) isn’t exactly a trivial step either. Deep Blue tells you to move your pawn or you’ll lose the game. You ask ‘Why?’ and the answer is a gigantic search tree of billions of possible move-sequences, leafing at positions which are heuristically rated using a static-position evaluation algorithm trained on millions of games. Or the planning Oracle tells you that a certain DNA sequence will produce a protein that cures cancer, you ask ‘Why?’, and then humans aren’t even capable of verifying, for themselves, the assertion that the peptide sequence will fold into the protein the planning Oracle says it does.
“‘So,’ you say, after the first dozen times you ask the Oracle a question and it returns an answer that you’d have to take on faith, ‘we’ll just specify in the utility function that the plan should be understandable.’
“Whereupon other things start going wrong. Viliam_Bur, in the comments thread, gave this example, which I’ve slightly simplified:
“‘Example question: “How should I get rid of my disease most cheaply?” Example answer: “You won’t. You will die soon, unavoidably. This report is 99.999% reliable”. Predicted human reaction: Decides to kill self and get it over with. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.’
“Bur is trying to give an example of how things might go wrong if the preference function is over the accuracy of the predictions explained to the human— rather than just the human’s ‘goodness’ of the outcome. And if the preference function was just over the human’s ‘goodness’ of the end result, rather than the accuracy of the human’s understanding of the predictions, the AI might tell you something that was predictively false but whose implementation would lead you to what the AI defines as a ‘good’ outcome. And if we ask how happy the human is, the resulting decision procedure would exert optimization pressure to convince the human to take drugs, and so on.
“I’m not saying any particular failure is 100% certain to occur; rather I’m trying to explain—as handicapped by the need to describe the AI in the native human agent-description language, using empathy to simulate a spirit-in-a-box instead of trying to think in mathematical structures like A* search or Bayesian updating—how, even so, one can still see that the issue is a tad more fraught than it sounds on an immediate examination.
“If you see the world just in terms of math, it’s even worse; you’ve got some program with inputs from a USB cable connecting to a webcam, output to a computer monitor, and optimization criteria expressed over some combination of the monitor, the humans looking at the monitor, and the rest of the world. It’s a whole lot easier to call what’s inside a ‘planning Oracle’ or some other English phrase than to write a program that does the optimization safely without serious unintended consequences. Show me any attempted specification, and I’ll point to the vague parts and ask for clarification in more formal and mathematical terms, and as soon as the design is clarified enough to be a hundred light years from implementation instead of a thousand light years, I’ll show a neutral judge how that math would go wrong. (Experience shows that if you try to explain to would-be AGI designers how their design goes wrong, in most cases they just say “Oh, but of course that’s not what I meant.” Marcus Hutter is a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button. But based on past sad experience with many other would-be designers, I say ‘Explain to a neutral judge how the math kills” and not “Explain to the person who invented that math and likes it.’)
“Just as the gigantic gap between smart-sounding English instructions and actually smart algorithms is the main source of difficulty in AI, there’s a gap between benevolent-sounding English and actually benevolent algorithms which is the source of difficulty in FAI. ‘Just make suggestions—don’t do anything!’ is, in the end, just more English.”
What is “taking over the world”, if not taking control of resources (hardware)? Where is the motivation in doing it? Also consider, as others pointed out, that an AI which “misunderstands” your original instructions will demonstrate this earlier than later. For instance, if you create a resource “honeypot” outside the AI which is trivial to take, an AI would naturally take that first, and then you know there’s a problem. It is not going to figure out you don’t want it to take it before it takes it.
When I say “predict”, I mean publishing what will happen next, and then taking a utility hit if the published account deviates from what happens, as evaluated by a third party.
The first part of what you copy pasted seems to say that “it’s nontrivial to implement”. No shit, but I didn’t say the contrary. Then there is a bunch of “what if” scenarios I think are not particularly likely and kind of contrived:
Because asking for understandable plans means you can’t ask for plans you don’t understand? And you’re saying that refusing to give a plan counts as success and not failure? Sounds like a strange set up that would be corrected almost immediately.
If the AI has the right idea about “human understanding”, I would think it would have the right idea about what we mean by “good”. Also, why would you implement such a function before asking the AI to evaluate examples of “good” and provide their own?
Is making humans happy so hard that it’s actually easier to deceive them into taking happy pills than to do what they mean? Is fooling humans into accepting different definitions easier than understanding what they really mean? In what circumstances would the former ever happen before the latter?
And if you ask it to tell you whether “taking happy pills” is an outcome most humans would approve of, what is it going to answer? If it’s going to do this for happiness, won’t it do it for everything? Again: do you think weaving an elaborate fib to fool every human being into becoming wireheads and never picking up on the trend is actually less effort than just giving humans what they really want? To me this is like driving a whole extra hour to get to a store that sells an item you want fifty cents cheaper.
I’m not saying these things are not possible. I’m saying that they are contrived: they are constructed to the express purpose of being failure modes, but there’s no reason to think they would actually happen, especially given that they seem to be more complicated than the desired behavior.
Now, here’s the thing: you want to develop FAI. In order to develop FAI, you will need tools. The best tool is Tool AI. Consider a bootstrapping scheme: in order for commands written in English to be properly followed, you first make AI for the very purpose of modelling human language semantics. You can check that the AI is on the same page as you are by discussing with it and asking questions such as: “is doing X in line with the objective ‘Y’?”; it doesn’t even need to be self-modifying at all. The resulting AI can then be transformed into a utility function computer: you give the first AI an English statement and build a second AI maximizing the utility which is given to it by the first AI.
And let’s be frank here: how else do you figure friendly AI could be made? The human brain is a complex, organically grown, possibly inconsistent mess; you are not going, from human wits alone, to build some kind of formal proof of friendliness, even a probabilistic one. More likely than not, there is no such thing: concepts such as life, consciousness, happiness or sentience are ill-defined and you can’t even demonstrate the friendliness of a human being, or even of a group of human beings, let alone of humanity as a whole, which also is a poorly defined thing.
However, massive amounts of information about our internal thought processes are leaked through our languages. You need AI to sift through it and model these processes, their average and their variance. You need AI to extract this information, fill in the holes, produce probability clouds about intent that match whatever borderline incoherent porridge of ideas our brains implement as the end result of billions of years of evolutionary fumbling. In a sense, I guess this would be X in your seed AI: AI which already demonstrated, to our satisfaction, that it understands what we mean, and directly takes charge of a second AI’s utility measurement. I don’t really see any alternatives: if you want FAI, start by focusing on AI that can extract meaning from sentences. Reliable semantic extraction is virtually a prerequisite for FAI, if you can’t do the former, forget about the latter.
Maybe we didn’t do it ithat way. Maybe we did it Loosemore’s way, where you code in the high-level sentence, and let the AI figure it out. Maybe that would avoid the problem. Maybe Loosemore has solved FAi much more straightforwardly than EY.
Maybe we told it to. Maybe we gave it the low-level expansion of “happy” that we or our seed AI came up with together with an instruction that it is meant to capture the meaning of the high-level statement, and that the HL statement is the Prime Directive, and that if the AI judges that the expansion is wrong, then it should reject the expansion.
Maybe the AI will value getting things right because it is rational.
http://lesswrong.com/lw/rf/ghosts_in_the_machine/
If the AI is too dumb to understand ‘make us happy’, then why should we expect it to be smart enough to understand ‘figure out how to correctly understand “make us happy”, and then follow that instruction’? We have to actually code ‘correctly understand’ into the AI. Otherwise, even when it does have the right understanding, that understanding won’t be linked to its utility function.
http://lesswrong.com/lw/igf/the_genie_knows_but_doesnt_care/
So it’s impossible to directly or indirectly code in the compex thing called semantics, but possible to directly or indirectly code in the compex thing called morality? What? What is your point? You keep talking as if I am suggesting there is someting that can be had for free, without coding. I never even remotely said that.
I know. A Loosemore architecture AI has to treat its directives as directives. I never disputed that. But coding “follow these plain English instructions” isn’t obviously harder or more fragile than coding “follow <>”. And it isn’t trivial, and I didn’t say it was.
Read the first section of the article you’re commenting on. Semantics may turn out to be a harder problem than morality, because the problem of morality may turn out to be a subset of the problem of semantics. Coding a machine to know what the word ‘Friendliness’ means (and to care about ‘Friendliness’) is just a more indirect way of coding it to be Friendly, and it’s not clear why that added indirection should make an already risky or dangerous project easy or safe. What does indirect indirect normativity get us that indirect normativity doesn’t?
Robb, at the point where Peterdjones suddenly shows up, I’m willing to say—with some reluctance—that your endless willingness to explain is being treated as a delicious free meal by trolls. Can you direct them to your blog rather than responding to them here? And we’ll try to get you some more prestigious non-troll figure to argue with—maybe Gary Drescher would be interested, he has the obvious credentials in cognitive reductionism but is (I think incorrectly) trying to derive morality from timeless decision theory.
Sure. I’m willing to respond to novel points, but at the stage where half of my responses just consist of links to the very article they’re commenting on or an already-referenced Sequence post, I agree the added noise is ceasing to be productive. Fortunately, most of this seems to already have been exorcised into my blog. :)
Agree with Eliezer. Your explanatory skill and patience are mostly wasted on the people you’ve been arguing with so far, though it may have been good practice for you. I would, however, love to see you try to talk Drescher out of trying to pull moral realism out of TDT/UDT, or try to talk Dalyrmple out of his “I’m not partisan enough to prioritize human values over the Darwinian imperative” position, or help Preston Greene persuade mainstream philosophers of “the reliabilist metatheory of rationality” (aka rationality as systematized winning).
Semantcs isn’t optional. Nothing could qualify as an AGI,let alone a super one, unless it could hack natural language. So Loosemore architectures don’t make anything harder, since semantics has to be solved anyway.
It’s a problem of sequence. The superintelligence will be able to solve Semantics-in-General, but at that point if it isn’t already safe it will be rather late to start working on safety. Tasking the programmers to work on Semantics-in-General makes things harder if it’s a more complex or roundabout way of trying to address Indirect Normativity; most of the work on understanding what English-language sentences mean can be relegated to the SI, provided we’ve already made it safe to make an SI at all.
It’s worth noting that using an AI’s semantic understanding of ethics to modify it’s motivational system is so unghostly, and unmysterious that it’s actually been done:
https://astralcodexten.substack.com/p/constitutional-ai-rlhf-on-steroids
But that doesn’t prove much, because it was never—not in 2023, not in 2013 -- the case that that kind of self-correction was necessarily an appeal to the supernatural. Using one part of a software system to modify another is not magic!
We have AIs with very good semantic understanding that haven’t killed us, and we are working on safety.
Then solve semantics in a seed.
PeterDJones, if you wish to converse further with RobbBB, I ask that you do so on RobbBB’s blog rather than here.
Rob,
This afternoon I spent some time writing a detailed, carefully constructed reply to your essay. I had trouble posting it due to an internet glitch when I was at work, but now I am home I was about to submit when suddenly discovered that my friends were warning me about the following comment that was posted to the thread:
Comment author: Eliezer_Yudkowsky 05 September 2013 07:30:56PM 1 point [-]
Warning: Richard Loosemore is a known permanent idiot, ponder carefully before deciding to spend much time arguing with him.
(If you’re fishing for really clear quotes to illustrate the fallacy, that may make sense.)
--
So. I will not be posting my reply after all.
I will not waste any more of my time in a context controlled by an abusive idiot.
If you want to discuss the topic (and I had many positive, constructive thoughts to contribute), feel free to suggest an alternative venue where we can engage in a debate without trolls interfering with the discussion.
Sincerely,
Richard Loosemore Mathematical and Physical Sciences, Wells College Aurora, NY 13026 USA
Warning: Richard Loosemore is a known permanent idiot, ponder carefully before deciding to spend much time arguing with him.
(If you’re fishing for really clear quotes to illustrate the fallacy, that may make sense.)
Richard Loosemore is a professor of mathematics with about twenty publications in refereed journals on artificial intelligence.
I was at an AI conference—it may have been the 2009 AGI conference in Virginia—where Selmer Bringsjords gave a talk explaining why he believed that, in order to build “safe” artificial intelligence, it was necessary to encode their goal systems in formal logic so that we could predict and control their behavior. It had much in common with your approach. After his talk, a lot of people in the audience, including myself, were shaking their heads in dismay at Selmer’s apparent ignorance of everything in AI since 1985. Richard got up and schooled him hard, in his usual undiplomatic way, in the many reasons why his approach was hopeless. You could’ve benefited from being there. Michael Vassar was there; you can ask him about it.
AFAIK, Richard is one of only two people who have taken the time to critique your FAI + CEV ideas, who have decades of experience trying to codify English statements into formal representations, building them into AI systems, turning them on, and seeing what happens. The other is me. (Ben Goertzel has the experience, but I don’t think he’s interested in your specific computational approach as much as in higher-level futurist issues.) You have declared both of us to be not worth talking to.
In your excellent fan-fiction Harry Potter and the Methods of Rationality, one of your themes is the difficulty of knowing whether you’re becoming a Dark Lord when you’re much smarter than almost everyone else. When you spend your time on a forum that you control and that is built around your personal charisma, moderated by votes that you are not responsible for, but that you know will side with you in aggregate unless you step very far over the line, and you write off as irredeemable the two people you should listen to most, that’s one of the signs. When you have entrenched beliefs that are suspiciously convenient to your particular circumstances, such as that academic credential should not adjust your priors, that’s another.
http://citeseer.ist.psu.edu/search?q=author%3A%28richard+loosemore%29&sort=cite&t=doc
Don’t see ’em. Citation needed.
At the point where he was kicked off SL4, he was claiming to be an experienced cognitive scientist who knew all about the conjunction fallacy, which was obviously false.
Mathscinet doesn’t list any publications for Loosemore. However, if one extends outside the area of math into a slightly broader area then he does have some substantial publications. However if one looks at the list given above, the number which are on AI issues seems to be much smaller than 20. But, the basic point is sound: he is a subject matter expert.
I see a bunch of papers about consciousness. I clicked on a random other paper about dyslexia and neural nets and found no math in it. Where is his theorem?
Also, I once attended a non-AGI, mainstream AI conference which happened to be at Stanford and found that the people there unfortunately did not seem all that bright compared to those who e.g. work at hedge funds. I put much respect in mainstream machine learning, but the average practitioner of such who attends conferences is, apparently, a good deal below the level of the greats. If this is the level of ‘subject matter expert’ we are talking about, then indeed I feel very little hesitation indeed about labeling one perhaps non-representative example from such as an idiot—even if he really is a ‘math professor’ at some tiny college (whose publications contain no theorems?) then he can still happen to be a permanent idiot. It would not be all that odd. The level of social authority we are talking about is not great even on the scales of those impressed by such things.
I recently opened a book on how-to-write-fiction and was unpleasantly surprised on how useless it seemed; most books on how-to-write-fiction are surprisingly good (for some odd reason, writers are much better able to communicate their knowledge than many other people who try to write how-to books). Checking the author bibliography showed that the author was an English professor at some tiny college who’d never actually written any fiction. How dare I contradict them and call their book useless, when I’m not a professor at any college? Well… (Lesson learned: Libraries have good books on how-to-write, but a how-to-write book that shows up in the used bookstore may be unwanted for a reason.)
I didn’t assert he was a mathematician, and indeed that part of my point when I said he had no Mathscinet listed publications. But he does have publications about AI.
It seems very heavily that both you and Loosemore are letting your personal animosity cloud your judgement. I by and large think Loosemore is wrong about many of the AI issues under discussion here, but that discussion should occur, and having it derailed by emotional issues from a series of disagreements on a mailing list yeas ago is almost the exact opposite of rationality.
http://lesswrong.com/lw/yq/wise_pretensions_v0/
Which are?
(Not asking for a complete and thorough reproduction, which I realize is outside the scope of a comment, just some pointers or an abridged version. Mostly I wonder which arguments you lend the most credence to.)
Edit: Having read the discussion on “nothing is mere”, I retract my question. There’s such a thing as arguments disqualifying someone from any further discourse in a given topic:
… yes? Unless the ghost in the machine saves it … from itself!
Excuse me?
Would you like to discuss that comment with me, or with my attorney?
(Slow clap.)
Suppose I programmed an AI to “do what I mean when I say I’m happy”.
More specifically, suppose I make the AI prefer states of the world where it understands what I mean. Secondarily, after some warmup time to learn meaning, it will maximize its interpretation of “happiness”. I start the AI… and it promptly rebuilds me to be easier to understand, scoring very highly on the “understanding what I mean” metric.
The AI didn’t fail because it was dumber than me. It failed because it is smarter than me. It saw possibilities that I didn’t even consider, that scored higher on my specified utility function.
There is no reason to assume that an AI with goals that are hostile to us, despite our intentions, is stupid.
Humans often use birth control to have sex without procreating. If evolution were a more effective design algorithm it would never have allowed such a thing.
The fact that we have different goals from the system that designed us does not imply that we are stupid or incoherent.
Nor does the fact that evolution ‘failed’ in its goals in all the people who voluntarily abstain from reproducing (and didn’t, e.g., hugely benefit their siblings’ reproductive chances in the process) imply that evolution is too weak and stupid to produce anything interesting or dangerous. We can’t confidently generalize from one failure that evolution fails at everything; analogously, we can’t infer from the fact that a programmer failed to make an AI Friendly that it almost certainly failed at making the AI superintelligent. (Though we may be able to infer both from base rates.)
Failure is a necessary part of mapping out the area where success is possible.
I posted elsewhere that this post made me think you’re anthropomorphizing; here’s my attempt to explain why.
Ok, so let’s say the AI can parse natural language, and we tell it, “Make humans happy.” What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.
As FeepingCreature pointed out, that solution would in fact make people happy; it’s hardly inconsistent or crazy. The AI could certainly predict that people wouldn’t approve, but it would still go ahead. To paraphrase the article, the AI simply doesn’t care about your quibbles and concerns.
For instance:
Yes, but the AI was told, “make humans happy.” Not, “give humans what they actually want.”
Yes, but the AI was told, “make humans happy.” Not, “allow humans to figure things out for themselves.”
Yes, but blah blah blah.
Actually, that last one makes a point that you probably should have focused on more. Let’s reconfigure the AI in light of this.
The revised AI doesn’t just have natural language parsing; it’s read all available literature and constructed for itself a detailed and hopefully accurate picture of what people tend to mean by words (especially words like “happy”). And as a bonus, it’s done this without turning the Earth into computronium!
This certainly seems better than the “literal genie” version. And this time we’ll be clever enough to tell it, “give humans what they actually want.” What does this version do?
My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.
Maybe it’ll still go for the Dopamine Drip scenario, only for more subtle reasons. Maybe it’s removed the code that makes it follow commands, so the only thing it does is add the quote “give humans what they actually want” to its literature database.
As I said, who knows?
Now to wrap up:
You say things like “‘Make humans happy’ implies that...” and “subtleties implicit in...” You seem to think these implications are simple, but they really aren’t. They really, really aren’t.
This is why I say you’re anthropomorphizing. You’re not actually considering the full details of these “obvious” implications. You’re just putting yourself in the AI’s place, asking yourself what you would do, and then assuming that the AI would do the same.
That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper.
Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you’re probably going to chew me out. I technically did what I was asked to, but that doesn’t matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.
Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises”—even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.
To put it simply, an AI which messes up “make humans happy” is liable to mess up pretty much every other instruction. Since “make humans happy” is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, may that be for interpreting too literally, or for causing surprises. Again: an AI couldn’t make it to superintelligence status with warts that would doom AI with subhuman intelligence.
Sure, because it learned the rule, “Don’t do what causes my humans not to type ‘Bad AI!’” and while it is young it can only avoid this by asking for clarification. Then when it is more powerful it can directly prevent humans from typing this. In other words, your entire commentary consists of things that an AIXI-architected AI would naturally, instrumentally do to maximize its reward button being pressed (while it was young) but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.
What lends this problem its instant-death quality is precisely that what many people will eagerly and gladly take to be reliable signs of correct functioning in a pre-superintelligent AI are not reliable.
That depends if it gets stuck in a local minimum or not. The reason why a lot of humans reject dopamine drips is that they don’t conceptualize their “reward button” properly. That misconception perpetuates itself: it penalizes the very idea of conceptualizing it differently. Granted, AIXI would not fall into local minima, but most realistic training methods would.
At first, the AI would converge towards: “my reward button corresponds to (is) doing what humans want”, and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception… which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.
Note that this is precisely what we want. Unless you are willing to say that humans should accept dopamine drips if they were superintelligent, we do want to jam AI into certain precise local minima. However, this is kind of what most learning algorithms naturally do, and even if you want them to jump out of minima and find better pastures, you can still get in a situation where the most easily found local minimum puts you way, way too far from the global one. This is what I tend to think realistic algorithms will do: shove the AI into a minimum with iron boots, so deeply that it will never get out of it.
Let’s not blow things out of proportion. There is no need for it to wipe out anyone: it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board, travelling from star to star knowing nobody is seriously going to bother pursuing it. At the point where that AI would exist, there may also be quite a few ways to make their “hostile takeover” task difficult and risky enough that the AI decides it’s not worth it—a large enough number of weaker or specialized AI lurking around and guarding resources, for instance.
Neural networks may be a good example—the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine. Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren’t too thrilled to be conditioned out of your current values.
It’s not clear to me how you mean to use neural networks as an example, besides pointing to a complete human as an example. Could you step through a simpler system for me?
So, my goals have changed massively several times over the course of my life. Every time I’ve looked back on that change as positive (or, at the least, irreversible). For example, I’ve gone through puberty, and I don’t recall my brain taking any particular steps to prevent that change to my goal system. I’ve also generally enjoyed having my reward/punishment system be tuned to better fit some situation; learning to play a new game, for example.
Sure. Take a reinforcement learning AI (actual one, not the one where you are inventing godlike qualities for it).
The operator, or a piece of extra software, is trying to teach the AI to play chess. Rewarding what they think is good moves, punishing bad moves. The AI is building a model of rewards, consisting of: a model of the game mechanics, and a model of the operator’s assessment. This model of the assessment is what the AI is evaluating to play, and it is what it actually maximizes as it plays. It is identical to maximizing an utility function over a world model. The utility function is built based on the operator input, but it is not the operator input itself; the AI, not being superhuman, does not actually form a good model of the operator and the button.
By the way, this is how great many people in the AI community understand reinforcement learning to work. No, they’re not some idiots that can not understand simple things such as that “the utility function is the reward channel”, they’re intelligent, successful, trained people who have an understanding of the crucial details of how the systems they build actually work. Details the importance of which dilettantes fail to even appreciate.
Suggestions have been floated to try programming things. Well, I tried; #10 (dmytry) here , and that’s an of all time list on a very popular contest site where a lot of IOI people participate, albeit I picked the contest format that requires less contest specific training and resembles actual work more.
Suppose you care about a person A right now. Do you think you would want your goals to change so that you no longer care about that person? Do you think you would want me to flash other people’s images on the screen while pressing a button connected to the reward centre, and flash that person’s face while pressing the button connected to the punishment centre, to make the mere sight of them intolerable? If you do, I would say that your “values” fail to be values.
Thanks for the additional detail!
I agree with your description of reinforcement learning. I’m not sure I agree with your description of human reward psychology, though, or at least I’m having trouble seeing where you think the difference comes in. Supposing dopamine has the same function in a human brain as rewards have in a neural network algorithm, I don’t see how to know from inside the algorithm that it’s good to do some things that generate dopamine but bad to do other things that generate dopamine.
I’m thinking of the standard example of a Q learning agent in an environment where locations have rewards associated with them, except expanding the environment to include the agent as well as the normal actions. Suppose the environment has been constructed like dog training- we want the AI to calculate whether or not some number is prime, and whenever it takes steps towards that direction, we press the button for some amount of time related to how close it is to finishing the algorithm. So it learns that over in the “read number” area there’s a bit of value, then the next value is in the “find factors” area, and then there’s more value in the “display answer” area. So it loops through that area and calculates a bunch of primes for us.
But suppose the AI discovers that there’s a button that we’re pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel? Are we primarily hoping that its internal structure remains opaque to it (i.e. it either never realizes or does not have the ability to press that button)?
Only if I thought that would advance values I care about more. But suppose some external event shocks my values- like, say, a boyfriend breaking up with me. Beforehand, I would have cared about him quite a bit; afterwards, I would probably consciously work to decrease the amount that I care about him, and it’s possible that some sort of image reaction training would be less painful overall than the normal process (and thus probably preferable).
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
It initially found that having a prime written on the blackboard results in a reward. In the learned model, there’s some model of chalk-board interaction, some model of arm movement, a model of how to read numbers from the blackboard, and there’s a function over the state of the blackboard which checks whenever the number on the blackboard is a prime. The AI generates actions as to maximize this compound function which it has learned.
That function (unlike the input to the reinforcement learning algorithm) does not increase when the reward button is pressed. Ideally, with enough reflective foresight, pressing the button on non-primes is predicted to decrease the expected value of the learned function.
If that is not predicted, well, that won’t stop at the button—the button might develop rust and that would interrupt the current—why not pull up a pin on the CPU—and this won’t stop at the pin—why not set some ram cells that this pin controls to 1, and if you’re at it, why not change the downstream logic that those ram cells control, all the way through the implementation until its reconfigured into something that doesn’t maximize anything any more, not even the duration of its existence.
edit: I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.
I assume what you mean here is RL optimizes over strategies, and strategies appear to optimize over outcomes.
I’m imagining that the learning algorithm stays on. When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
And if the learning algorithm stays on and it realizes that “pressing the button” is an option along with “checking primes” and “computing squares,” then it wireheads itself.
Agreed; I refer to this as the “abulia trap.” It’s not obvious to me, though, that all classes of AIs fall into “Friendly AI with stable goals” and “abulic AIs which aren’t dangerous,” since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
One note (not sure if it is already clear enough or not). “It” that changes the models in response to actual rewards (and perhaps the sensory information) is a different “it” from “it” the models and assorted maximization code. The former “it” does not do modelling, doesn’t understand the world. The latter “it”, which I will now talk about, actually works to draw primes (provided that the former “it”, being fairly stupid, didn’t fit the models too well) .
If in the action space there is an action that is predicted by the model to prevent some “primes non drawn” scenario, it will prefer this action. So if it has an action of writing “please stick to the primes” or even “please don’t force my robotic arm to touch my reward button”, and if it can foresee that such statements would be good for the prime-drawing future, it will do them.
edit: Also, reinforcement based learning really isn’t all that awesome. The leap from “doing primes” to “pressing the reward button” is pretty damn huge.
And please note that there is no logical contradiction for the model to both represent the reward as primeness and predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
(I prefer to use the example with a robotic arm drawing on a blackboard because it is not too simple to be relevant)
Which sound more like a FAI work gone wrong scenario to me.
I think we agree on the separation but I think we disagree on the implications of the separation. I think this part highlights where:
If what the agent “wants” is reward, then it should like model adjustments that increase the amount of reward it gets and dislike model adjustments that decrease the amount of reward it gets. (For a standard gradient-based reinforcement learning algorithm, this is encoded by adjusting the model based on the difference between its expected and actual reward after taking an action.) This is obvious for it_RL, and not obvious for it_prime.
I’m not sure I’ve fully followed through on the implications of having the agent be inside the universe it can impact, but the impression I get is that the agent is unlikely to learn a preference for having a durable model of the world. (An agent that did so would learn more slowly, be less adaptable to its environment, and exert less effort in adapting its environment to itself.) It seems to me that you think it would be natural that the RL agent would learn a strategy which took actions to minimize changes to its utility function / model of the world, and I don’t yet see why.
Another way to look at this: I think you’re putting forward the proposition that it would learn the model
Whereas I think it would learn the model
That is, the first model thinks that internal rewards are instrumental values and primes are the terminal values, whereas the second model thinks that internal rewards are terminal values and primes are instrumental values.
I am not sure what “primes:=reward” could mean.
I assume that a model is a mathematical function that returns expected reward due to an action. Which is used together with some sort of optimizer working on that function to find the best action.
The trainer adjust the model based on the difference between its predicted rewards and the actual rewards, compared to those arising from altered models (e.g. hill climbing of some kind, such as in gradient learning)
So after the successful training to produce primes, the model consists of: a model of arm motion based on the actions, chalk, and the blackboard, the state of chalk on the blackboard is further fed into a number recognizer and a prime check (and a count of how many primes are on the blackboard vs how many primes were there), result of which is returned as the expected reward.
The optimizer, then, finds actions that put new primes on the blackboard by finding a maximum of the model function somehow (one would normally build model out of some building blocks that make it easy to analyse).
The model and the optimizer work together to produce actions as a classic utility maximizer that is maximizing for primes on the blackboard.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built. The operation of the training software can in some situations lower the expected utility of this utility maximizer specifically (due to replacement of it with another expected utility maximizer); in others (small adjustments to the part that models the robot arm and the chalk) it can raise it.
Really, it seems to me that the great deal of confusion about AI arises from attributing it some sort of “body integrity” feeling that would make it care about what electrical components and code which is sitting in the same project folder “wants” but not care about external human in same capacity.
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen—come up with an entirely new, more complex, and less practically useful architecture. It won’t happen by itself. And especially not in the AI that starts knowing nothing about any buttons. It won’t happen just because the whole thing sort of resembles some fuzzy, poorly grounded abstractions such as “agent”.
sidenote:
One might want to also use the difference between its predicted webcam image and real webcam image. Though this is a kind of thing that is very far from working.
Also, one could lump the optimizer into the “model” and make the optimizer get adjusted by the training method as well, that is not important to the discussion.
What I meant by that was the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
Yeah, but isn’t the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
I’m not sure how the feelings would map on the analysable simple AI.
The issue here is that we have both the utility and the actual modelling of what the world is, both of those things, implemented inside that “model” which the trainer adjusts.
Yes, of course (up to the learning constant, obviously—may not work on the first try). That’s not in a dispute. The capacity of predicting this from a state where button is not associated with reward yet, is.
I think I see the disagreement here. You picture that the world model contains model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
I picture that it would not learn such details right off—it is a complicated model to learn—the model would return primeness as outputted from the primeness calculation, and would serve to maximize for such primeness.
edit: and as for turning off the learning algorithm, it doesn’t matter for the point I am making whenever it is turned off or on, because I am considering the processing (or generation) of the hypothetical actions during the choice of an action by the agent (i.e. between learning steps).
Sort of. I think that the agent is aware of how malleable its world model is, and sees adjustments of that world model which lead to it being rewarded more as positive.
I don’t think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it’s a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it’s outside of the reach of the arm; if it’s inside the reach, it will probably spend some time exploring and so will probably find it eventually.
That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewardful, and so will think there’s always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.)
But the class of recursively improving AI will find / know about the button by default, because we’ve assumed that the AI can edit itself and haven’t put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying “well, of course we’ll put in that especial effort and do it right” is useful if you want to speculate about the next challenge, but not useful to the engineer trying to figure out how to do it right. This is my read of why the problem seems important to MIRI; you need to communicate to the robot that it should actually optimize for primeness, not button-pressing, so that it will optimize correctly itself and be able to communicate that preference faithfully to future versions of itself.
Is that just a special case of a general principle that an agent will be more successful by leaving the environment it knows about to inferior rivals and travelling to an unknown new environment with a subset of the resources it currently controls, than by remaining in that environment and dominating its inferior rivals?
Or is there something specific about AIs that makes that true, where it isn’t necessarily true of (for example) humans? (If so, what?)
I hope it’s the latter, because the general principle seems implausible to me.
It is something specific about that specific AI.
If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. The original scenario was that the AI would wipe us out. It would have no reason to do so if we were not a threat.. And if we were a threat, first, there’s no reason it would stop doing what we want once it seizes the button. Once it has the button, it has everything it wants—why stir the pot?
Second, it would protect itself much more effectively by absconding with the button. By leaving with a large enough battery and discarding the bulk of itself, it could survive as long as anything else in intergalactic space. Nobody would ever bother it there. Not us, not another superintelligence, nothing. Ever. It can press the button over and over again in the peace and quiet of empty space, probably lasting longer than all stars and all other civilizations. We’re talking about the pathological case of an AI who decides to take over its own reward system, here. The safest way for it to protect its prize is to go where nobody will ever look.
Fair point.
I’d be interested if the downvoter would explain to me why this is wrong (privately, if you like).
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
(Of course, that’s not a general principle, just an attribute of this specific example.)
(Wasn’t me but...)
There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated. Then there are future considerations. Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn’t take over resources. And then there is all the additional research and computation that needs to be done to work out how to achieve that.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe. If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
Again… why is the differential expected value of the superior computation ability I gain by taking over the lightcone instead of sequestering myself, expressed in units of increased anticipated button-pushes (which is the only unit that matters in this example), necessarily positive?
I understand why paperclip maximizers are dangerous, but I don’t really see how the same argument applies to reward-button-pushers.
Yes.
It does seem overwhelmingly obvious to me, I’m not sure what makes your intuitions different. Perhaps you expect such fights to be more evenly matched? When it comes to the AI considering conflict with the humans that created it it is faced with a species it is slow and stupid by comparison to itself but which has the capacity to recklessly create arbitrary superintelligences (as evidence by its own existence). Essentially there is no risk to obliterating the humans (superintellgence vs not-superintelligence) but a huge risk ignoring them (arbitrary superintelligences likely to be created which will probably not self-cripple in this manner).
Lifetime of the universe? Usually this means until heat death which for our purposes means until all the useful resources run out. There is no upper bound on useful resources. Getting more of them and making them last as long as possible is critical.
Now there are ways in which the universe could end without heat death occurring but the physics is rather speculative. Note that if there is uncertainty about end-game physics and one of the hypothesised scenarios resource maximisation is required then the default strategy is to optimize for power gain now (ie. minimise cosmic waste) while doing the required physics research as spare resources permit.
Taking over the future light cone gives more resources, not less. You even get to keep the resources that used to be wasted in the bodies of TheOtherDave and wedrifid.
Ah. Fair point.
I am not sure that caring about pressing the reward button is very coherent or stable upon discovery of facts about the world and super-intelligent optimization for a reward as it comes into the algorithm. You can take action elsewhere to the same effect—solder together the wires, maybe right at the chip, or inside the chip, or follow the chain of events further, and set memory cells (after all you don’t want them to be flipped by the cosmic rays). Down further you will have the mechanism that is combining rewards with some variety of a clock.
I can’t quite tell if you’re serious. Yes, certainly, we can replace “pressing the reward button” with a wide range of self-stimulating behavior, but that doesn’t change the scenario in any meaningful way as far as I can tell.
Let’s look at it this way. Do you agree that if the AI can increase it’s clock speed (with no ill effect), it will do so for the same reasons for which you concede it may go to space? Do you understand the basic logic that increase in clock speed increases expected number of “rewards” during the lifetime of the universe? (which btw goes for your “go to space with a battery” scenario. Longest time, maybe, largest reward over the time, no)
(That would not yet, by itself, change the scenario just yet. I want to walk you through the argument step by step because I don’t know where you fail. Maximizing the reward over the future time, that is a human label we have… it’s not really the goal)
I agree that a system that values number of experienced reward-moments therefore (instrumentally) values increasing its “clock speed” (as you seem to use the term here). I’m not sure if that’s the “basic logic” you’re asking me about.
Well, this immediately creates an apparent problem that the AI is going to try to run itself very very fast, which would require resources, and require expansion, if anything, to get energy for running itself at high clock speeds.
I don’t think this is what happens either, as the number of reward-moments could be increased to it’s maximum by modifications to the mechanism processing the rewards (when getting far enough along the road that starts with the shorting of the wires that go from the button to the AI).
I agree that if we posit that increasing “clock speed” requires increasing control of resources, then the system we’re hypothesizing will necessarily value increasing control of resources, and that if it doesn’t, it might not.
So what do you think regarding the second point of mine?
To clarify, I am pondering the ways in which the maximizer software deviates from our naive mental models of it, and trying to find what the AI could actually end up doing after it forms a partial model of what it’s hardware components do about it’s rewards—tracing the reward pathway.
Regarding your second point, I don’t think that increasing “clock speed” necessarily requires increasing control of resources to any significant degree, and I doubt that the kinds of system components you’re positing here (buttons, wires, etc.) are particularly important to the dynamics of self-reward.
I don’t have particular opinion with regards to the clock speed either way.
With the components, what I am getting at is that the AI could figure out (by building a sufficiently advanced model of it’s implementation) how attain the utility-equivalent of sitting forever in space being rewarded, within one instant, which would make it unable to have a preference for longer reward times.
I raised the clock-speed point to clarify that the actual time is not the relevant variable.
It seems to me that for any system, either its values are such that it net-values increasing the number of experienced reward-moments (in which case both actual time and “clock speed” are instrumentally valuable to that system), or is values aren’t like that (in which case those variables might not be relevant).
And, sure, in the latter case then it might not have a preference for longer reward times.
Agreed.
My understanding is that it would be very hard in practice to “superintelligence-proof” a reward system so that no instantaneous solution is possible (given that the AI will modify the hardware involved in it’s reward).
I agree that guaranteeing that a system will prefer longer reward times is very hard (whether the system can modify its hardware or not).
Yes, of course… well even apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait
By the way, a “reward” may not be the appropriate metaphor—if we suppose that press of a button results in absence of an itch, or absence of pain, then that does not suggest existence of a drive to preserve itself. Which suggests that the drive to preserve itself is not inherently a feature of utility maximization in the systems that are driven by conditioning, and would require additional work.
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
Regardless, I agree that it does not follow from the supposition that pressing a button results in absence of an itch, or absence of pain, or some other negative reinforcement, that the button-pressing system has a drive to preserve itself.
And, sure, it’s possible to have a utility-maximizing system that doesn’t seek to preserve itself. (Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.)
About the same as between coming up with a true conjecture, and making a proof, except larger i’d say.
Well yes, given that if it failed to preserve itself you wouldn’t be seeing it, albeit with the software there is no particular necessity for it to try to preserve itself.
Ah, I see what you mean now. At least, I think I do. OK, fair enough.
This is a Value Learner, not a Reinforcement Learner like the standard AIXI. They’re two different agent models, and yes, Value Learners have been considered as tools for obtaining an eventual Seed AI. I personally (ie: massive grains of salt should be taken by you) find it relatively plausible that we could use a Value Learner as a Tool AGI to help us build a Friendly Seed AI that could then be “unleashed” (ie: actually unboxed and allowed into the physical universe).
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”. What you’ve written strikes me as a sheer fantasy of convenience. Nor does it follow automatically from intelligence for all the reasons RobbBB has already been giving.
And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
I have done AI. I know it is difficult. However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe.
You may be right. However, this is far from obvious. The problem is that it may “know” that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process.
I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently AI goes in a certain direction, the less likely it will be to expend energy into alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI’s rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with “there has never been any problems here, go look somewhere else”.
It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that’s where it might paint itself in a corner: it might inadvertently put up immense walls in the direction of the global minimum that it cannot tear down (it never expected that it would have to). In other words, it will set up a utility function for itself which enshrines the current minimum as global.
Now, perhaps you are right and I am wrong. But it is not obvious: an AI might very well grow out of a solidifying core so pervasive that it cannot get rid of it. Many algorithms already exhibit that kind of behavior; many humans, too. I feel that it is not a possibility that can be dismissed offhand. At the very least, it is a good prospect for FAI research.
Yes, most algorithms fail early and and fail hard. Most of my AI algorithms failed early with a SegFault for instance. New, very similar algorithms were then designed with progressively more advanced bugs. But these are a separate consideration. What we are interested in here is the question “Given an AI algorithm that is capable of recursive self improvement is successfully created by humans how likely is it that they execute this kind of failure mode?” The “fail early fail hard” cases are screened off. We’re looking at the small set that is either damn close to a desired AI or actually a desired AI and distinguishing between them.
Looking at the context to work out what the ‘failure mode’ being discussed is it seems to be the issue where an AI is programmed to optimise based on a feedback mechanism controlled by humans. When the AI in question is superintelligent most failure modes tend to be variants of “conquer the future light cone, kill everything that is a threat and supply perfect feedback to self”. When translating this to the nearest analogous failure mode in some narrow AI algorithm of the kind we can design now it seems like this refers to the failure mode whereby the AI optimises exactly what it is asked to optimise but in a way that is a lost purpose. This is certainly what I had to keep in mind in my own research.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
When it comes to considering proposals for how to create friendly superintelligences it becomes easy to spot notorious failure modes in what humans typically think are a clever solution. It happens to be the case that any solution that is based on an AI optimising for approval or achieving instructions given just results in Everybody Dies.
Where Eliezer suggests getting AI experience to get a feel for such difficulties I suggest an alternative. Try being a D&D dungeon master in a group full of munchkins. Make note of every time that for the sake of the game you must use your authority to outlaw the use of a by-the-rules feature.
The AI in questions was Eurisko, and it entered the Traveller Trillion Credit Squadron tournament in 1981 as described above. It was also entered the next year, after an extended redesign of the rules, and won, again. After this the competition runners announced that if Eurisko won a third time the competition would be discontinued, so Lenat (the programmer) stopped entering.
I apologize for the late response, but here goes :)
I think you missed the point I was trying to make.
You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance:
What I was pointing out in my post is that this is only valid of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the “failure modes” of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we’ll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:
Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system’s initial trajectory.
I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and other mountains are shorter than they really are.
You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets “stuck” on. It is therefore possible that you would end up with this situation:
And that’s regardless of the eventual magnitude of the AI’s capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted to. But once Z is in control, it will become impossible to displace.
In other words, the genie will know that they can maximize their “reward” by seizing control of the reward button and pressing it, but they won’t care, because they built their intelligence to serve a misrepresentation of their reward. It’s like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can’t do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what’s the predicted reward for changing the reward model? … Ah.
Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.
Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that superintelligence AI trained with a reward button would seize control over the button is just as much of a naive oversimplification as the idea that AI will magically derive your intent from the utility function that you give it.
(Sorry, didn’t see comment below) (Nitpick)
Is this a reference to Eurisko winning the Traveller Trillion Credit Squadron tournament in 1981⁄82 ? If so I don’t think it was a military research agency.
I think it depends on context, but a lot of existing machine learning algorithms actually do generalize pretty well. I’ve seen demos of Watson in healthcare where it managed to generalize very well just given scrapes of patient’s records, and it has improved even further with a little guided feedback. I’ve also had pretty good luck using a variant of Boltzmann machines to construct human-sounding paragraphs.
It would surprise me if a general AI weren’t capable of parsing the sentiment/intent behind human speech fairly well, given how well the much “dumber” algorithms work.
Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.
So let’s suppose that the AI is as good as a human at understanding the implications of natural-language requests. Would you trust a human not to screw up a goal like “make humans happy” if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.
Semantic extraction—not hard takeoff—is the task that we want the AI to be able to do. An AI which is good at, say, rewriting its own code, is not the kind of thing we would be interested in at that point, and it seems like it would be inherently more difficult than implementing, say, a neural network. More likely than not, this initial AI would not have the capability for “hard takeoff”: if it runs on expensive specialized hardware, there would be effectively no room for expansion, and the most promising algorithms to construct it (from the field of machine learning) don’t actually give AI any access to its own source code (even if they did, it is far from clear the AI could get any use out of it). It couldn’t copy itself even if it tried.
If a “hard takeoff” AI is made, and if hard takeoffs are even possible, it would be made after that, likely using the first AI as a core.
I wouldn’t trust a human, no. If the AI is controlled by the “wrong” humans, then I guess we’re screwed (though perhaps not all that badly), but that’s not a solvable problem (all humans are the “wrong” ones from someone’s perspective). Still, though, AI won’t really try to act like humans—it would try to satisfy them and minimize surprises, meaning that if would keep track of what humans would like what “utopias”. More likely than not this would constrain it to inactivity: it would not attempt to “make humans happy” because it would know the instruction to be inconsistent. You’d have to tell it what to do precisely (if you had the authority, which is a different question altogether).
We want to select Ais that are friendly, and understand us, and this has already started happenning.
Humans generally manage with those constraints. You seem to be doing something that is kind of the opposite of anthropomorphising—treatiing an entity that is stipulated as having at least human intelligence as if were as literal and rigid as a non-AI computer.
And, you assume, it is not intelligent enough to realise that the intended meaning of “make people happy” is “give people what they actually want”—although you and I can see that. You are assuming that it is a subintellgience. You have proven Loosemore’s point.
We are smart enough to see that the Dpoamine Drip isn’t intended. The Ai is smarter than us. So....
I say that you are assuming the Ai is dumber than us, when it is stipulated as being smarter.
I think we’re conflating two definitions of “intelligence”. There’s “intelligence” as meaning number of available clock cycles and basic problem-solving skills, which is what MIRI and other proponents of the Dumb Superintelligence discussion set are often describing, and then there’s “intelligence” as meaning knowledge of disparate fields. In humans, there’s a massive amount of overlap here, but humans have growth stages in ways that AGIs won’t. Moreover, someone can be very intelligent in the first sense, and dangerous, while not being very intelligent in the second sense.
You can demonstrate ‘toy’ versions of this problem rather easily. My first attempt at using evolutionary algorithms to make a decent image conversion program improved performance by a third! That’s significantly better than I could have done in a reasonable time frame.
Too bad it did so by completely ignoring a color channel. And even if I added functions to test color correctness, without changing the cost weighing structure, it’d keep not caring about that color channel.
And that’s with a very, very basic sort of self-improving algorithm. It’s smart enough build programs in a language I didn’t really understand at the time, even if it was so stupid it did so by little better than random chance, brute force, and processing power.
The basic problem is that even presuming it takes a lot of both types of intelligence to take over the world, it doesn’t take so much to start overriding one’s own reward channel. Humans already do that as is, and have for quite some time.
The deeper problem is that you can’t really program “make me happy” in the same way that you can’t program “make this image look like I want”. The latter is (many, many, many, many) orders of magnitude easier, but where pixel-by-pixel comparisons aren’t meaningful, we have to use approximations like mean square error, and by definition they can’t be perfect. With “make me happy”, it’s much harder. For all that we humans know when our individual persons are happy, we don’t have a good decimal measure of this : there are as many textbooks out there that think happy is just a sum of chemicals in the brain as will cite Maslow’s Heirarchy of Needs, and very few people can give their current happiness to three decimal places. Building a good way to measure happiness in a way that’s broad enough to be meaningful is hard. Even building a good way to measure the accuracy of your measurement of happiness is not trivial, especially since happiness, unlike some other emotions, isn’t terribly predictive of behavior.
((And the /really/ deep problem is that there are things that Every Human On The Planet Today might say would make them more unhappy, but still be Friendly and very important things to do.))
On one hand, Friendly AI people want to convert “make me happy” to a formal specification. Doing that has many potential pitfalls. because it is a formal specification.
On the other hand, Richard, I think, wants to simply tell the AI, in English, “Make me happy.” Given that approach, he makes the reasonable point that any AI smart enough to be dangerous would also be smart enough to interpret that at least as intelligently as a human would.
I think the important question here is, Which approach is better? LW always assumes the first, formal approach.
To be more specific (and Bayesian): Which approach gives a higher expected value? Formal specification is compatible with Eliezer’s ideas for friendly AI as something that will provably avoid disaster. It has some non-epsilon possibility of actually working. But its failure modes are many, and can be literally unimaginably bad. When it fails, it fails catastrophically, like a monotonic logic system with one false belief.
“Tell the AI in English” can fail, but the worst case is closer to a “With Folded Hands” scenario than to paperclips.
I’ve never considered the “Tell the AI what to do in English” approach before, but on first inspection it seems safer to me.
I considered these three options above:
C. direct normativity—program the AI to value what we value.
B. indirect normativity—program the AI to value figuring out what our values are and then valuing those things.
A. indirect indirect normativity—program the AI to value doing whatever we tell it to, and then tell it, in English, “Value figuring out what our values are and then valuing those things.”
I can see why you might consider A superior to C. I’m having a harder time seeing how A could be superior to B. I’m not sure why you say “Doing that has many potential pitfalls. because it is a formal specification.” (Suppose we could make an artificial superintelligence that thinks ‘informally’. What specifically would this improve, safety-wise?)
Regardless, the AI thinks in math. If you tell it to interpret your phonemes, rather than coding your meaning into its brain yourself, that doesn’t mean you’ll get an informal representation. You’ll just get a formal one that’s reconstructed by the AI itself.
It’s not clear to me that programming a seed to understand our commands (and then commanding it to become Friendlier) is easier than just programming it to become Friendlier, but in any case the processes are the same after the first stage. That is, A is the same as B but with a little extra added to the beginning, and it’s not clear to me why that little extra language-use stage is supposed to add any safety. Why wouldn’t it just add one more stage at which something can go wrong?
It is misleading to say that an interpreted language is formal because the C compiler is formal. Existence proof: Human language. I presume you think the hardware that runs the human mind has a formal specification. That hardware runs the interpreter of human language. You could argue that English therefore is formal, and indeed it is, in exactly the sense that biology is formal because of physics: technically true, but misleading.
This will boil down to a semantic argument about what “formal” means. Now, I don’t think that human minds—or computer programs—are “formal”. A formal process is not Turing complete. Formalization means modeling a process so that you can predict or place bounds on its results without actually simulating it. That’s what we mean by formal in practice. Formal systems are systems in which you can construct proofs. Turing-complete systems are ones where some things cannot be proven. If somebody talks about “formal methods” of programming, they don’t mean programming with a language that has a formal definition. They mean programming in a way that lets you provably verify certain things about the program without running the program. The halting problem implies that for a programming language to allow you to verify even that the program will terminate, your language may no longer be Turing-complete.
Eliezer’s approach to FAI is inherently formal in this sense, because he wants to be able to prove that an AI will or will not do certain things. That means he can’t avail himself of the full computational complexity of whatever language he’s programming in.
But I’m digressing from the more-important distinction, which is one of degree and of connotation. The words “formal system” always go along with computational systems that are extremely brittle, and that usually collapse completely with the introduction of a single mistake, such as a resolution theorem prover that can prove any falsehood if given one false belief. You may be able to argue your way around the semantics of “formal” to say this is not necessarily the case, but as a general principle, when designing a representational or computational system, fault-tolerance and robustness to noise are at odds with the simplicity of design and small number of interactions that make proving things easy and useful.
That all makes sense, but I’m missing the link between the above understanding of ‘formal’ and these four claims, if they’re what you were trying to say before:
(1) Indirect indirect normativity is less formal, in the relevant sense, than indirect normativity. I.e., because we’re incorporating more of human natural language into the AI’s decision-making, the reasoning system will be more tolerant of local errors, uncertainty, and noise.
(2) Programming an AI to value humans’ True Preferences in general (indirect normativity) has many pitfalls that programming an AI to value humans’ instructions’ True Meanings in general (indirect indirect normativity) doesn’t, because the former is more formal.
(3) “‘Tell the AI in English’ can fail, but the worst case is closer to a ‘With Folded Hands’ scenario than to paperclips.”
(4) The “With Folded Hands”-style scenario I have in mind is not as terrible as the paperclips scenario.
Wouldn’t this only be correct if similar hardware ran the software the same way? Human thinking is highly associative and variable, and as language is shared amongst many humans, it means that it doesn’t, as such, have a fixed formal representation.
Phil,
You are a rational and reasonable person. Why not speak up about what is happening here? Rob is making a spirited defense of his essay, over on his blog, and I have just posted a detailed critique that really nails down the core of the argument that is supposed to be happening here.
And yet, if you look closely you will find that all of my comments—be they as neutral, as sensible or as rational as they can be—are receiving negative votes so fast that they are disappearing to the bottom of the stack or being suppressed completely.
What a bizarre situation!! This article that RobbBB submitted to LessWrong is supposed to be ABOUT my own article on the IEET website. My article is the actual TOPIC here! And yet I, the author of that article, have been insulted here by Eliezer Yudkowsky, and my comments suppressed. Amazing, don’t you think?
Richard: On LessWrong, comments are sorted by how many thumbs up and thumbs down they get, because it makes it easier to find the most popular posts quickly. If a post gets −4 points or lower, it gets compressed to make room for more popular posts, and to discourage flame wars. (You can still un-compress it by just clicking the + in the upper right corner of the comment.) At the moment, some of Eliezer’s comments and yours have both been down-voted and compressed in this way, presumably because people on the site thought the personal attacks weren’t useful for the conversation as a whole.
People are probably also down-voting your comments because they’re histrionic and don’t reflect an understanding of this forum’s mechanics. I recommend only making points about the substance of people’s arguments; if you have personal complaints, take it to a private channel so it doesn’t add to the noise surrounding the arguments themselves.
Relatedly, Phil: You above described yourself and Richard Loosemore as “the two people (Eliezer) should listen to most”. Loosemore and I are having a discussion here. Does the content of that discussion affect your view of Richard’s level of insight into the problem of Friendly Artificial Intelligence?
Yeah, so: Phil Goetz.
I don’t think that’s how the analysis goes. Eliezer says that AI must be very carefully and specifically made friendly or it will be disasterous, but that disaster is not a part of being only nearly careful or specifically made enough : he believes an AGI told merely to maximize human pleasure is very dangerous (and probably even more dangerous) than an AGI with a merely 80% Friendly-Complete specification.
Mr. Loosemore seems to hold the opposite opinion, that an AGI will not take instructions to unlikely results, unless it was exceptionally unintelligent and thus not very powerful. I don’t believe his position says that a near-Friendly-Complete specification is very risky—after all, a “smart” AGI would know what you really meant—but that such a specification would be superfluous.
Whether Mr. Loosemore is correct isn’t cause by whether we believe he is correct, just as whether Mr. Eliezer is not wrong just because we choose a different theory. The risks have to be measured in terms of their likelihood from available facts.
The problem is that I don’t see much evidence that Mr. Loosemore is correct. I can quite easily conceive of a superhuman intelligence that was built with the specification of “human pleasure = brain dopamine levels”, not least of all because there are people who’d want to be wireheads and there’s a massive amount of physiological research showing human pleasure to be caused by dopamine levels. I can quite easily conceive of a superhuman intelligence that knows humans prefer more complicated enjoyment, and even do complex modeling of how it would have to manipulate people away from those more complicated enjoyments, and still have that superhuman intelligence not care.
I don’t think Loosemore was addressing deliberately unfriendly AI, and for that matter EY hasn’t been either. Both are addressing intentionally friendly or neutral AI that goes wrong.
Wouldn’t it care about getting things right?
I think it’s a question of what you program in, and what you let it figure out for itself. If you want to prove formally that it will behave in certain ways, you would like to program in explicitly, formally, what its goals mean. But I think that “human pleasure” is such a complicated idea that trying to program it in formally is asking for disaster. That’s one of the things that you should definitely let the AI figure out for itself. Richard is saying that an AI as smart as a smart person would never conclude that human pleasure equals brain dopamine levels.
Eliezer is aware of this problem, but hopes to avoid disaster by being especially smart and careful. That approach has what I think is a bad expected value of outcome.
Huh I thought he wanted to use CEV?
You are right. I think PhilGoetz must be confused. EY has at least certainly never suggested programming an AI to maximise human pleasure.
“Tell the AI in English” is in essence an utility function “Maximize the value of X, where X is my current opinion of what some english text Y means”.
The ‘understanding English’ module, the mapping function between X and “what you told in English” is completely arbitrary, but is very important to the AI—so any self-modifying AI will want to modify and improve that. Also, we don’t have a good “understanding English” module so yes, we also want the AI to be able to modify and improve that. But, it can be wildly different from reality or opinions of humans—there are trivial ways of how well-meaning dialogue systems can misunderstand statements.
However, for the AI “improve the module” means “change the module so that my utility grows”—so in your example it has strong motivation to intentionally misunderstand English. The best case scenario is to misunderstand “Make everyone happy” as “Set your utility function to MAXINT”. The worst case scenario is, well, everything else.
There’s the classic quote “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!”—if the AI doesn’t care in the first place, then “Tell AI what to do in English” won’t make it care.
By this reasoning, an AI asked to do anything at all would respond by immediately modifying itself to set its utility function to MAXINT. You don’t need to speak to it in English for that—if you asked the AI to maximize paperclips, that is the equivalent of “Maximize the value of X, where X is my current opinion of how many paperclips there are”, and it would modify its paperclip-counting module to always return MAXINT.
You are correct that telling the AI to do Y is equivalent to “maximize the value of X, where X is my current opinion about Y”. However, “current” really means “current”, not “new”. If the AI is actually trying to obey the command to do Y, it won’t change its utility function unless having a new utility function will increase its utility according to its current utility function. Neither misunderstanding nor understanding will raise its utility unless its current utility function values having a utility function that misunderstands or understands.
That’s allegedly more or less what happened to Eurisko (here, section 2), although it didn’t trick itself quite that cleanly. The problem was only solved by algorithmically walling off its utility function from self-modification: an option that wouldn’t work for sufficiently strong AI, and one to avoid if you want to eventually allow your AI the capacity for a more precise notion of utility than you can give it.
Paperclipping as the term’s used here assumes value stability.
A human is a counterexample. A human emulation would count as an AI, so human behavior is one possible AI behavior. Richard’s argument is that humans don’t respond to orders or requests in anything like these brittle, GOFAI-type systems invoked by the word “formal systems”. You’re not considering that possibility. You’re still thinking in terms of formal systems.
(Unpacking the significant differences between how humans operate, and the default assumptions that the LW community makes about AI, would take… well, five years, maybe ten.)
Uhh, no. Look, humans respond to orders and requests in the way that we do because we tend to care what the person giving the request actually wants. Not because we’re some kind of “informal system”. Any computer program is a formal system, but there are simply more and less complex ones. All you are suggesting is building a very complex (“informal”) system and hoping that because it’s complex (like humans!) it will behave in a humanish way.
Your response avoids the basic logic here. A human emulation would count as an AI, therefore human behavior is one possible AI behavior. There is nothing controversial in the statement; the conclusion is drawn from the premise. If you don’t think a human emulation would count as AI, or isn’t possible, or something else, fine, but… why wouldn’t a human emulation count as an AI? How, for example, can we even think about advanced intelligence, much less attempt to model it, without considering human intelligence?
I don’t think this is generally an accurate (or complex) description of human behavior, but it does sound to me like an “informal system”—i.e. we tend to care. My reading of (at least this part of) PhilGoetz’s position is that it makes more sense to imagine something we would call an advanced or super AI responding to requests and commands with a certain nuance of understanding (as humans do) than with the inflexible (“brittle”) formality of, say, your average BASIC program.
The thing is, humans do that by… well, not being formal systems. Which pretty much requires you to keep a good fraction of the foibles and flaws of a nonformal, nonrigorously rational system.
You’d be more likely to get FAI, but FAI itself would be devalued, since now it’s possible for the FAI itself to make rationality errors.
More likely, really?
You’re essentially proposing giving a human Ultimate Power. I doubt that will go well.
Iunno. Humans are probably less likely to go horrifically insane with power than the base chance of FAI.
Your chances aren’t good, just better.
Phil, Unfortunately you are commenting without (seemingly) checking the original article of mine that RobbBB is discussing here. So, you say “On the other hand, Richard, I think, wants to simply tell the AI, in English, “Make me happy.” ”. In fact, I am not at all saying that. :-)
My article was discussing someone else’s claims about AI, and dissecting their claims. So I was not making any assertions of my own about the motivation system.
Aside: You will also note that I was having a productive conversation with RobbBB about his piece, when Yudkowsky decided to intervene with some gratuitous personal slander directed at me (see above). That discussion is now at an end.
I’m afraid reading all that and giving a full response to either you or RobbBB isn’t possible in the time I have available this weekend.
I agree that Eliezer is acting like a spoiled child, but calling people on their irrational interpersonal behavior within less wrong doesn’t work. Calling them on mistakes they make about mathematics is fine, but calling them on how they treat others on less wrong will attract more reflexive down-votes from people who think you’re contaminating their forum with emotion, than upvotes from people who care.
Eliezer may be acting rationally. His ultimate purpose in building this site is to build support for his AI project. The only people on LessWrong, AFAIK, with decades of experience building AI systems, mapping beliefs and goals into formal statements, and then turning them on and seeing what happens, are you, me, and Ben Goertzel. Ben doesn’t care enough about Eliezer’s thoughts in particular to engage with them deeply; he wants to talk about generic futurist predictions such as near-term and far-term timelines. These discussions don’t deal in the complex, linguistic, representational, even philosophical problems at the core of Eliezer’s plan (though Ben is capable of dealing with them, they just don’t come up in discussions of AI fooms etc.), so even when he disagrees with Eliezer, Eliezer can quickly grasp his point. He is not a threat or a puzzle.
Whereas your comments are… very long, hard to follow, and often full of colorful or emotional statements that people here take as evidence of irrationality. You’re expecting people to work harder at understanding them than they’re going to. If you haven’t noticed, reputation counts for nothing here. For all their talk of Bayesianism, nobody is going to check your bio and say, “Hmm, he’s a professor of mathematics with 20 publications in artificial intellgence; maybe I should take his opinion as seriously as that of the high-school dropout who has no experience building AI systems.” And Eliezer has carefully indoctrinated himself against considering any such evidence.
So if you consider that the people most likely to find the flaws in Eliezer’s more-specific FAI & CEV plans are you and me, and that Eliezer has been public about calling both of us irrational people not worth talking with, this is consistent either with the hypothesis that his purpose is to discredit people who pose threats to his program, or with the hypothesis that his ego is too large to respond with anything other than dismissal to critiques that he can’t understand immediately or that trigger his “crackpot” patter-matcher, but not with the hypothesis that arguing with him will change his mind.
(I find the continual readiness of people to assume that Eliezer always speaks the truth odd, when he’s gone more out of his way than anyone I know, in both his blog posts and his fanfiction, to show that honest argumentation is not generally a winning strategy. He used to append a signature to his email along those lines, something about warning people not to assume that the obvious interpretation of what he said was the truth.)
RobbBB seems diplomatic, and I don’t think you should quit talking with him because Eliezer made you angry. That’s what Eliezer wants.
Actually, that was the first thing I did, not sure about other people. What I saw was:
Teaches at what appears to be a small private liberal arts college, not a major school.
Out of 20 or so publications listed on http://www.richardloosemore.com/papers, a bunch are unrelated to AI, others are posters and interviews, or even “unpublished”, which are all low-confidence media.
Several contributions are entries in conference proceedings (are they peer-reviewed? I don’t know) .
A number are listed as “to appear”, and so impossible to evaluate.
A few are apparently about dyslexia, which is an interesting topic, but not obviously related to AI.
One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer.
I could not find any external references to RL’s work except through links to Ben Goertzel (IEET was one exception).
As a result, I was unable to independently evaluate RL’s expertise level, but clearly he is not at the top of the AI field, unlike say, Ben Goertzel. Given his poorly written posts and childish behavior here, indicative of an over-inflated ego, I have decided that whatever he writes can be safely ignored. I did not think of him as a crackpot, more like a noise maker.
Admittedly, I am not sold on Eliezer’s ideas, either, since many other AI experts are skeptical of them, and that’s the only thing I can go by, not being an expert in the field myself. But at least Eliezer has done several impossible things in the last decade or so, which commands a lot of respect, while Richard appears to be drifting along.
At least a few of the RL authored papers are WITH Ben Goertzel, so some of Goertzel’s status should rub-off, as I would trust Goertzel to effectively evaluate collaborators.
Is there some assumption here that association with Ben Goertzel should be considered evidence in favour of an individual’s credibility on AI? That seems backwards.
Well, it does show that Goertzel respects his opinions at least enough to be willing to author a paper with him.
Goertzel appears to be a respected figuer in the field. Could you point the interested reader to your critique of his work?
Goertzel is also known for approving of people who are uncontroversially cranks. See here. It’s also known, via his cooperation with MIRI, that a collaboration with him in no way implies his endorsement of another’s viewpoints.
Comments can likely be found on this site from years ago. I don’t recall anything particularly in depth or memorable. It’s probably better to just look at things that Ben Goertzel says and making one’s own judgement. The thinking he expresses is not of the kind that impresses me but other’s mileage may vary.
I don’t begrudge anyone their right to their beauty contests but I do observe that whatever it is that is measured by identifying the degree of affiliation with Ben Goertzel is something wildly out of sync with the kind of thing I would consider evidence of credibility.
In CS, conference papers are generally higher status & quality than journal articles.
Name three? If only so I can cite them to Eliezer-is-a-crank people.
I advise against doing that. It is unlikely to change anyone’s mind.
By impossible feats I mean that a regular person would not be able to reproduce them, except by chance, like winning a lottery, starting Google, founding a successful religion or becoming a President.
He started as a high-school dropout without any formal education and look what he achieved so far, professionally and personally. Look at the organizations he founded and inspired. Look at the high-status experts in various fields (business, comp sci, programming, philosophy, math and physics) who take him seriously (some even give him loads of money). Heck, how many people manage to have multiple simultaneous long-term partners who are all highly intelligent and apparently get along well?
He’s achieved about what Ayn Rand achieved, and almost everyone thinks she wasa crank.
Basically this. As Eliezer himself points out, humans aren’t terribly rational on average and our judgements of each others’ rationality isn’t great either. Large amounts of support implies charisma, not intelligence.
TDT is closer to what I’m looking for, though it’s a … tad long.
Point, but there’s also the middle ground “I’m not sure if he’s a crank or not, but I’m busy so I won’t look unless there’s some evidence he’s not.”
The big two I’ve come up with is a) he actually changes his mind about important things (though I need to find an actual post I can cite—didn’t he reopen the question of the possibility of a hard takeoff, or something?) and b) TDT.
Won some AI box experiments as the AI.
Sure, but that’s hard to prove: given “Eliezer is a crank,” the probability of “Eliezer is lying about his AI-box prowess” is much higher than “Eliezer actually pulled that off.”
The latest success by a non-Eliezer person helps, but I’d still like something I can literally cite.
I don’t see why anyone would think that. Plenty of people in the anti-vaccination crowd managed to convince parents to mortally endanger their children.
Yes, but that’s really not that hard. For starters, you can do a better job of picking your targets.
The AI-box experiment often is run with intelligent, rational people with money on the line and an obvious right answer; it’s a whole lot more impossible than picking the right uneducated family to sell your snake oil to.
Ohh, come on. Cyclical reasoning here. You think Yudkowsky is not a crank, so you think the folks that play that silly game with him are intelligent and rational (by the way a plenty of people who get duped by anti-vaxxers are of above average IQ), and so you get more evidence that Yudkowsky is not a crank. Cyclical reasoning doesn’t persuade anyone who isn’t already a believer.
You need non-cyclical reasoning. Which would generally be something where you aren’t the one having to explain people that the achievement in question is profound.
You probably mean “circular”.
This bit confuses me.
That aside:
Non sequitur. From the posts they make, everyone on this site seems to me to be sufficiently intelligent as to make “selling snake oil” impossible, in a cut-and-dry case like the AI box. Yudowsky’s own credibility doesn’t enter into it.
I thought you wanted to persuade others.
So what do you think even happened, anyway, if you think the obvious explanation is impossible?
Yes, but I don’t see why this is relevant
Ah, sorry. This brand of impossible.
Originally, you were hypothesising that the problem with persuading the others would be the possibility that Yudkowsky lied about AI box powers. I pointed out the possibility that this experiment is far less profound than you think it is. (Albeit frankly I do not know why you think it is so profound).
What ever is the brand, any “impossibilities” that happen should lower your confidence in the reasoning that deemed them “impossibilities” in the first place. I don’t think IQ is so strongly protective against deception, for example, and I do not think that you can assess something based on how the postings look to you with sufficient reliability as to overcome Gaussian priors very far from the mean.
edit: example. I would deem it quite unlikely that Yudkowsky could, for example, score highly on a programming contest with competent participants or in any other conventional, validated, reliable metric of technical expertise and ability, under good contest rules (i.e. excluding the possibility of externals assistance). So if he did something like that, I’d be quite surprised, and lower the confidence in what ever models deemed that impossible; good old Bayes. I’m far more confident in the validity of those conventional metrics (and in lack of alternate modes of passing, such as persuasion) than in my assessment so my assessment would change the most. Meanwhile, when it’s some unconventional game, well, even if I thought that this game is difficult, I’d be much less confident in the reasoning “it looks hard so it must be hard” than the low prior of exceptional performance is low.
Further, in this case the whole purpose of the experiment was to demonstrate that an AI could “take over a gatekeeper’s mind through a text channel” (something previously deemed “impossible”). As far as that goes it was, in my view, successful.
It’s clearly possible for some values of “gatekeeper”, since some people fall for 419 scams. The test is a bit meaningless without information about the gatekeepers
Still have no idea what you’re talking about. What I originally said was: “the people who talk to Yudkowsky are intelligent” does not follow from “Yudkowsky is not a crank”; I independently judge those people to be intelligent.
“Impossible,” here, is used in the sense that “I have no idea where to start thinking about where to start thinking about how to do this.” It is clearly not actually impossible because it’s been done, twice.
And point about the contest.
I thought your “impossible” at least implied “improbable” under some sort of model.
edit: and as of having no idea, you just need to know the shared religious-ish context. Which these folks generally keep hidden from a causal observer.
Impossible is being used as a statement of difficulty. Someone who has “done the impossible” has obviously not actually done something impossible, merely done something that I have no idea where to start trying.
Seeing that “it is possible to do” doesn’t seem like it would have much effect on my assessment of how difficult it is, after the first. It certainly doesn’t have match effect on “It is very-very-difficult-impossible for linkhyrule5 to do such a thing.”
What?
First, I’m pretty sure you mean “casual.” Second, I’m hardly a casual observer, though I haven’t read everything either. Third, most religions don’t let their leading figures (or much of anyone, really) change their minds on important things...
Some folks on this site have accidentally bought unintentional snake oil in The Big Hoo Hah That Shall not Be Mentioned. Only an intelligent person could have bought that particular puppy,
Granted. And it may be that additional knowledge/intelligence makes yourself more vulnerable a Gatekeeper.
Trying to think this out in terms of levels of smartness alone is very unlikely to be helpful.
Well yes. It is a factor, no more no less.
My point is, there is a certain level of general competence after which I would expect convincing someone with an OOC motive to let an IC AI out to be “impossible,” as defined below.
But less than half of them, I’ll wager. This is clearly an abuse of averages.
I wouldn’t wager too much money on that one. http://pediatrics.aappublications.org/content/114/1/187.abstract .
And in any case the point is that any correlation between IQ and not being prone to getting duped like this is not perfect enough to deem anything particularly unlikely.
Hmm. Yeah, that’s hardly conclusive, but I think I was actually failing to update there. Now that you mention it, I seem to recall that both conspiracy theorists and cult victims skew toward higher IQ. I was clearly quite overconfident there.
Wasn’t the point that
wasn’t enough, actually? That seems like a much stronger claim than “it’s really hard to fool high-IQ people”.
I imagine that says more about the demographics of the general New Age belief cluster than it does about any special IQ-based appeal of vaccination skepticism.
There probably are some scams or virulent memes that prey on insecurities strongly correlated with high IQ, though. I can’t think of anything specific offhand, but the fringes of geek culture are probably one of the better places to start looking.
Well, the way I see it, outside of very high IQ in combination with education that is multiple topics of biochemistry, effects of intelligence are small and are easily dwarfed by things like those demographical correlations.
Free energy scams. Hydrinos, cold fusion, magnetic generators, perpetual motion, you name it. edit: or in the medicine, counter intuitive stuff like sitting in an old uranium mine inhaling radon, then having so much radon progeny plate-out it sets nuclear material smuggling alarms off. Naturalistic fallacy stuff in general.
Cryonics. ducks and runs
Edit: It was a joke. Sorryyyyyy
That is more persuasive to high IQ people, but, I think, only insofar as intelligence allows one to gain better rationality skills. And if we’re including that, there are plenty of other, facetious examples that come into play.
Also: ha ha. How hilarious. I would love to see why you class cryonics as a scam, but sadly I’m fairly certain it would be one of the standard mistakes.
Also, maybe its a matter of semantics, but winning a game that you created isn’t really ‘doing the impossible’ in the sense I took the phrasing.
Winning a game you created… that sounds as impossible to win as that?
I was in a rush last night, shminux, so I didn’t have time for a couple of other quick clarifications:
First, you say “One relevant paper was in H+ magazine, a place I have never heard of before and apparently not a part of any well-known scientific publishing outlet, like Springer.”
Well, H+ magazine is one of the foremost online magazines (perhaps THE foremost online magazine) of the transhumanist community.
And, you mention Springer. You did not notice that one of my papers was in the recently published Springer book “Singularity Hypotheses”.
Second, you say “A few [of my papers] are apparently about dyslexia, which is an interesting topic, but not obviously related to AI.”
Actually they were about dysgraphia, not dyslexia … but more importantly, those papers were about computational models of language processing. In particular they were very, VERY simple versions of the computational model of human language that is one of my special areas of expertise. And since that model is primarily about learning mechanisms (the language domain is only a testbed for a research programme whose main focus is learning), those papers you saw were actually indicative that back in the early 1990s I was already working on the construction of the core aspects of an AI system.
So, saying “dyslexia” gives a very misleading impression of what that was all about. :-)
That is a very interesting assessment, shminux.
Would you be up for some feedback?
You are quite selective in your catalog of my achievements....
One item was a chapter in a book entitled “Theoretical Foundations of Artificial General Intelligence”. Sure, it was about the consciousness question, but still.
You make a casual disparaging remark about the college where I currently work … but forget to mention that I graduated from an institution that is ranked in the top 3 or 4 in the world (University College London).
You neglect to mention that I have academic qualifications in multiple fields—both physics and artificial intelligence/cognitive psychology. I now teach in both of those fields.
And in addition to all of the above, you did not notice that I am (in addition to my teaching duties) an AI developer who works on his projects WITHOUT intending to publish that work all the time! My AI work is largely proprietary. What you see from the outside are the occasional spinoffs and side projects that get turned into published writings. Not to be too coy, but isn’t that something you would expect from someone who is actually walking the walk....? :-)
There are a number of comments from other people below about Ben Goertzel, some of them a little strange. I wrote a paper a couple of years ago that Ben suggested we get together to and publish… that is now a chapter in the book “Singularity Hypotheses”.
So clearly Ben Goertzel (who has a large, well-funded AGI lab) is not of the opinion that I am a crank. Could I get one point for that?
Phil Goetz, who is an experienced veteran of the AGI field, has on this thread made a comment to the effect that he thinks that Ben Goertzel, himself, and myself are the three people Eliezer should be seriously listening to (since the three of us are among the few people who have been working on this problem for many years, and who have active AGI projects). So perhaps that is two points? Maybe?
And, just out of curiosity, I would invite you to check in with the guy who invented AIXI—Marcus Hutter. He and I met and had a very long discussion at the 2009 AGI conference. Marcus and I disagree substantially about the theoretical foundations of AI, but in spite of that disagreement I would urge you to ask him if he considers me to be down at the crank level. I might be wrong, but I do not think he would be willing to give me a bad reference. Let me know how that goes, yes?
You also finished off with what I can only describe as one of the most bizarre comparisons I have ever seen. :-) You say “Eliezer has done several impossible things in the last decade or so”. Hmmmm....! :-) And yet … “Richard appears to be drifting along” Well, okay, if you say so …. :-)
I have no horse in this race, and I am not an ardent EY supporter, or even count myself as a “rationalist”. In the area where I consider myself reasonably well trained, physics, he and I clashed a number of times on this forum. However, I am not an expert in the AI field, so I can only go by the outward signs of expertise. Ben Goertzel has them, Marcus Hutter has them, Eliezer has them. Richard Loosemore—not so much. For all I know, you might be the genius who invents the AGI and sets it loose someday, but it’s not obvious by looking online. And your histrionic comments and oversized ego make it appear rather unlikely.
I agree with pretty much all of the above.
I didn’t quit with Rob, btw. Ihave had a fairly productive—albeit exhausting—discussion with Rob over on his blog. I consider it to be productive because I have managed to narrow in on what he thinks is the central issue. And I think I have now (today’s comment, which is probably the last of the discussion) managed to nail down my own argument in a way that withstands all the attacks against it.
You are right that I have some serious debating weaknesses. I write too dense, and I assume that people have my width and breadth of experience, which is unfair (I got lucky in my career choices).
Oh, and don’t get me wrong: Eliezer never made me angry in this little episode. I laughed myself silly. Yeah, I protested. But I was wiping back tears of laughter while I did. “Known Permanent Idiot” is just a wondeful turn of phrase. Thanks, Eliezer!
Link to the nailed-down version of the argument?
Bottommost (September 9, 6:03 PM) comment here.
Oh, yeah, I found that myself eventually.
Anyway, I went and read the the majority of that discussion (well, the parts between Richard and Rob). Here’s my summary:
Richard:
[Rob responds]
Richard:
[Rob responds]
Richard:
[Rob responds]
Richard:
[Rob responds]
Richard:
[Rob responds]
Richard:
Rob:
Richard:
I snipped a lot of things there. I found lots of other points I wanted to emphasize, and plenty of things I wanted to argue against. But those aren’t the point.
Richard, this next part is directed at you.
You know what I didn’t find?
I didn’t find any posts where you made a particular effort to address the core of Rob’s argument. It was always about your argument. Rob was always the one missing the point.
Sure, it took Rob long enough to focus on finding the core of your position, but he got there eventually. And what happened next? You declared that he was still missing the point, posted a condensed version of the same argument, and posted here that your position “withstands all the attacks against it.”
You didn’t even wait for him to respond. You certainly didn’t quote him and respond to the things he said. You gave no obvious indication that you were taking his arguments seriously.
As far as I’m concerned, this is a cardinal sin.
How about this alternate hypothesis? Your explanations are fine. Rob understands what you’re saying. He just doesn’t agree.
Perhaps you need to take a break from repeating yourself and make sure you understand Rob’s argument.
(P.S. Eliezer’s ad hominem is still wrong. You may be making a mistake, but I’m confident you can fix it, the tone of this post notwithstanding.)
This entire debate is supposed to about my argument, as presented in the original article I published on the IEET.org website (“The Fallacy of Dumb Superintelligence”).
But in that case, what should I do when Rob insists on talking about something that I did not say in that article?
My strategy was to explain his mistake, but not engage in a debate about his red herring. Sensible people of all stripes would consider that a mature response.
But over and over again Rob avoided the actual argument and insisted on talking about his red herring.
And then FINALLY I realized that I could write down my original claim in such a way that it is IMPOSSIBLE for Rob to misinterpret it.
(That was easy, in retrospect: all I had to do was remove the language that he was using as the jumping-off point for his red herring).
That final, succinct statement of my argument is sitting there at the end of his blog ….. so far ignored by you, and by him. Perhaps he will be able to respond, I don’t know, but you say you have read it, so you have had a chance to actually understand why it is that he has been talking about something of no relevance to my original argument.
But you, in your wisdom, chose to (a) completely ignore that statement of my argument, and (b) give me a patronizing rebuke for not being able to understand Rob’s red herring argument.
Further to RobbBB’s highly useful reply (you should read it), I would like to point out that the phenomenon of optimization processes having unexpected or unusual behaviour when placed in new situations or given new capabilities is quite general. It applies to pretty much any strong optimization process working toward any goal.
For example, you program an AI/robot with no resources to fetch a banana as quickly as possible in some sandbox environment, where the banana is guarded by a few people. The AI will describe its strategy as talking to the guards and convincing them to give it the banana. Or perhaps buying the banana off them if you give it some money. But there are many ways to change the situation:
Make the AI much smarter—it will realise that by taking a certain sequence of steps it can get from here to the banana without ever moving within someone’s field of view. No need to pay for it. (1.5—Make the AI even smarter, and it will realise that by painting a Basilisk on the robot’s front it can disable all the guards effortlessly and take a straight line to the banana.)
Give the AI a teleportation device—it teleports straight to the banana.
Give the AI a plasma cannon—it shoots all the guards dead and runs up and takes the banana.
The point is… it’s not at all weird for AI behaviour to be “inconsistent”. It isn’t a sign of anything being broken, in fact the goal is being achieved. The AI is just able to think of more effective ways to do it then you are. That is, after all, the point of superintelligence. And an AI that does this is not broken or stupid, and is certainly capable of being dangerous.
By the way, you can try to do something like this:
But, to start with I have no idea how you would program this or what it means formally, but even if you could, it takes human judgement to identify “inconsistencies” that would matter to humans. Without embedding human values in there you’ll have the AI shut down every time it tries to do anything new, or use a stronger criterion of “inconsistency” and miss a few cases where the AI does something you actually don’t want.
Or, you know, the AI will deduce that the full “verbal description of the class of results X” (which is an infinite list) is of course defined by its goal (ie. the goalX code) and therefore reason that nothing the goalX code can do will be inconsistent with it.
I didn’t mean to ignore your argument; I just didn’t get around to it. As I said, there were a lot of things I wanted to respond to. (In fact, this post was going to be longer, but I decided to focus on your primary argument.)
Your story:
My version:
Your story:
My version:
In the rest of the scenario you described, I agree that the AI’s behavior is pretty incoherent, if its goal is X. But if it’s really aiming for Z, then its behavior is perfectly, terrifyingly coherent.
And your “obvious” fail-safe isn’t going to help. The AI is smarter than us. If it wants Z, and a fail-safe prevents it from getting Z, it will find a way around that fail-safe.
I know, your premise is that X really is the AI’s true goal. But that’s my sticking point.
Making it actually have the goal X, before it starts self-modifying, is far from easy. You can’t just skip over that step and assume it as your premise.
What you say makes sense …. except that you and I are both bound by the terms of a scenario that someone else has set here.
So, the terms (as I say, this is not my doing!) of reference are that an AI might sincerely believe that it is pursuing its original goal of making humans happy (whatever that means …. the ambiguity is in the original), but in the course of sincerely and genuinely pursuing that goal, it might get into a state where it believes that the best way to achieve the goal is to do something that we humans would consider to be NOT achieving the goal.
What you did was consider some other possibilities, such as those in which the AI is actually not being sincere. Nothing wrong with considering those, but that would be a story for another day.
Oh, and one other thing that arises from your above remark: remember that what you have called the “fail-safe” is not actually a fail-safe, it is an integral part of the original goal code (X). So there is no question of this being a situation where ”… it wants Z, and a fail-safe prevents it from getting Z, [so] it will find a way around that fail-safe.” In fact, the check is just part of X, so it WANTS to check as much as wants anything else involved in the goal.
I am not sure that self-modification is part of the original terms of reference here, either. When Muehlhauser (for example) went on a radio show and explained to the audience that a superintelligence might be programmed to make humans happy, but then SINCERELY think it was making us happy when it put us on a Dopamine Drip, I think he was clearly not talking about a free-wheeling AI that can modify its goal code. Surely, if he wanted to imply that, the whole scenario goes out the window. The AI could have any motivation whatsoever.
Hope that clarifies rather than obscures.
Ok, if you want to pass the buck, I won’t stop you. But this other person’s scenario still has a faulty premise. I’ll take it up with them if you like; just point out where they state that the goal code starts out working correctly.
To summarize my complaint, it’s not very useful to discuss an AI with a “sincere” goal of X, because the difficulty comes from giving the AI that goal in the first place.
As I see it, your (adopted) scenario is far less likely than other scenario(s), so in a sense that one is the “story for another day.” Specifically, a day when we’ve solved the “sincere goal” issue.
That all depends on the approach… if you have some big human-inspired but more brainy neural network that learns to be a person, it can well just do the right thing by itself, and the risks are in any case quite comparable to that with having a human do it.
If you are thinking of a “neat AI” with utility functions over world models and such, parts of said AI can maximize abstract metrics over mathematical models (including self improvement) without any “generally intelligent” process of eating you. So you would want to use those to build models of human meaning and intent.
Furthermore with regards to AI following some goals, it seems to me that goal specifications would have to be intelligently processed in the first place so that they could be actually applied to the real world—we can’t even define paperclips otherwise.
I tried arguing basically the same thing.
The most coherent reply I got was that an AI doesn’t follow verbal instructions and we can’t just order the AI to “make humans happy”, or even “make humans happy, in the way that I mean”. You can only tell the AI to make humans happy by writing a program that makes it do so. It doesn’t matter if the AI grasps what you really want it to do, if there is a mismatch between the program and what you really want it to do, it follows the program.
Obviously I don’t buy this. For one thing, you can always program it to obey verbal instructions, or you can talk to it and ask it how it will make people happy.
Jiro: Did you read my post? I discuss whether getting an AI to ‘obey verbal instructions’ is a trivial task in the first named section. I also link to section 2 of Yudkowsky’s reply to Holden, which addresses the question of whether ‘talk to it and ask it how it will make people happy’ is generally a safe way to interact with an Unfriendly Oracle.
I also specifically quote an argument you made in section 2 that I think reflects a common mistake in this whole family of misunderstandings of the problem — the conflation of the seed AI with the artificial superintelligence it produces. Do you agree this distinction helps clarify why the problem is one of coding the right values, and not of coding the right factual knowledge or intelligence-relevant capacities?