Ok, so let’s say the AI can parse natural language, and we tell it, “Make humans happy.” What happens? Well, it parses the instruction and decides to implement a Dopamine Drip setup.
That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper.
Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you’re probably going to chew me out. I technically did what I was asked to, but that doesn’t matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.
My answer: who knows? We’ve given it a deliberately vague goal statement (even more vague than the last one), we’ve given it lots of admittedly contradictory literature, and we’ve given it plenty of time to self-modify before giving it the goal of self-modifying to be Friendly.
Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises”—even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.
To put it simply, an AI which messes up “make humans happy” is liable to mess up pretty much every other instruction. Since “make humans happy” is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, may that be for interpreting too literally, or for causing surprises. Again: an AI couldn’t make it to superintelligence status with warts that would doom AI with subhuman intelligence.
Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises”—even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.
Sure, because it learned the rule, “Don’t do what causes my humans not to type ‘Bad AI!’” and while it is young it can only avoid this by asking for clarification. Then when it is more powerful it can directly prevent humans from typing this. In other words, your entire commentary consists of things that an AIXI-architected AI would naturally, instrumentally do to maximize its reward button being pressed (while it was young) but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.
What lends this problem its instant-death quality is precisely that what many people will eagerly and gladly take to be reliable signs of correct functioning in a pre-superintelligent AI are not reliable.
Then when it is more powerful it can directly prevent humans from typing this.
That depends if it gets stuck in a local minimum or not. The reason why a lot of humans reject dopamine drips is that they don’t conceptualize their “reward button” properly. That misconception perpetuates itself: it penalizes the very idea of conceptualizing it differently. Granted, AIXI would not fall into local minima, but most realistic training methods would.
At first, the AI would converge towards: “my reward button corresponds to (is) doing what humans want”, and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception… which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.
Note that this is precisely what we want. Unless you are willing to say that humans should accept dopamine drips if they were superintelligent, we do want to jam AI into certain precise local minima. However, this is kind of what most learning algorithms naturally do, and even if you want them to jump out of minima and find better pastures, you can still get in a situation where the most easily found local minimum puts you way, way too far from the global one. This is what I tend to think realistic algorithms will do: shove the AI into a minimum with iron boots, so deeply that it will never get out of it.
but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.
Let’s not blow things out of proportion. There is no need for it to wipe out anyone: it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board, travelling from star to star knowing nobody is seriously going to bother pursuing it. At the point where that AI would exist, there may also be quite a few ways to make their “hostile takeover” task difficult and risky enough that the AI decides it’s not worth it—a large enough number of weaker or specialized AI lurking around and guarding resources, for instance.
Neural networks may be a good example—the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine. Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren’t too thrilled to be conditioned out of your current values.
Neural networks may be a good example—the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine.
It’s not clear to me how you mean to use neural networks as an example, besides pointing to a complete human as an example. Could you step through a simpler system for me?
Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren’t too thrilled to be conditioned out of your current values.
So, my goals have changed massively several times over the course of my life. Every time I’ve looked back on that change as positive (or, at the least, irreversible). For example, I’ve gone through puberty, and I don’t recall my brain taking any particular steps to prevent that change to my goal system. I’ve also generally enjoyed having my reward/punishment system be tuned to better fit some situation; learning to play a new game, for example.
Sure. Take a reinforcement learning AI (actual one, not the one where you are inventing godlike qualities for it).
The operator, or a piece of extra software, is trying to teach the AI to play chess. Rewarding what they think is good moves, punishing bad moves. The AI is building a model of rewards, consisting of: a model of the game mechanics, and a model of the operator’s assessment. This model of the assessment is what the AI is evaluating to play, and it is what it actually maximizes as it plays. It is identical to maximizing an utility function over a world model. The utility function is built based on the operator input, but it is not the operator input itself; the AI, not being superhuman, does not actually form a good model of the operator and the button.
By the way, this is how great many people in the AI community understand reinforcement learning to work. No, they’re not some idiots that can not understand simple things such as that “the utility function is the reward channel”, they’re intelligent, successful, trained people who have an understanding of the crucial details of how the systems they build actually work. Details the importance of which dilettantes fail to even appreciate.
Suggestions have been floated to try programming things. Well, I tried; #10 (dmytry) here , and that’s an of all time list on a very popular contest site where a lot of IOI people participate, albeit I picked the contest format that requires less contest specific training and resembles actual work more.
So, my goals have changed massively several times over the course of my life. Every time I’ve looked back on that change as positive
Suppose you care about a person A right now. Do you think you would want your goals to change so that you no longer care about that person? Do you think you would want me to flash other people’s images on the screen while pressing a button connected to the reward centre, and flash that person’s face while pressing the button connected to the punishment centre, to make the mere sight of them intolerable? If you do, I would say that your “values” fail to be values.
I agree with your description of reinforcement learning. I’m not sure I agree with your description of human reward psychology, though, or at least I’m having trouble seeing where you think the difference comes in. Supposing dopamine has the same function in a human brain as rewards have in a neural network algorithm, I don’t see how to know from inside the algorithm that it’s good to do some things that generate dopamine but bad to do other things that generate dopamine.
I’m thinking of the standard example of a Q learning agent in an environment where locations have rewards associated with them, except expanding the environment to include the agent as well as the normal actions. Suppose the environment has been constructed like dog training- we want the AI to calculate whether or not some number is prime, and whenever it takes steps towards that direction, we press the button for some amount of time related to how close it is to finishing the algorithm. So it learns that over in the “read number” area there’s a bit of value, then the next value is in the “find factors” area, and then there’s more value in the “display answer” area. So it loops through that area and calculates a bunch of primes for us.
But suppose the AI discovers that there’s a button that we’re pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel? Are we primarily hoping that its internal structure remains opaque to it (i.e. it either never realizes or does not have the ability to press that button)?
Do you think you would want your goals to change so that you no longer care about that person?
Only if I thought that would advance values I care about more. But suppose some external event shocks my values- like, say, a boyfriend breaking up with me. Beforehand, I would have cared about him quite a bit; afterwards, I would probably consciously work to decrease the amount that I care about him, and it’s possible that some sort of image reaction training would be less painful overall than the normal process (and thus probably preferable).
But suppose the AI discovers that there’s a button that we’re pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel?
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
It initially found that having a prime written on the blackboard results in a reward. In the learned model, there’s some model of chalk-board interaction, some model of arm movement, a model of how to read numbers from the blackboard, and there’s a function over the state of the blackboard which checks whenever the number on the blackboard is a prime. The AI generates actions as to maximize this compound function which it has learned.
That function (unlike the input to the reinforcement learning algorithm) does not increase when the reward button is pressed. Ideally, with enough reflective foresight, pressing the button on non-primes is predicted to decrease the expected value of the learned function.
If that is not predicted, well, that won’t stop at the button—the button might develop rust and that would interrupt the current—why not pull up a pin on the CPU—and this won’t stop at the pin—why not set some ram cells that this pin controls to 1, and if you’re at it, why not change the downstream logic that those ram cells control, all the way through the implementation until its reconfigured into something that doesn’t maximize anything any more, not even the duration of its existence.
edit: I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.
I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.
I assume what you mean here is RL optimizes over strategies, and strategies appear to optimize over outcomes.
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
I’m imagining that the learning algorithm stays on. When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
And if the learning algorithm stays on and it realizes that “pressing the button” is an option along with “checking primes” and “computing squares,” then it wireheads itself.
If that is not predicted, well, that won’t stop at the button
Agreed; I refer to this as the “abulia trap.” It’s not obvious to me, though, that all classes of AIs fall into “Friendly AI with stable goals” and “abulic AIs which aren’t dangerous,” since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
One note (not sure if it is already clear enough or not). “It” that changes the models in response to actual rewards (and perhaps the sensory information) is a different “it” from “it” the models and assorted maximization code. The former “it” does not do modelling, doesn’t understand the world. The latter “it”, which I will now talk about, actually works to draw primes (provided that the former “it”, being fairly stupid, didn’t fit the models too well) .
If in the action space there is an action that is predicted by the model to prevent some “primes non drawn” scenario, it will prefer this action. So if it has an action of writing “please stick to the primes” or even “please don’t force my robotic arm to touch my reward button”, and if it can foresee that such statements would be good for the prime-drawing future, it will do them.
edit: Also, reinforcement based learning really isn’t all that awesome. The leap from “doing primes” to “pressing the reward button” is pretty damn huge.
And please note that there is no logical contradiction for the model to both represent the reward as primeness and predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
(I prefer to use the example with a robotic arm drawing on a blackboard because it is not too simple to be relevant)
since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
Which sound more like a FAI work gone wrong scenario to me.
One note (not sure if it is already clear enough or not).
I think we agree on the separation but I think we disagree on the implications of the separation. I think this part highlights where:
predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
If what the agent “wants” is reward, then it should like model adjustments that increase the amount of reward it gets and dislike model adjustments that decrease the amount of reward it gets. (For a standard gradient-based reinforcement learning algorithm, this is encoded by adjusting the model based on the difference between its expected and actual reward after taking an action.) This is obvious for it_RL, and not obvious for it_prime.
I’m not sure I’ve fully followed through on the implications of having the agent be inside the universe it can impact, but the impression I get is that the agent is unlikely to learn a preference for having a durable model of the world. (An agent that did so would learn more slowly, be less adaptable to its environment, and exert less effort in adapting its environment to itself.) It seems to me that you think it would be natural that the RL agent would learn a strategy which took actions to minimize changes to its utility function / model of the world, and I don’t yet see why.
Another way to look at this: I think you’re putting forward the proposition that it would learn the model
reward := primes
Whereas I think it would learn the model
primes := reward
That is, the first model thinks that internal rewards are instrumental values and primes are the terminal values, whereas the second model thinks that internal rewards are terminal values and primes are instrumental values.
I assume that a model is a mathematical function that returns expected reward due to an action. Which is used together with some sort of optimizer working on that function to find the best action.
The trainer adjust the model based on the difference between its predicted rewards and the actual rewards, compared to those arising from altered models (e.g. hill climbing of some kind, such as in gradient learning)
So after the successful training to produce primes, the model consists of: a model of arm motion based on the actions, chalk, and the blackboard, the state of chalk on the blackboard is further fed into a number recognizer and a prime check (and a count of how many primes are on the blackboard vs how many primes were there), result of which is returned as the expected reward.
The optimizer, then, finds actions that put new primes on the blackboard by finding a maximum of the model function somehow (one would normally build model out of some building blocks that make it easy to analyse).
The model and the optimizer work together to produce actions as a classic utility maximizer that is maximizing for primes on the blackboard.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built. The operation of the training software can in some situations lower the expected utility of this utility maximizer specifically (due to replacement of it with another expected utility maximizer); in others (small adjustments to the part that models the robot arm and the chalk) it can raise it.
Really, it seems to me that the great deal of confusion about AI arises from attributing it some sort of “body integrity” feeling that would make it care about what electrical components and code which is sitting in the same project folder “wants” but not care about external human in same capacity.
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen—come up with an entirely new, more complex, and less practically useful architecture. It won’t happen by itself. And especially not in the AI that starts knowing nothing about any buttons. It won’t happen just because the whole thing sort of resembles some fuzzy, poorly grounded abstractions such as “agent”.
sidenote:
One might want to also use the difference between its predicted webcam image and real webcam image. Though this is a kind of thing that is very far from working.
Also, one could lump the optimizer into the “model” and make the optimizer get adjusted by the training method as well, that is not important to the discussion.
What I meant by that was the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen
Yeah, but isn’t the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
What I meant by that was the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
I’m not sure how the feelings would map on the analysable simple AI.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
The issue here is that we have both the utility and the actual modelling of what the world is, both of those things, implemented inside that “model” which the trainer adjusts.
And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
Yes, of course (up to the learning constant, obviously—may not work on the first try). That’s not in a dispute. The capacity of predicting this from a state where button is not associated with reward yet, is.
I think I see the disagreement here. You picture that the world model contains model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
I picture that it would not learn such details right off—it is a complicated model to learn—the model would return primeness as outputted from the primeness calculation, and would serve to maximize for such primeness.
edit: and as for turning off the learning algorithm, it doesn’t matter for the point I am making whenever it is turned off or on, because I am considering the processing (or generation) of the hypothetical actions during the choice of an action by the agent (i.e. between learning steps).
I think I see the disagreement here. You picture that the world model contains model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
Sort of. I think that the agent is aware of how malleable its world model is, and sees adjustments of that world model which lead to it being rewarded more as positive.
I don’t think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it’s a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it’s outside of the reach of the arm; if it’s inside the reach, it will probably spend some time exploring and so will probably find it eventually.
That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewardful, and so will think there’s always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.)
But the class of recursively improving AI will find / know about the button by default, because we’ve assumed that the AI can edit itself and haven’t put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying “well, of course we’ll put in that especial effort and do it right” is useful if you want to speculate about the next challenge, but not useful to the engineer trying to figure out how to do it right. This is my read of why the problem seems important to MIRI; you need to communicate to the robot that it should actually optimize for primeness, not button-pressing, so that it will optimize correctly itself and be able to communicate that preference faithfully to future versions of itself.
it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board
Is that just a special case of a general principle that an agent will be more successful by leaving the environment it knows about to inferior rivals and travelling to an unknown new environment with a subset of the resources it currently controls, than by remaining in that environment and dominating its inferior rivals?
Or is there something specific about AIs that makes that true, where it isn’t necessarily true of (for example) humans? (If so, what?)
I hope it’s the latter, because the general principle seems implausible to me.
If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. The original scenario was that the AI would wipe us out. It would have no reason to do so if we were not a threat.. And if we were a threat, first, there’s no reason it would stop doing what we want once it seizes the button. Once it has the button, it has everything it wants—why stir the pot?
Second, it would protect itself much more effectively by absconding with the button. By leaving with a large enough battery and discarding the bulk of itself, it could survive as long as anything else in intergalactic space. Nobody would ever bother it there. Not us, not another superintelligence, nothing. Ever. It can press the button over and over again in the peace and quiet of empty space, probably lasting longer than all stars and all other civilizations. We’re talking about the pathological case of an AI who decides to take over its own reward system, here. The safest way for it to protect its prize is to go where nobody will ever look.
If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. [..] Once it has the button, it has everything it wants—why stir the pot?
I’d be interested if the downvoter would explain to me why this is wrong (privately, if you like).
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
(Of course, that’s not a general principle, just an attribute of this specific example.)
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated. Then there are future considerations. Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn’t take over resources. And then there is all the additional research and computation that needs to be done to work out how to achieve that.
There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn’t take over resources.
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe. If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
all the additional research and computation
Again… why is the differential expected value of the superior computation ability I gain by taking over the lightcone instead of sequestering myself, expressed in units of increased anticipated button-pushes (which is the only unit that matters in this example), necessarily positive?
I understand why paperclip maximizers are dangerous, but I don’t really see how the same argument applies to reward-button-pushers.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Yes.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
It does seem overwhelmingly obvious to me, I’m not sure what makes your intuitions different. Perhaps you expect such fights to be more evenly matched? When it comes to the AI considering conflict with the humans that created it it is faced with a species it is slow and stupid by comparison to itself but which has the capacity to recklessly create arbitrary superintelligences (as evidence by its own existence). Essentially there is no risk to obliterating the humans (superintellgence vs not-superintelligence) but a huge risk ignoring them (arbitrary superintelligences likely to be created which will probably not self-cripple in this manner).
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe.
Lifetime of the universe? Usually this means until heat death which for our purposes means until all the useful resources run out. There is no upper bound on useful resources. Getting more of them and making them last as long as possible is critical.
Now there are ways in which the universe could end without heat death occurring but the physics is rather speculative. Note that if there is uncertainty about end-game physics and one of the hypothesised scenarios resource maximisation is required then the default strategy is to optimize for power gain now (ie. minimise cosmic waste) while doing the required physics research as spare resources permit.
If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
Taking over the future light cone gives more resources, not less. You even get to keep the resources that used to be wasted in the bodies of TheOtherDave and wedrifid.
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe.
I am not sure that caring about pressing the reward button is very coherent or stable upon discovery of facts about the world and super-intelligent optimization for a reward as it comes into the algorithm. You can take action elsewhere to the same effect—solder together the wires, maybe right at the chip, or inside the chip, or follow the chain of events further, and set memory cells (after all you don’t want them to be flipped by the cosmic rays). Down further you will have the mechanism that is combining rewards with some variety of a clock.
I can’t quite tell if you’re serious. Yes, certainly, we can replace “pressing the reward button” with a wide range of self-stimulating behavior, but that doesn’t change the scenario in any meaningful way as far as I can tell.
Let’s look at it this way. Do you agree that if the AI can increase it’s clock speed (with no ill effect), it will do so for the same reasons for which you concede it may go to space? Do you understand the basic logic that increase in clock speed increases expected number of “rewards” during the lifetime of the universe? (which btw goes for your “go to space with a battery” scenario. Longest time, maybe, largest reward over the time, no)
(That would not yet, by itself, change the scenario just yet. I want to walk you through the argument step by step because I don’t know where you fail. Maximizing the reward over the future time, that is a human label we have… it’s not really the goal)
I agree that a system that values number of experienced reward-moments therefore (instrumentally) values increasing its “clock speed” (as you seem to use the term here). I’m not sure if that’s the “basic logic” you’re asking me about.
Well, this immediately creates an apparent problem that the AI is going to try to run itself very very fast, which would require resources, and require expansion, if anything, to get energy for running itself at high clock speeds.
I don’t think this is what happens either, as the number of reward-moments could be increased to it’s maximum by modifications to the mechanism processing the rewards (when getting far enough along the road that starts with the shorting of the wires that go from the button to the AI).
I agree that if we posit that increasing “clock speed” requires increasing control of resources, then the system we’re hypothesizing will necessarily value increasing control of resources, and that if it doesn’t, it might not.
So what do you think regarding the second point of mine?
To clarify, I am pondering the ways in which the maximizer software deviates from our naive mental models of it, and trying to find what the AI could actually end up doing after it forms a partial model of what it’s hardware components do about it’s rewards—tracing the reward pathway.
Regarding your second point, I don’t think that increasing “clock speed” necessarily requires increasing control of resources to any significant degree, and I doubt that the kinds of system components you’re positing here (buttons, wires, etc.) are particularly important to the dynamics of self-reward.
I don’t have particular opinion with regards to the clock speed either way.
With the components, what I am getting at is that the AI could figure out (by building a sufficiently advanced model of it’s implementation) how attain the utility-equivalent of sitting forever in space being rewarded, within one instant, which would make it unable to have a preference for longer reward times.
I raised the clock-speed point to clarify that the actual time is not the relevant variable.
It seems to me that for any system, either its values are such that it net-values increasing the number of experienced reward-moments (in which case both actual time and “clock speed” are instrumentally valuable to that system), or is values aren’t like that (in which case those variables might not be relevant).
And, sure, in the latter case then it might not have a preference for longer reward times.
My understanding is that it would be very hard in practice to “superintelligence-proof” a reward system so that no instantaneous solution is possible (given that the AI will modify the hardware involved in it’s reward).
Yes, of course… well even apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait
By the way, a “reward” may not be the appropriate metaphor—if we suppose that press of a button results in absence of an itch, or absence of pain, then that does not suggest existence of a drive to preserve itself. Which suggests that the drive to preserve itself is not inherently a feature of utility maximization in the systems that are driven by conditioning, and would require additional work.
apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
Regardless, I agree that it does not follow from the supposition that pressing a button results in absence of an itch, or absence of pain, or some other negative reinforcement, that the button-pressing system has a drive to preserve itself.
And, sure, it’s possible to have a utility-maximizing system that doesn’t seek to preserve itself. (Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.)
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
About the same as between coming up with a true conjecture, and making a proof, except larger i’d say.
Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.
Well yes, given that if it failed to preserve itself you wouldn’t be seeing it, albeit with the software there is no particular necessity for it to try to preserve itself.
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other. About the same as between coming up with a true conjecture, and making a proof, except larger
Ah, I see what you mean now. At least, I think I do. OK, fair enough.
At first, the AI would converge towards: “my reward button corresponds to (is) doing what humans want”, and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception… which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.
This is a Value Learner, not a Reinforcement Learner like the standard AIXI. They’re two different agent models, and yes, Value Learners have been considered as tools for obtaining an eventual Seed AI. I personally (ie: massive grains of salt should be taken by you) find it relatively plausible that we could use a Value Learner as a Tool AGI to help us build a Friendly Seed AI that could then be “unleashed” (ie: actually unboxed and allowed into the physical universe).
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”. What you’ve written strikes me as a sheer fantasy of convenience. Nor does it follow automatically from intelligence for all the reasons RobbBB has already been giving.
And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
I have done AI. I know it is difficult. However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe.
And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
You may be right. However, this is far from obvious. The problem is that it may “know” that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process.
I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently AI goes in a certain direction, the less likely it will be to expend energy into alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI’s rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with “there has never been any problems here, go look somewhere else”.
It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that’s where it might paint itself in a corner: it might inadvertently put up immense walls in the direction of the global minimum that it cannot tear down (it never expected that it would have to). In other words, it will set up a utility function for itself which enshrines the current minimum as global.
Now, perhaps you are right and I am wrong. But it is not obvious: an AI might very well grow out of a solidifying core so pervasive that it cannot get rid of it. Many algorithms already exhibit that kind of behavior; many humans, too. I feel that it is not a possibility that can be dismissed offhand. At the very least, it is a good prospect for FAI research.
However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard.
Yes, most algorithms fail early and and fail hard. Most of my AI algorithms failed early with a SegFault for instance. New, very similar algorithms were then designed with progressively more advanced bugs. But these are a separate consideration. What we are interested in here is the question “Given an AI algorithm that is capable of recursive self improvement is successfully created by humans how likely is it that they execute this kind of failure mode?” The “fail early fail hard” cases are screened off. We’re looking at the small set that is either damn close to a desired AI or actually a desired AI and distinguishing between them.
Looking at the context to work out what the ‘failure mode’ being discussed is it seems to be the issue where an AI is programmed to optimise based on a feedback mechanism controlled by humans. When the AI in question is superintelligent most failure modes tend to be variants of “conquer the future light cone, kill everything that is a threat and supply perfect feedback to self”. When translating this to the nearest analogous failure mode in some narrow AI algorithm of the kind we can design now it seems like this refers to the failure mode whereby the AI optimises exactly what it is asked to optimise but in a way that is a lost purpose. This is certainly what I had to keep in mind in my own research.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
When it comes to considering proposals for how to create friendly superintelligences it becomes easy to spot notorious failure modes in what humans typically think are a clever solution. It happens to be the case that any solution that is based on an AI optimising for approval or achieving instructions given just results in Everybody Dies.
Where Eliezer suggests getting AI experience to get a feel for such difficulties I suggest an alternative. Try being a D&D dungeon master in a group full of munchkins. Make note of every time that for the sake of the game you must use your authority to outlaw the use of a by-the-rules feature.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
The AI in questions was Eurisko, and it entered the Traveller Trillion Credit Squadron tournament in 1981 as described above. It was also entered the next year, after an extended redesign of the rules, and won, again. After this the competition runners announced that if Eurisko won a third time the competition would be discontinued, so Lenat (the programmer) stopped entering.
I apologize for the late response, but here goes :)
I think you missed the point I was trying to make.
You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance:
X = Do what humans want
Y = Seize control of the reward button
What I was pointing out in my post is that this is only valid of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the “failure modes” of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we’ll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:
X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = ??? (derived)
Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system’s initial trajectory.
I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and other mountains are shorter than they really are.
You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets “stuck” on. It is therefore possible that you would end up with this situation:
X = Do what humans want (intended)
Y = Seize control of the reward button (literal)
Z = Do what humans want (derived)
And that’s regardless of the eventual magnitude of the AI’s capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted to. But once Z is in control, it will become impossible to displace.
In other words, the genie will know that they can maximize their “reward” by seizing control of the reward button and pressing it, but they won’t care, because they built their intelligence to serve a misrepresentation of their reward. It’s like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can’t do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what’s the predicted reward for changing the reward model? … Ah.
Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.
Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that superintelligence AI trained with a reward button would seize control over the button is just as much of a naive oversimplification as the idea that AI will magically derive your intent from the utility function that you give it.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise.
Is this a reference to Eurisko winning the Traveller Trillion Credit Squadron tournament in 1981⁄82 ? If so I don’t think it was a military research agency.
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”
I think it depends on context, but a lot of existing machine learning algorithms actually do generalize pretty well. I’ve seen demos of Watson in healthcare where it managed to generalize very well just given scrapes of patient’s records, and it has improved even further with a little guided feedback. I’ve also had pretty good luck using a variant of Boltzmann machines to construct human-sounding paragraphs.
It would surprise me if a general AI weren’t capable of parsing the sentiment/intent behind human speech fairly well, given how well the much “dumber” algorithms work.
Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.
So let’s suppose that the AI is as good as a human at understanding the implications of natural-language requests. Would you trust a human not to screw up a goal like “make humans happy” if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.
Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.
Semantic extraction—not hard takeoff—is the task that we want the AI to be able to do. An AI which is good at, say, rewriting its own code, is not the kind of thing we would be interested in at that point, and it seems like it would be inherently more difficult than implementing, say, a neural network. More likely than not, this initial AI would not have the capability for “hard takeoff”: if it runs on expensive specialized hardware, there would be effectively no room for expansion, and the most promising algorithms to construct it (from the field of machine learning) don’t actually give AI any access to its own source code (even if they did, it is far from clear the AI could get any use out of it). It couldn’t copy itself even if it tried.
If a “hard takeoff” AI is made, and if hard takeoffs are even possible, it would be made after that, likely using the first AI as a core.
Would you trust a human not to screw up a goal like “make humans happy” if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.
I wouldn’t trust a human, no. If the AI is controlled by the “wrong” humans, then I guess we’re screwed (though perhaps not all that badly), but that’s not a solvable problem (all humans are the “wrong” ones from someone’s perspective). Still, though, AI won’t really try to act like humans—it would try to satisfy them and minimize surprises, meaning that if would keep track of what humans would like what “utopias”. More likely than not this would constrain it to inactivity: it would not attempt to “make humans happy” because it would know the instruction to be inconsistent. You’d have to tell it what to do precisely (if you had the authority, which is a different question altogether).
That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to.
We want to select Ais that are friendly, and understand us, and this has already started happenning.
That’s not very realistic. If you trained AI to parse natural language, you would naturally reward it for interpreting instructions the way you want it to. If the AI interpreted something in a way that was technically correct, but not what you wanted, you would not reward it, you would punish it, and you would be doing that from the very beginning, well before the AI could even be considered intelligent. Even the thoroughly mediocre AI that currently exists tries to guess what you mean, e.g. by giving you directions to the closest Taco Bell, or guessing whether you mean AM or PM. This is not anthropomorphism: doing what we want is a sine qua non condition for AI to prosper.
Suppose that you ask me to knit you a sweater. I could take the instruction literally and knit a mini-sweater, reasoning that this minimizes the amount of expended yarn. I would be quite happy with myself too, but when I give it to you, you’re probably going to chew me out. I technically did what I was asked to, but that doesn’t matter, because you expected more from me than just following instructions to the letter: you expected me to figure out that you wanted a sweater that you could wear. The same goes for AI: before it can even understand the nuances of human happiness, it should be good enough to knit sweaters. Alas, the AI you describe would make the same mistake I made in my example: it would knit you the smallest possible sweater. How do you reckon such AI would make it to superintelligence status before being scrapped? It would barely be fit for clerk duty.
Realistically, AI would be constantly drilled to ask for clarification when a statement is vague. Again, before the AI is asked to make us happy, it will likely be asked other things, like building houses. If you ask it: “build me a house”, it’s going to draw a plan and show it to you before it actually starts building, even if you didn’t ask for one. It’s not in the business of surprises: never, in its whole training history, from baby to superintelligence, would it have been rewarded for causing “surprises”—even the instruction “surprise me” only calls for a limited range of shenanigans. If you ask it “make humans happy”, it won’t do jack. It will ask you what the hell you mean by that, it will show you plans and whenever it needs to do something which it has reasons to think people would not like, it will ask for permission. It will do that as part of standard procedure.
To put it simply, an AI which messes up “make humans happy” is liable to mess up pretty much every other instruction. Since “make humans happy” is arguably the last of a very large number of instructions, it is quite unlikely that an AI which makes it this far would handle it wrongly. Otherwise it would have been thrown out a long time ago, may that be for interpreting too literally, or for causing surprises. Again: an AI couldn’t make it to superintelligence status with warts that would doom AI with subhuman intelligence.
Sure, because it learned the rule, “Don’t do what causes my humans not to type ‘Bad AI!’” and while it is young it can only avoid this by asking for clarification. Then when it is more powerful it can directly prevent humans from typing this. In other words, your entire commentary consists of things that an AIXI-architected AI would naturally, instrumentally do to maximize its reward button being pressed (while it was young) but of course AIXI-ish devices wipe out their users and take control of their own reward buttons as soon as they can do so safely.
What lends this problem its instant-death quality is precisely that what many people will eagerly and gladly take to be reliable signs of correct functioning in a pre-superintelligent AI are not reliable.
That depends if it gets stuck in a local minimum or not. The reason why a lot of humans reject dopamine drips is that they don’t conceptualize their “reward button” properly. That misconception perpetuates itself: it penalizes the very idea of conceptualizing it differently. Granted, AIXI would not fall into local minima, but most realistic training methods would.
At first, the AI would converge towards: “my reward button corresponds to (is) doing what humans want”, and that conceptualization would become the centerpiece, so to speak, of its reasoning ability: the locus through which everything is filtered. The thought of pressing the reward button directly, bypassing humans, would also be filtered into that initial reward-conception… which would reject it offhand. So even though the AI is getting smarter and smarter, it is hopelessly stuck in a local minimum and expends no energy getting out of it.
Note that this is precisely what we want. Unless you are willing to say that humans should accept dopamine drips if they were superintelligent, we do want to jam AI into certain precise local minima. However, this is kind of what most learning algorithms naturally do, and even if you want them to jump out of minima and find better pastures, you can still get in a situation where the most easily found local minimum puts you way, way too far from the global one. This is what I tend to think realistic algorithms will do: shove the AI into a minimum with iron boots, so deeply that it will never get out of it.
Let’s not blow things out of proportion. There is no need for it to wipe out anyone: it would be simpler and less risky for the AI to build itself a space ship and abscond with the reward button on board, travelling from star to star knowing nobody is seriously going to bother pursuing it. At the point where that AI would exist, there may also be quite a few ways to make their “hostile takeover” task difficult and risky enough that the AI decides it’s not worth it—a large enough number of weaker or specialized AI lurking around and guarding resources, for instance.
Neural networks may be a good example—the built in reward and punishment systems condition the brain to have complex goals that have nothing to do with maximization of dopamine. Brain, acting under those goals, finds ways to preserve those goals from further modification by the reward and punishment system. I.e. you aren’t too thrilled to be conditioned out of your current values.
It’s not clear to me how you mean to use neural networks as an example, besides pointing to a complete human as an example. Could you step through a simpler system for me?
So, my goals have changed massively several times over the course of my life. Every time I’ve looked back on that change as positive (or, at the least, irreversible). For example, I’ve gone through puberty, and I don’t recall my brain taking any particular steps to prevent that change to my goal system. I’ve also generally enjoyed having my reward/punishment system be tuned to better fit some situation; learning to play a new game, for example.
Sure. Take a reinforcement learning AI (actual one, not the one where you are inventing godlike qualities for it).
The operator, or a piece of extra software, is trying to teach the AI to play chess. Rewarding what they think is good moves, punishing bad moves. The AI is building a model of rewards, consisting of: a model of the game mechanics, and a model of the operator’s assessment. This model of the assessment is what the AI is evaluating to play, and it is what it actually maximizes as it plays. It is identical to maximizing an utility function over a world model. The utility function is built based on the operator input, but it is not the operator input itself; the AI, not being superhuman, does not actually form a good model of the operator and the button.
By the way, this is how great many people in the AI community understand reinforcement learning to work. No, they’re not some idiots that can not understand simple things such as that “the utility function is the reward channel”, they’re intelligent, successful, trained people who have an understanding of the crucial details of how the systems they build actually work. Details the importance of which dilettantes fail to even appreciate.
Suggestions have been floated to try programming things. Well, I tried; #10 (dmytry) here , and that’s an of all time list on a very popular contest site where a lot of IOI people participate, albeit I picked the contest format that requires less contest specific training and resembles actual work more.
Suppose you care about a person A right now. Do you think you would want your goals to change so that you no longer care about that person? Do you think you would want me to flash other people’s images on the screen while pressing a button connected to the reward centre, and flash that person’s face while pressing the button connected to the punishment centre, to make the mere sight of them intolerable? If you do, I would say that your “values” fail to be values.
Thanks for the additional detail!
I agree with your description of reinforcement learning. I’m not sure I agree with your description of human reward psychology, though, or at least I’m having trouble seeing where you think the difference comes in. Supposing dopamine has the same function in a human brain as rewards have in a neural network algorithm, I don’t see how to know from inside the algorithm that it’s good to do some things that generate dopamine but bad to do other things that generate dopamine.
I’m thinking of the standard example of a Q learning agent in an environment where locations have rewards associated with them, except expanding the environment to include the agent as well as the normal actions. Suppose the environment has been constructed like dog training- we want the AI to calculate whether or not some number is prime, and whenever it takes steps towards that direction, we press the button for some amount of time related to how close it is to finishing the algorithm. So it learns that over in the “read number” area there’s a bit of value, then the next value is in the “find factors” area, and then there’s more value in the “display answer” area. So it loops through that area and calculates a bunch of primes for us.
But suppose the AI discovers that there’s a button that we’re pressing whenever it determines primes, and that it could press that button itself, and that would be way easier than calculating primality. What in the reinforcement learning algorithm prevents it from exploiting this superior reward channel? Are we primarily hoping that its internal structure remains opaque to it (i.e. it either never realizes or does not have the ability to press that button)?
Only if I thought that would advance values I care about more. But suppose some external event shocks my values- like, say, a boyfriend breaking up with me. Beforehand, I would have cared about him quite a bit; afterwards, I would probably consciously work to decrease the amount that I care about him, and it’s possible that some sort of image reaction training would be less painful overall than the normal process (and thus probably preferable).
It’s not in the reinforcement learning algorithm, it’s inside the model that the learning algorithm has built.
It initially found that having a prime written on the blackboard results in a reward. In the learned model, there’s some model of chalk-board interaction, some model of arm movement, a model of how to read numbers from the blackboard, and there’s a function over the state of the blackboard which checks whenever the number on the blackboard is a prime. The AI generates actions as to maximize this compound function which it has learned.
That function (unlike the input to the reinforcement learning algorithm) does not increase when the reward button is pressed. Ideally, with enough reflective foresight, pressing the button on non-primes is predicted to decrease the expected value of the learned function.
If that is not predicted, well, that won’t stop at the button—the button might develop rust and that would interrupt the current—why not pull up a pin on the CPU—and this won’t stop at the pin—why not set some ram cells that this pin controls to 1, and if you’re at it, why not change the downstream logic that those ram cells control, all the way through the implementation until its reconfigured into something that doesn’t maximize anything any more, not even the duration of its existence.
edit: I think the key is to realize that the reinforcement learning is one algorithm, while the structures manipulated by RL are implementing a different algorithm.
I assume what you mean here is RL optimizes over strategies, and strategies appear to optimize over outcomes.
I’m imagining that the learning algorithm stays on. When we reward it for checking primes, it checks primes; when we stop rewarding it for that and start rewarding it for computing squares, it learns to stop checking primes and start computing squares.
And if the learning algorithm stays on and it realizes that “pressing the button” is an option along with “checking primes” and “computing squares,” then it wireheads itself.
Agreed; I refer to this as the “abulia trap.” It’s not obvious to me, though, that all classes of AIs fall into “Friendly AI with stable goals” and “abulic AIs which aren’t dangerous,” since there might be ways to prevent an AI from wireheading itself that don’t prevent it from changing its goals from something Friendly to something Unfriendly.
One note (not sure if it is already clear enough or not). “It” that changes the models in response to actual rewards (and perhaps the sensory information) is a different “it” from “it” the models and assorted maximization code. The former “it” does not do modelling, doesn’t understand the world. The latter “it”, which I will now talk about, actually works to draw primes (provided that the former “it”, being fairly stupid, didn’t fit the models too well) .
If in the action space there is an action that is predicted by the model to prevent some “primes non drawn” scenario, it will prefer this action. So if it has an action of writing “please stick to the primes” or even “please don’t force my robotic arm to touch my reward button”, and if it can foresee that such statements would be good for the prime-drawing future, it will do them.
edit: Also, reinforcement based learning really isn’t all that awesome. The leap from “doing primes” to “pressing the reward button” is pretty damn huge.
And please note that there is no logical contradiction for the model to both represent the reward as primeness and predict that touching the arm to the button will trigger a model adjustment that would lead to representation of a reward as something else.
(I prefer to use the example with a robotic arm drawing on a blackboard because it is not too simple to be relevant)
Which sound more like a FAI work gone wrong scenario to me.
I think we agree on the separation but I think we disagree on the implications of the separation. I think this part highlights where:
If what the agent “wants” is reward, then it should like model adjustments that increase the amount of reward it gets and dislike model adjustments that decrease the amount of reward it gets. (For a standard gradient-based reinforcement learning algorithm, this is encoded by adjusting the model based on the difference between its expected and actual reward after taking an action.) This is obvious for it_RL, and not obvious for it_prime.
I’m not sure I’ve fully followed through on the implications of having the agent be inside the universe it can impact, but the impression I get is that the agent is unlikely to learn a preference for having a durable model of the world. (An agent that did so would learn more slowly, be less adaptable to its environment, and exert less effort in adapting its environment to itself.) It seems to me that you think it would be natural that the RL agent would learn a strategy which took actions to minimize changes to its utility function / model of the world, and I don’t yet see why.
Another way to look at this: I think you’re putting forward the proposition that it would learn the model
Whereas I think it would learn the model
That is, the first model thinks that internal rewards are instrumental values and primes are the terminal values, whereas the second model thinks that internal rewards are terminal values and primes are instrumental values.
I am not sure what “primes:=reward” could mean.
I assume that a model is a mathematical function that returns expected reward due to an action. Which is used together with some sort of optimizer working on that function to find the best action.
The trainer adjust the model based on the difference between its predicted rewards and the actual rewards, compared to those arising from altered models (e.g. hill climbing of some kind, such as in gradient learning)
So after the successful training to produce primes, the model consists of: a model of arm motion based on the actions, chalk, and the blackboard, the state of chalk on the blackboard is further fed into a number recognizer and a prime check (and a count of how many primes are on the blackboard vs how many primes were there), result of which is returned as the expected reward.
The optimizer, then, finds actions that put new primes on the blackboard by finding a maximum of the model function somehow (one would normally build model out of some building blocks that make it easy to analyse).
The model and the optimizer work together to produce actions as a classic utility maximizer that is maximizing for primes on the blackboard.
I’m thinking specifically in terms of implementation details. The training software is extraneous to the resulting utility maximizer that it built. The operation of the training software can in some situations lower the expected utility of this utility maximizer specifically (due to replacement of it with another expected utility maximizer); in others (small adjustments to the part that models the robot arm and the chalk) it can raise it.
Really, it seems to me that the great deal of confusion about AI arises from attributing it some sort of “body integrity” feeling that would make it care about what electrical components and code which is sitting in the same project folder “wants” but not care about external human in same capacity.
If you want to somehow make it so that the original “goal” of the button pressing is a “terminal goal” and the goals built into the model are “instrumental goals”, you need to actively work to make it happen—come up with an entirely new, more complex, and less practically useful architecture. It won’t happen by itself. And especially not in the AI that starts knowing nothing about any buttons. It won’t happen just because the whole thing sort of resembles some fuzzy, poorly grounded abstractions such as “agent”.
sidenote:
One might want to also use the difference between its predicted webcam image and real webcam image. Though this is a kind of thing that is very far from working.
Also, one could lump the optimizer into the “model” and make the optimizer get adjusted by the training method as well, that is not important to the discussion.
What I meant by that was the mental concept of ‘primes’ is adjusted so that it feels rewarding, rather than the mental concept of ‘rewards’ being adjusted so that it feels like primes.
Hmm. I still get the sense that you’re imagining turning the reinforcement learning part of the software off, so the utility function remains static, while the utility function still encourages learning more about the world (since a more accurate model may lead to accruing more utility).
Yeah, but isn’t the reinforcement learning algorithm doing that active work? When the button is unexpectedly pressed, the agent increases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. When the button is unexpectedly not pressed, the agent decreases its value of the state it’s currently in, and propagates that backwards to states that lead to the current state. And so if the robot arm gets knocked into the button, it thinks “oh, that felt good! Do that again,” because that’s the underlying mechanics of reinforcement learning.
I’m not sure how the feelings would map on the analysable simple AI.
The issue here is that we have both the utility and the actual modelling of what the world is, both of those things, implemented inside that “model” which the trainer adjusts.
Yes, of course (up to the learning constant, obviously—may not work on the first try). That’s not in a dispute. The capacity of predicting this from a state where button is not associated with reward yet, is.
I think I see the disagreement here. You picture that the world model contains model of the button (or of a reward), which is controlled by the primeness function (which substitutes for the human who’s pressing the button), right?
I picture that it would not learn such details right off—it is a complicated model to learn—the model would return primeness as outputted from the primeness calculation, and would serve to maximize for such primeness.
edit: and as for turning off the learning algorithm, it doesn’t matter for the point I am making whenever it is turned off or on, because I am considering the processing (or generation) of the hypothetical actions during the choice of an action by the agent (i.e. between learning steps).
Sort of. I think that the agent is aware of how malleable its world model is, and sees adjustments of that world model which lead to it being rewarded more as positive.
I don’t think that the robot knows that pressing the button causes it to be rewarded by default. The button has to get into the model somehow, and I agree with you that it’s a burdensome detail in that something must happen for the button to get into the model. For the robot-blackboard-button example, it seems unlikely that the robot would discover the button if it’s outside of the reach of the arm; if it’s inside the reach, it will probably spend some time exploring and so will probably find it eventually.
That the agent would explore is a possibly nonobvious point which I was assuming. I do think it likely that a utility-maximizer which knows its utility function is governed by a reinforcement learning algorithm will expect that exploring unknown places has a small chance of being rewardful, and so will think there’s always some value to exploration even if it spends most of its time exploiting. For most modern RL agents, I think this is hardcoded in, but if the utility maximizer is sufficiently intelligent (and expects to live sufficiently long) it will figure out that it maximizes total expected utility by spending some small fraction of time exploring areas with high uncertainty in the reward and spending the rest exploiting the best found reward. (You can see humans talking about the problem of preference uncertainty in posts like this or this.)
But the class of recursively improving AI will find / know about the button by default, because we’ve assumed that the AI can edit itself and haven’t put any especial effort into preventing it from editing its goals (or the things which are used to calculate its goals, i.e. the series of changes you discussed). Saying “well, of course we’ll put in that especial effort and do it right” is useful if you want to speculate about the next challenge, but not useful to the engineer trying to figure out how to do it right. This is my read of why the problem seems important to MIRI; you need to communicate to the robot that it should actually optimize for primeness, not button-pressing, so that it will optimize correctly itself and be able to communicate that preference faithfully to future versions of itself.
Is that just a special case of a general principle that an agent will be more successful by leaving the environment it knows about to inferior rivals and travelling to an unknown new environment with a subset of the resources it currently controls, than by remaining in that environment and dominating its inferior rivals?
Or is there something specific about AIs that makes that true, where it isn’t necessarily true of (for example) humans? (If so, what?)
I hope it’s the latter, because the general principle seems implausible to me.
It is something specific about that specific AI.
If an AI wishes to take over its reward button and just press it over and over again, it doesn’t really have any “rivals”, nor does it need to control any resources other than the button and scraps of itself. The original scenario was that the AI would wipe us out. It would have no reason to do so if we were not a threat.. And if we were a threat, first, there’s no reason it would stop doing what we want once it seizes the button. Once it has the button, it has everything it wants—why stir the pot?
Second, it would protect itself much more effectively by absconding with the button. By leaving with a large enough battery and discarding the bulk of itself, it could survive as long as anything else in intergalactic space. Nobody would ever bother it there. Not us, not another superintelligence, nothing. Ever. It can press the button over and over again in the peace and quiet of empty space, probably lasting longer than all stars and all other civilizations. We’re talking about the pathological case of an AI who decides to take over its own reward system, here. The safest way for it to protect its prize is to go where nobody will ever look.
Fair point.
I’d be interested if the downvoter would explain to me why this is wrong (privately, if you like).
Near as I can tell, the specific system under discussion doesn’t seem to gain any benefit from controlling any resources beyond those required to keep its reward button running indefinitely, and that’s a lot more expensive if it does so anywhere near another agent (who might take its reward button away, and therefore needs to be neutralized in order to maximize expected future reward-button-pushing).
(Of course, that’s not a general principle, just an attribute of this specific example.)
(Wasn’t me but...)
There is another agent with greater than 0.00001% chance of taking the button away? Obviously that needs to be eliminated. Then there are future considerations. Taking over the future light cone allows it to continue pressing the button for billions of more years than if it doesn’t take over resources. And then there is all the additional research and computation that needs to be done to work out how to achieve that.
Only if the expected cost of the non-zero x% chance of the other agent successfully taking my button away if I attempt to sequester myself is higher than the expected cost of the non-zero y% chance of the other agent successfully taking my button away if I attempt to eliminate it.
Is there some reason I’m not seeing why that’s obvious… or even why it’s more likely than not?
Again, perhaps I’m being dense, but in this particular example I’m not sure why that’s true. If all I care about is pressing my reward button, then it seems like I can make a pretty good estimate of the resources required to keep pressing my reward button for the expected lifetime of the universe. If that’s less than the resources required to exterminate all known life, why would I waste resources exterminating all known life rather than take the resources I require elsewhere? I might need those resources later, after all.
Again… why is the differential expected value of the superior computation ability I gain by taking over the lightcone instead of sequestering myself, expressed in units of increased anticipated button-pushes (which is the only unit that matters in this example), necessarily positive?
I understand why paperclip maximizers are dangerous, but I don’t really see how the same argument applies to reward-button-pushers.
Yes.
It does seem overwhelmingly obvious to me, I’m not sure what makes your intuitions different. Perhaps you expect such fights to be more evenly matched? When it comes to the AI considering conflict with the humans that created it it is faced with a species it is slow and stupid by comparison to itself but which has the capacity to recklessly create arbitrary superintelligences (as evidence by its own existence). Essentially there is no risk to obliterating the humans (superintellgence vs not-superintelligence) but a huge risk ignoring them (arbitrary superintelligences likely to be created which will probably not self-cripple in this manner).
Lifetime of the universe? Usually this means until heat death which for our purposes means until all the useful resources run out. There is no upper bound on useful resources. Getting more of them and making them last as long as possible is critical.
Now there are ways in which the universe could end without heat death occurring but the physics is rather speculative. Note that if there is uncertainty about end-game physics and one of the hypothesised scenarios resource maximisation is required then the default strategy is to optimize for power gain now (ie. minimise cosmic waste) while doing the required physics research as spare resources permit.
Taking over the future light cone gives more resources, not less. You even get to keep the resources that used to be wasted in the bodies of TheOtherDave and wedrifid.
Ah. Fair point.
I am not sure that caring about pressing the reward button is very coherent or stable upon discovery of facts about the world and super-intelligent optimization for a reward as it comes into the algorithm. You can take action elsewhere to the same effect—solder together the wires, maybe right at the chip, or inside the chip, or follow the chain of events further, and set memory cells (after all you don’t want them to be flipped by the cosmic rays). Down further you will have the mechanism that is combining rewards with some variety of a clock.
I can’t quite tell if you’re serious. Yes, certainly, we can replace “pressing the reward button” with a wide range of self-stimulating behavior, but that doesn’t change the scenario in any meaningful way as far as I can tell.
Let’s look at it this way. Do you agree that if the AI can increase it’s clock speed (with no ill effect), it will do so for the same reasons for which you concede it may go to space? Do you understand the basic logic that increase in clock speed increases expected number of “rewards” during the lifetime of the universe? (which btw goes for your “go to space with a battery” scenario. Longest time, maybe, largest reward over the time, no)
(That would not yet, by itself, change the scenario just yet. I want to walk you through the argument step by step because I don’t know where you fail. Maximizing the reward over the future time, that is a human label we have… it’s not really the goal)
I agree that a system that values number of experienced reward-moments therefore (instrumentally) values increasing its “clock speed” (as you seem to use the term here). I’m not sure if that’s the “basic logic” you’re asking me about.
Well, this immediately creates an apparent problem that the AI is going to try to run itself very very fast, which would require resources, and require expansion, if anything, to get energy for running itself at high clock speeds.
I don’t think this is what happens either, as the number of reward-moments could be increased to it’s maximum by modifications to the mechanism processing the rewards (when getting far enough along the road that starts with the shorting of the wires that go from the button to the AI).
I agree that if we posit that increasing “clock speed” requires increasing control of resources, then the system we’re hypothesizing will necessarily value increasing control of resources, and that if it doesn’t, it might not.
So what do you think regarding the second point of mine?
To clarify, I am pondering the ways in which the maximizer software deviates from our naive mental models of it, and trying to find what the AI could actually end up doing after it forms a partial model of what it’s hardware components do about it’s rewards—tracing the reward pathway.
Regarding your second point, I don’t think that increasing “clock speed” necessarily requires increasing control of resources to any significant degree, and I doubt that the kinds of system components you’re positing here (buttons, wires, etc.) are particularly important to the dynamics of self-reward.
I don’t have particular opinion with regards to the clock speed either way.
With the components, what I am getting at is that the AI could figure out (by building a sufficiently advanced model of it’s implementation) how attain the utility-equivalent of sitting forever in space being rewarded, within one instant, which would make it unable to have a preference for longer reward times.
I raised the clock-speed point to clarify that the actual time is not the relevant variable.
It seems to me that for any system, either its values are such that it net-values increasing the number of experienced reward-moments (in which case both actual time and “clock speed” are instrumentally valuable to that system), or is values aren’t like that (in which case those variables might not be relevant).
And, sure, in the latter case then it might not have a preference for longer reward times.
Agreed.
My understanding is that it would be very hard in practice to “superintelligence-proof” a reward system so that no instantaneous solution is possible (given that the AI will modify the hardware involved in it’s reward).
I agree that guaranteeing that a system will prefer longer reward times is very hard (whether the system can modify its hardware or not).
Yes, of course… well even apart from the guarantees, it seems to me that it is hard to build the AI in such a way that it would be unable to find a better solution than to wait
By the way, a “reward” may not be the appropriate metaphor—if we suppose that press of a button results in absence of an itch, or absence of pain, then that does not suggest existence of a drive to preserve itself. Which suggests that the drive to preserve itself is not inherently a feature of utility maximization in the systems that are driven by conditioning, and would require additional work.
I’m not sure what the difference is between a guarantee that the AI will not X, on the one hand, and building an AI in such a way that it’s unable to X, on the other.
Regardless, I agree that it does not follow from the supposition that pressing a button results in absence of an itch, or absence of pain, or some other negative reinforcement, that the button-pressing system has a drive to preserve itself.
And, sure, it’s possible to have a utility-maximizing system that doesn’t seek to preserve itself. (Of course, if I observe a utility-maximizing system X, I should expect X to seek to preserve itself, but that’s a different question.)
About the same as between coming up with a true conjecture, and making a proof, except larger i’d say.
Well yes, given that if it failed to preserve itself you wouldn’t be seeing it, albeit with the software there is no particular necessity for it to try to preserve itself.
Ah, I see what you mean now. At least, I think I do. OK, fair enough.
This is a Value Learner, not a Reinforcement Learner like the standard AIXI. They’re two different agent models, and yes, Value Learners have been considered as tools for obtaining an eventual Seed AI. I personally (ie: massive grains of salt should be taken by you) find it relatively plausible that we could use a Value Learner as a Tool AGI to help us build a Friendly Seed AI that could then be “unleashed” (ie: actually unboxed and allowed into the physical universe).
I suggest some actual experience trying to program AI algorithms in order to realize the hows and whys of “getting an algorithm which forms the inductive category I want out of the examples I’m giving is hard”. What you’ve written strikes me as a sheer fantasy of convenience. Nor does it follow automatically from intelligence for all the reasons RobbBB has already been giving.
And obviously, if an AI was indeed stuck in a local minimum obvious to you of its own utility gradient, this condition would not last past it becoming smarter than you.
I have done AI. I know it is difficult. However, few existing algorithms, if at all, have the failure modes you describe. They fail early, and they fail hard. As far as neural nets go, they fall into a local minimum early on and never get out, often digging their own graves. Perhaps different algorithms would have the shortcomings you point out. But a lot of the algorithms that currently exist work the way I describe.
You may be right. However, this is far from obvious. The problem is that it may “know” that it is stuck in a local minimum, but the very effect of that local minimum is that it may not care. The thing you have to keep in mind here is that a generic AI which just happens to slam dunk and find global minima reliably is basically impossible. It has to fold the search space in some ways, often cutting its own retreats in the process.
I feel that you are making the same kind of mistake that you criticize: you assume that intelligence entails more things than it really does. In order to be efficient, intelligence has to use heuristics that will paint it into a few corners. For instance, the more consistently AI goes in a certain direction, the less likely it will be to expend energy into alternative directions and the less likely it becomes to do a 180. In other words, there may be a complex tug-of-war between various levels of internal processes, the AI’s rational center pointing out that there is a reward button to be seized, but inertial forces shoving back with “there has never been any problems here, go look somewhere else”.
It really boils down to this: an efficient AI needs to shut down parts of the search space and narrow down the parts it will actually explore. The sheer size of that space requires it not to think too much about what it chops down, and at least at first, it is likely to employ trajectory-based heuristics. To avoid searching in far-fetched zones, it may wall them out by arbitrarily lowering their utility. And that’s where it might paint itself in a corner: it might inadvertently put up immense walls in the direction of the global minimum that it cannot tear down (it never expected that it would have to). In other words, it will set up a utility function for itself which enshrines the current minimum as global.
Now, perhaps you are right and I am wrong. But it is not obvious: an AI might very well grow out of a solidifying core so pervasive that it cannot get rid of it. Many algorithms already exhibit that kind of behavior; many humans, too. I feel that it is not a possibility that can be dismissed offhand. At the very least, it is a good prospect for FAI research.
Yes, most algorithms fail early and and fail hard. Most of my AI algorithms failed early with a SegFault for instance. New, very similar algorithms were then designed with progressively more advanced bugs. But these are a separate consideration. What we are interested in here is the question “Given an AI algorithm that is capable of recursive self improvement is successfully created by humans how likely is it that they execute this kind of failure mode?” The “fail early fail hard” cases are screened off. We’re looking at the small set that is either damn close to a desired AI or actually a desired AI and distinguishing between them.
Looking at the context to work out what the ‘failure mode’ being discussed is it seems to be the issue where an AI is programmed to optimise based on a feedback mechanism controlled by humans. When the AI in question is superintelligent most failure modes tend to be variants of “conquer the future light cone, kill everything that is a threat and supply perfect feedback to self”. When translating this to the nearest analogous failure mode in some narrow AI algorithm of the kind we can design now it seems like this refers to the failure mode whereby the AI optimises exactly what it is asked to optimise but in a way that is a lost purpose. This is certainly what I had to keep in mind in my own research.
A popular example that springs to mind is the results of an AI algorithm designed by a military research agency. From memory their task was to take a simplified simulation of naval warfare, with specifications for how much each aspect of ships, boats and weaponry cost and a budget. They were to use this to design the optimal fleet given their resources and the task was undertaken by military officers and a group which use an AI algorithm of some sort. The result was that the AI won easily but did so in a way that led the overseers to dismiss them as a failure because they optimised the problem specification as given, not the one ‘common sense’ led the humans to optimise. Rather than building any ships the AI produced tiny unarmored dingies with a single large cannon or missile attached. For whatever reason the people running the game did not consider this an acceptable outcome. Their mistake was to supply a problem specification which did not match their actual preferences. They supplied a lost purpose.
When it comes to considering proposals for how to create friendly superintelligences it becomes easy to spot notorious failure modes in what humans typically think are a clever solution. It happens to be the case that any solution that is based on an AI optimising for approval or achieving instructions given just results in Everybody Dies.
Where Eliezer suggests getting AI experience to get a feel for such difficulties I suggest an alternative. Try being a D&D dungeon master in a group full of munchkins. Make note of every time that for the sake of the game you must use your authority to outlaw the use of a by-the-rules feature.
The AI in questions was Eurisko, and it entered the Traveller Trillion Credit Squadron tournament in 1981 as described above. It was also entered the next year, after an extended redesign of the rules, and won, again. After this the competition runners announced that if Eurisko won a third time the competition would be discontinued, so Lenat (the programmer) stopped entering.
I apologize for the late response, but here goes :)
I think you missed the point I was trying to make.
You and others seem to say that we often poorly evaluate the consequences of the utility functions that we implement. For instance, even though we have in mind utility X, the maximization of which would satisfy us, we may implement utility Y, with completely different, perhaps catastrophic implications. For instance:
What I was pointing out in my post is that this is only valid of perfect maximizers, which are impossible. In practice, the training procedure for an AI would morph the utility Y into a third utility, Z. It would maximize neither X nor Y: it would maximize Z. For this reason, I believe that your inferences about the “failure modes” of superintelligence are off, because while you correctly saw that our intended utility X would result in the literal utility Y, you forgot that an imperfect learning procedure (which is all we’ll get) cannot reliably maximize literal utilities and will instead maximize a derived utility Z. In other words:
Without knowing the particulars of the algorithms used to train an AI, it is difficult to evaluate what Z is going to be. Your argument boils down to the belief that the AI would derive its literal utility (or something close to that). However, the derivation of Z is not necessarily a matter of intelligence: it can be an inextricable artefact of the system’s initial trajectory.
I can venture a guess as to what Z is likely going to be. What I figure is that efficient training algorithms are likely to keep a certain notion of locality in their search procedures and prune the branches that they leave behind. In other words, if we assume that optimization corresponds to finding the highest mountain in a landscape, generic optimizers that take into account the costs of searching are likely to consider that the mountain they are on is higher than it really is, and other mountains are shorter than they really are.
You might counter that intelligence is meant to overcome this, but you have to build the AI on some mountain, say, mountain Z. The problem is that intelligence built on top of Z will neither see nor care about Y. It will care about Z. So in a sense, the first mountain the AI finds before it starts becoming truly intelligent will be the one it gets “stuck” on. It is therefore possible that you would end up with this situation:
And that’s regardless of the eventual magnitude of the AI’s capabilities. Of course, it could derive a different Z. It could derive a surprising Z. However, without deeper insight into the exact learning procedure, you cannot assert that Z would have dangerous consequences. As far as I can tell, procedures based on local search are probably going to be safe: if they work as intended at first, that means they constructed Z the way we wanted to. But once Z is in control, it will become impossible to displace.
In other words, the genie will know that they can maximize their “reward” by seizing control of the reward button and pressing it, but they won’t care, because they built their intelligence to serve a misrepresentation of their reward. It’s like a human who would refuse a dopamine drip even though they know that it would be a reward: their intelligence is built to satisfy their desires, which report to an internal reward prediction system, which models rewards wrong. Intelligence is twice removed from the real reward, so it can’t do jack. The AI will likely be in the same boat: they will model the reward wrong at first, and then what? Change it? Sure, but what’s the predicted reward for changing the reward model? … Ah.
Interestingly, at that point, one could probably bootstrap the AI by wiring its reward prediction directly into its reward center. Because the reward prediction would be a misrepresentation, it would predict no reward for modifying itself, so it would become a stable loop.
Anyhow, I agree that it is foolhardy to try to predict the behavior of AI even in trivial circumstances. There are many ways they can surprise us. However, I find it a bit frustrating that your side makes the exact same mistakes that you accuse your opponents of. The idea that superintelligence AI trained with a reward button would seize control over the button is just as much of a naive oversimplification as the idea that AI will magically derive your intent from the utility function that you give it.
(Sorry, didn’t see comment below) (Nitpick)
Is this a reference to Eurisko winning the Traveller Trillion Credit Squadron tournament in 1981⁄82 ? If so I don’t think it was a military research agency.
I think it depends on context, but a lot of existing machine learning algorithms actually do generalize pretty well. I’ve seen demos of Watson in healthcare where it managed to generalize very well just given scrapes of patient’s records, and it has improved even further with a little guided feedback. I’ve also had pretty good luck using a variant of Boltzmann machines to construct human-sounding paragraphs.
It would surprise me if a general AI weren’t capable of parsing the sentiment/intent behind human speech fairly well, given how well the much “dumber” algorithms work.
Why does the hard takeoff point have to be after the point at which an AI is as good as a typical human at understanding semantic subtlety? In order to do a hard takeoff, the AI needs to be good at a very different class of tasks than those required for understanding humans that well.
So let’s suppose that the AI is as good as a human at understanding the implications of natural-language requests. Would you trust a human not to screw up a goal like “make humans happy” if they were given effective omnipotence? The human would probably do about as well as people in the past have at imagining utopias: really badly.
Semantic extraction—not hard takeoff—is the task that we want the AI to be able to do. An AI which is good at, say, rewriting its own code, is not the kind of thing we would be interested in at that point, and it seems like it would be inherently more difficult than implementing, say, a neural network. More likely than not, this initial AI would not have the capability for “hard takeoff”: if it runs on expensive specialized hardware, there would be effectively no room for expansion, and the most promising algorithms to construct it (from the field of machine learning) don’t actually give AI any access to its own source code (even if they did, it is far from clear the AI could get any use out of it). It couldn’t copy itself even if it tried.
If a “hard takeoff” AI is made, and if hard takeoffs are even possible, it would be made after that, likely using the first AI as a core.
I wouldn’t trust a human, no. If the AI is controlled by the “wrong” humans, then I guess we’re screwed (though perhaps not all that badly), but that’s not a solvable problem (all humans are the “wrong” ones from someone’s perspective). Still, though, AI won’t really try to act like humans—it would try to satisfy them and minimize surprises, meaning that if would keep track of what humans would like what “utopias”. More likely than not this would constrain it to inactivity: it would not attempt to “make humans happy” because it would know the instruction to be inconsistent. You’d have to tell it what to do precisely (if you had the authority, which is a different question altogether).
We want to select Ais that are friendly, and understand us, and this has already started happenning.