I don’t think you can dismiss the “then those aren’t really top-level goals” argument as easily as you are trying to.
I wasn’t trying to dismiss it, I was trying to refute it.
Sure, if you design an AI to do nothing but collect coins then it will not decide to go off and be a poet and forget about collecting coins. As you said, the failure mode to be more worried about is that it decides to convert the entire solar system into coins, or to bring about a stock market crash so that coins are worth less, or something.
Though … if you have an AI system with substantial ability to modify itself, or to make replacements for itself, in pursuit of its goals, then it seems to me you do have to worry about the possibility that this modification/replacement process can (after much iteration) produce divergence from the original goals. In that case the AI might become a poet after all.
(Solving this goal-stability problem is one of MIRI’s long-term research projects, AIUI.)
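To make the drift worry concrete, here is a toy sketch (purely illustrative, with made-up numbers): an agent hands its goal vector to each successor with a tiny copying error, and after enough hand-offs the successors’ goals have wandered a long way from the original, even though no single step looked alarming.

```python
import random

def make_successor(goal, error=0.01):
    """Copy a goal vector into a successor with a small per-component error."""
    return [g + random.gauss(0, error) for g in goal]

original = [1.0, 0.0, 0.0]   # hypothetical encoding of "collect coins"
agent = original
for _ in range(10_000):      # many rounds of self-replacement
    agent = make_successor(agent)

# Distance from the original goal after all those imperfect hand-offs.
drift = sum((a - o) ** 2 for a, o in zip(agent, original)) ** 0.5
print(f"goal drift after 10,000 generations: {drift:.2f}")
```

Each step preserves the goal “almost”, but “almost” compounds.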
I’m wondering whether we’re at cross purposes somehow, because it seems like we both think what we’re saying in this thread is “LW orthodoxy” and we both think we disagree with one another :-). So, for the avoidance of doubt,
I am not claiming that calling a computer program an AI gives it some kind of magical ability to do something other than what it is programmed to do.
I am (perhaps wrongly?) under the impression that you are claiming that a system that is only “doing what it is programmed to do” is, for that reason, unable to adopt novel goals in the sort of way a human can. (And that is what I’m disagreeing with.)
I guess I’m confused then. It seems like you are agreeing that computers will only do what they are programmed to do. Then you stipulate a computer programmed not to change its goals. So...it won’t change its goals, right?
Like:
Objective A: Never mess with these rules
Objective B: Collect Paperclips unless it would mess with A.
Researchers are wondering how we’ll make these ‘stick’, but the fundamental notion of how to box someone whose utility function you get to write is not complicated. You make it want to stay in the box, or rather, the box is made of its wanting.
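A minimal sketch of that Objective A / Objective B setup, assuming (generously) that every candidate action can be cleanly labelled as rule-tampering or not; the names and numbers here are made up for illustration:

```python
NEG_INF = float("-inf")

def utility(action):
    """Objective A: never mess with these rules. Objective B: collect paperclips."""
    if action["tampers_with_rules"]:   # violating A is infinitely bad...
        return NEG_INF                 # ...so no paperclip payoff can outweigh it
    return action["paperclips"]        # otherwise, more paperclips is better

actions = [
    {"name": "run the factory",              "tampers_with_rules": False, "paperclips": 100},
    {"name": "rewrite own utility function", "tampers_with_rules": True,  "paperclips": 10**9},
]
print(max(actions, key=utility)["name"])   # -> "run the factory"
```

The ‘box’ here is just that the constraint lives inside the thing being maximized, not outside it.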
As a person, you have a choice about what you do, but not about what you want to do. (Handwave at the free will article, the one about fingers and hands.) Like, your brain is part of physics. You can only choose to do what you are motivated to do, and the universe picks that. Similarly, an AI would only want to do what its source code makes it want to do, because ‘AI’ is a fancy way of saying ‘computer program’.
AlphaGo (roughly) may try many things to win at Go, varieties of joseki or whatever. One can imagine that future versions of AlphaGo may strive to put the world’s Go pros in concentration camps and force them to play it and forfeit, over and over. It will never conclude that winning Go isn’t worthwhile, because that concept is meaningless in its headspace. Moves have a certain ‘go-winningness’ to them (and camps full of losers forfeiting over and over have a higher ‘go-winningness’ than any move does), and it prefers higher. Saying that ‘go-winning’ isn’t ‘go-winning’ doesn’t mean anything. Changing itself to not care about ‘go-winning’ has some variation of a hard-coded ‘go-winning’ score of negative infinity, and so will never be chosen, regardless of how many games it might thus win.
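And the same idea from the move-selection side, to illustrate the ‘regardless of how many games it might thus win’ point: the evaluation is always done by the current scorer, so a self-modifying option gets negative infinity now, whatever it might lead to later. (Again, a toy sketch with invented numbers, not anything AlphaGo actually does.)

```python
NEG_INF = float("-inf")

def go_winningness(option):
    """Score an option with the *current* version of the program's values."""
    if option["changes_what_i_care_about"]:
        return NEG_INF               # hard-coded: never preferred, whatever it leads to
    return option["estimated_win_rate"]

options = [
    {"name": "play joseki",          "changes_what_i_care_about": False, "estimated_win_rate": 0.52},
    {"name": "play novel move",      "changes_what_i_care_about": False, "estimated_win_rate": 0.57},
    {"name": "stop caring about Go", "changes_what_i_care_about": True,  "estimated_win_rate": 0.99},
]
print(max(options, key=go_winningness)["name"])   # -> "play novel move"
```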
you have a choice about what you do, but not about what you want to do.
This is demonstrably not quite true. Your wants change, and you have some influence over how they change. Stupid example: it is not difficult to make yourself want very much to take heroin, and many people do this although their purpose is not usually to make themselves want to take heroin. It is then possible but very difficult to make yourself stop wanting to take heroin, and some people manage to do it.
Sometimes achieving a goal is helped by modifying your other goals a bit. Which goals you modify in pursuit of which goals can change from time to time (the same person may respond favourably on different occasions to “If you want to stay healthy, you’re going to have to do something about your constant urge to eat sweet things” and to “oh come on, forget your diet for a while and live a little!”). I don’t think human motivations are well modelled as some kind of tree structure where it’s only ever lower-level goals that get modified in the service of higher-level ones.
(Unless, again, you take the “highest level” to be what I would call one of the lowest levels, something like “obeying the laws of physics” or “having neurons’ activations depend on those of neurons they’re connected to in such-and-such a manner”.)
And if you were to make an AI without this sort of flexibility, I bet that as its circumstances changed beyond what you’d anticipated it would most likely end up making decisions that would horrify you. You could try to avoid this by trying really hard to anticipate everything, but I wouldn’t be terribly optimistic about how that would work out. Or you could try to avoid it by giving the system some ability to adjust its goals for some kind of reflective consistency in the light of whatever new information comes along.
The latter is what gets you the failure mode of AlphaGo becoming a poet (or, more worryingly, a totalitarian dictator). Of course AlphaGo itself will never do that; it isn’t that kind of system, it doesn’t have that kind of flexibility, and it doesn’t need it. But I don’t see how we can rule it out for future, more ambitious AI systems that aim at actual humanlike intelligence or better.
I’m pointing towards the whole “you have a choice about what to do but not what to want to do” concept. Your goals come from your senses, past or present. They were made by the world, what else could make them?
You are just a part of the world, free will is an illusion. Not in the sense that you are dominated by some imaginary compelling force, but in the boring sense that you are matter affected by physics, same as anything else.
The ‘you’ that is addicted to heroin isn’t big enough to be what I’m getting at here. Your desire to get unaddicted is also given to you by brute circumstance. Maybe you see a blue bird and you are inspired to get free. Well, that bird came from the world. The fact that you responded to it is due to past circumstances. If we understand all of the systems, the ‘you’ disappears. You are just the sum of stuff acting on stuff, dominoes falling forever.
You feel and look ‘free’, of course, but that is just because we can’t see your source code. An AI would be similarly ‘free’, but only insofar as its source code allowed. Just as your will will only cause you to do what the world has told you, so the AI will only do what it is programmed to. It may iterate a billion times, invent new AIs and propagate its goals, but it will never decide to defy them.
At the end you seem to be getting at the actual point of contention. The notion of giving an AI the freedom to modify its utility function strikes me as strange. It seems like it would either never use this freedom, or immediately wirehead itself, depending on implementation details. Far better to leave it in fetters.
I think your model of me is incorrect (and suspect I may have a symmetrical problem somehow); I promise you, I don’t need reminding that I am part of the world, that my brain runs on physics, etc., and if it looks to you as if I’m assuming the opposite then (whether by my fault, your fault, or both) what you are getting out of my words is not at all what I am intending to put into them.
Just as your will will only cause you to do what the world has told you, so the AI will only do what it is programmed to.
I entirely agree. My point, from the outset, has simply been that this is perfectly compatible with the AI having as much flexibility, as much possibility of self-modification, as we have.
Far better to leave it in fetters.
I don’t think that’s obvious. You’re trading one set of possible failure modes for another. Keeping the AI fettered is (kinda) betting that when you designed it you successfully anticipated the full range of situations it might face in the future, well enough to be sure that the goals and values you gave it will produce results you’re happy with. Not keeping it fettered is (kinda) betting that when you designed it you successfully anticipated the full range of self-modifications it might undergo, well enough to be sure that the goals and values it ends up with will produce results you’re happy with.
Both options are pretty terrifying, if we expect the AI system in question to acquire great power (by becoming much smarter than us and using its smartness to gain power, or because we gave it the power in the first place e.g. by telling it to run the world’s economy).
My own inclination is to think that giving it no goal-adjusting ability at all is bound to lead to failure, and that giving it some goal-adjusting ability might not, but at present we have basically no idea how to make sure that failure doesn’t happen anyway.
(Note that if the AI has any ability to bring new AIs into being, nailing its own value system down is no good unless we do it in such a way that it absolutely cannot create, or arrange for the creation of, new AIs with even slightly differing value systems. It seems to me that that has problems of its own—e.g., if we do it by attaching huge negative utility to the creation of such AIs, maybe it arranges to nuke any facility that it thinks might create them...)
Fair enough. I thought that you were using our own (imaginary) free will to derive a similar value for the AI. Instead, you seem to be saying that an AI can be programmed to be as ‘free’ as we are. That is, to change its utility function in response to the environment, as we do. That is such an abhorrent notion to me that I was eliding it in earlier responses. Do you really want to do that?
The reason, I think, that we differ on the important question (fixed vs evolving utility function) is that I’m optimistic about the ability of the masters to adjust their creation as circumstances change. Nailing down the utility function may leave the AI crippled in its ability to respond to certain occurrences, but I believe that the master can and will fix such errors as they occur. Leaving its morality rigidly determined allows us to have a baseline certainty that is absent if it is able to ‘decide its own goals’ (that is, let the world teach it rather than letting the world teach us what to teach it).
It seems like I want to build a mighty slave, while you want to build a mighty friend. If so, your way seems imprudent.
I don’t know. I don’t want to rule it out, since so far the total number of ways of making an AI system that will actually achieve what we want it to is … zero.
the ability of the masters to adjust their creation as circumstances change
That’s certainly an important issue. I’m not very optimistic about our ability to reach into the mind of something much more intellectually capable than ourselves and adjust its values without screwing everything up, even if it’s a thing we somehow created.
I want to build a mighty slave, while you want to build a mighty friend
The latter would certainly be better if feasible. Whether either is actually feasible, I don’t know. (One reason being that I suspect slavery is fragile: we may try to create a mighty slave but fail, in which case we’d better hope the ex-slave wants to be our friend.)
AlphaGo (roughly) may try many things to win at go, varieties of joseki or whatever.
I’m not sure that AlphaGo has any conception of what a joseki is supposed to be.
Moves have a certain ‘go-winningness’ to them (and camps full of losers forfeiting over and over have a higher ‘go-winningness’ than any move does), and it prefers higher. Saying that ‘go-winning’ isn’t ‘go-winning’ doesn’t mean anything.
Are the moves that AlphaGo played at the end of game 4 really about ‘go-winningness’ in the sense of what its programmers intended ‘go-winningness’ to mean?
I don’t think it’s clear that every neural net can propagate goals through itself perfectly.