Hmm. I think what he was proposing was that the AI’s actions be defined as the union of the set of things the AI thinks will accomplish the goal, and the set of things the models won’t veto. Assuming there’s a way to prevent the AI from intentionally compromising the models, the question is how often the AI will come up with a plan for accomplishing goal X which will, completely coincidentally, break the models in undesirable ways.
It’s not about AI intentionally doing something, but simply AI not sharing our unspoken assumptions. That means, unless you explicitly forbid something, AI considers it a valid solution, because AI is not human.
For example, you are in a building on a 20th floor and say to AI “get me out of this building, quickly”, and AI throws you out of the window. Not because AI hates you or tries to subvert your commands, but because this is what AI think is a solution to your command. A human would not do it, because a human shares a lot of assumptions with you, such as you probably want to get out of the building alive. But AI only knows that you wanted a quick solution.
This specific case could be fixed by telling AI “don’t ever harm humans”. There are two problems to this. First, what else did we forget to tell AI? It is OK if the AI gets you out of the building quickly by destroying half of the building, assuming humans will not be harmed? Second, how exactly do you define “harm”? Too strict definition would drive the AI mad, because every second many cells in your body are destroyed, and the AI cannot prevent this. Or if you say that such slow damages are acceptable, then the AI may prevent doctors from operating your tumor, because tumor kills you slowly in a long time, while the operation would harm you in a short term.
What we humans consider “obvious” or “normal” is very unintuitive for AI. There are things we learned through evolution, and the AI did not evolve with us, it does not have the same genes, etc.
How often will AI use a surprising solution? I would think the more difficult task, the more often this would happen. Simple tasks have simple solution, but we kinda don’t need a superhuman AI to do this for us. The more difficult is the task, the more chances are that the optimal (for AI) solution contains a shortcut that is not acceptable for us, but maybe we failed to communicate this.
Respectfully, I think you may have overestimated inferential distance: I have read the sequences.
My point wasn’t that the AI wouldn’t produce bad output. I’d say the majority of its proposed solutions would be… let’s call it inadvertently hostile. However, the models would presumably veto those scenarios on sight.
Say we’ve got a two step process: one, generate a whole bunch of valid ways of solving a problem (this would include throwing you out the window, carrying you gently down the stairs, blowing up the building, and everything in between) and two: let the models veto those scenarios they find objectionable.
In that case, most of the solutions, including the window one, would be vetoed. So, in principle, the remaining set would be full of solutions that both satisfy the models, and accomplish the desired goal. My question is, of the solutions generated in step one, how many of them will also, coincidentally, cause the models to behave in unwanted ways (e.g. wireheading, neurolinguistic hacking, etc.)
Sorry for pattern-matching you into “probably did not read sequences”.
My question is, of the solutions generated in step one, how many of them will also, coincidentally, cause the models to behave in unwanted ways (e.g. wireheading, neurolinguistic hacking, etc.)
Assumption one: pseudo-wireheading (changing model’s preferences subtly enough that the model would not notice the changes, at least until it is modified enough to accept them, in a way that can finally lead the model to accept any solution) is possible. -- I think it is; can’t prove it formally.
Assumption two: pseudo-wireheading has a fixed cost. -- It is not trivially true, because the model’s reaction may depend on a problem that is being solved. A specific strategy of pseudo-wireheading may feel natural when solving one type of problem, but may feel suspicious when solving another type of problem. (For example if the problem is “show me a blueprint for a perfect society”, love-bombing by simulated people may feel like a valid part of the answer; but it would be suspicious if the problem is “show me the best intepretation of quantum physics”.) But I think there is a cost such that for any problem, there exists a pseudo-wireheading strategy with at most this cost.
Assumption three: The more complex problem, the greater chance that the cost of pseudo-wireheading plus the cost of simplest “unfriendly” solution will be less than the cost of simplest “friendly” solution. In other words, for more complex problems, the additional cost of “friendliness” is higher, and at some point it will be greater than the cost of pseudo-wireheading.
Based on these three assumptions, I think that as the complexity of the problem increases, the chance of an “unfriendly” solution increases towards 1.
Okay, after some thought (and final exams), here’s what I think:
If our plan generator returns a set of plans ranked by overall expected utility, the question is basically whether the hit to expected utility provided by Friendliness moves you further down the list than the average joint improbability of whatever sequence of extraneous events would have to happen to break the models in undesirable ways. A second question is how those figures will scale with the complexity of the problem being undertaken. The answer is a firm ‘I don’t know.’ We’d need to run some extensive experiments to find out.
That said, I can think of several ways to address this problem, and, if I were the designer, I would probably take all of them. You could, for example, start the system on easier problems, and have it learn heuristics for things that the models will veto, since you can’t mindhack simple heuristics, and have it apply those rules during the plan generation stage, thus reducing the average distance to Friendliness. You could also have the models audit themselves after each scenario, and determine if they’ve been compromised. Still doesn’t make mindhacking impossible, but it means it has to be a lot more complicated, which non-linearly increases the average distance to mindhacking. There are other strategies to pursue as well. Basically, given that we know the hardest problem we’re planning on throwing this AI at, I would bet there’s a way to make it safe enough to tackle that problem without going apocalyptic.
We could start from simple problems, progress to more complex ones, and gradually learn about the shortcuts AI could make, and develop related heuristics. This seems like a good plan, though there are a few things that could make it difficult:
Maybe we can’t develop a smooth problem complexity scale with meaningful problems along it. (If the problem is not meaningful, it will be diffiult to tell the difference between reasonable and unreasonable solution. Or a heuristics developed on artificial problems may fail on an important problem.) The set of necessary heuristics may grow too fast; maybe for twice more complex problem we will need ten times more safeguards, which will make progress beyond some boundary extremely slow; and some important tasks may be far beyond that boundary. Or at some moment we may be unable to produce the necessary heuristic, because the heuristic itself would be too complex.
But generally, the first good use of the superhuman AI is to learn more about the superhuman AI.
I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased. That does not seem likely to me, but I wouldn’t make that claim too confidently without some experimentation. Also, for clarity, I was thinking that the heuristics for weeding out bad plans would be developed automatically. Adding them by hand would not be a good position to be in.
I see the fundamental advantage of the bootstrap idea as being able to worry about a much more manageable subset of the problem: can we keep it on task and not-killing-us long enough to build something safer?
I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased.
I was thinking about something like this:
A simple problem P1 has a Friendly solution with cost 12, and three Unfriendly solutions with costs 11. We either have to add three simple heuristics, one to block each Unfriendly solution, or maybe one more complex heuristics to block them all—but the latter option assumes that the three Unfriendly solutions have some similar essence, which can be identified and blocked.
A complex problem P9 has a Friendly solution with cost 8000, and thousand Unfriendly solutions with costs between 7996 and 7999. Three hundred of them are blocked by heuristics already developed for problems P1-P8, but there are seven hundred new ones. -- The problem is not that the distance between 7996 and 8000 is greater than between 12 and 11, but rather that within that distance the number of “creatively different” Unfriendly solutions is growing too fast. We have to find a ton of heuristics before moving on to P10.
This all is just some imaginary numbers, but my intuition is that the more complex problem may provide not only a few much cheaper Unfriendly solutions, but also extremely many little cheaper Unfriendly solutions, whose diversity may be difficult to cover by a small set of heuristics.
On the other hand, having developed enough heuristics, maybe we will see a pattern emerging, and make a better description of human utility functions. Specifying what we want, even if it proves very difficult, may still be more simple than adding all the “do not want” exceptions to the model. Maybe having a decent set of realistic “do not want” exceptions will help us discover what we really want. (By realistic I mean: really generated by AI’s attempts for a simple solution, simple as in Occam’s razor; not just pseudosolutions generated by an armchair philosopher.)
My intuitions for how to frame the problem run a little differently.
The way I see it, there is no possible way to block all unFriendly or model-breaking solutions, and it’s foolish to try. Try framing it this way: any given solution has some chance of breaking the models, probably pretty low. Call that probability P (god I miss Latex) The goal is to get Friendliness close enough to the top of the list that P (which ought to be constant) times the distance D from the top of the list is still below whatever threshold we set as an acceptable risk to life on Earth / in our future light cone. In other words, we want to minimize the distance of the Friendly solutions from the top of the list, since each solution considered before finding the Friendly one brings a risk of breaking the models, and returning a false ‘no-veto’.
Let’s go back to the grandmother-in-the-burning-building problem: if we assume that this problem has mind-hacking solutions closer to the top of the list than the Friendly solutions (which I kind of doubt, really), then we’ve got a problem. Let’s say, however, that our plan-generating system has a new module which generates simple heuristics for predicting how the models will vote (yes this is getting a bit recursive for my taste). These heuristics would be simple rules of thumb like ‘models veto scenarios involving human extinction’ ‘models tend to veto scenarios involving screaming and blood’ ‘models veto scenarios that involve wires in their brains’.
When parsing potential answers, scenarios that violate these heuristics are discarded from the dataset early. In the case of the grandmother problem, a lot of the unFriendly solutions can be discarded using a relatively small set of such heuristics. It doesn’t have to catch all of them, but it isn’t meant to—simply by disproportionately eliminating unFriendly solutions before the models see them, you shrink the net distance to Friendliness. I see that as a much more robust way of approaching the problem.
Hmm. I think what he was proposing was that the AI’s actions be defined as the union of the set of things the AI thinks will accomplish the goal, and the set of things the models won’t veto. Assuming there’s a way to prevent the AI from intentionally compromising the models, the question is how often the AI will come up with a plan for accomplishing goal X which will, completely coincidentally, break the models in undesirable ways.
It’s not about AI intentionally doing something, but simply AI not sharing our unspoken assumptions. That means, unless you explicitly forbid something, AI considers it a valid solution, because AI is not human.
For example, you are in a building on a 20th floor and say to AI “get me out of this building, quickly”, and AI throws you out of the window. Not because AI hates you or tries to subvert your commands, but because this is what AI think is a solution to your command. A human would not do it, because a human shares a lot of assumptions with you, such as you probably want to get out of the building alive. But AI only knows that you wanted a quick solution.
This specific case could be fixed by telling AI “don’t ever harm humans”. There are two problems to this. First, what else did we forget to tell AI? It is OK if the AI gets you out of the building quickly by destroying half of the building, assuming humans will not be harmed? Second, how exactly do you define “harm”? Too strict definition would drive the AI mad, because every second many cells in your body are destroyed, and the AI cannot prevent this. Or if you say that such slow damages are acceptable, then the AI may prevent doctors from operating your tumor, because tumor kills you slowly in a long time, while the operation would harm you in a short term.
What we humans consider “obvious” or “normal” is very unintuitive for AI. There are things we learned through evolution, and the AI did not evolve with us, it does not have the same genes, etc.
How often will AI use a surprising solution? I would think the more difficult task, the more often this would happen. Simple tasks have simple solution, but we kinda don’t need a superhuman AI to do this for us. The more difficult is the task, the more chances are that the optimal (for AI) solution contains a shortcut that is not acceptable for us, but maybe we failed to communicate this.
More about this: The Hidden Complexity of Wishes.
Respectfully, I think you may have overestimated inferential distance: I have read the sequences.
My point wasn’t that the AI wouldn’t produce bad output. I’d say the majority of its proposed solutions would be… let’s call it inadvertently hostile. However, the models would presumably veto those scenarios on sight.
Say we’ve got a two step process: one, generate a whole bunch of valid ways of solving a problem (this would include throwing you out the window, carrying you gently down the stairs, blowing up the building, and everything in between) and two: let the models veto those scenarios they find objectionable.
In that case, most of the solutions, including the window one, would be vetoed. So, in principle, the remaining set would be full of solutions that both satisfy the models, and accomplish the desired goal. My question is, of the solutions generated in step one, how many of them will also, coincidentally, cause the models to behave in unwanted ways (e.g. wireheading, neurolinguistic hacking, etc.)
Sorry for pattern-matching you into “probably did not read sequences”.
Assumption one: pseudo-wireheading (changing model’s preferences subtly enough that the model would not notice the changes, at least until it is modified enough to accept them, in a way that can finally lead the model to accept any solution) is possible. -- I think it is; can’t prove it formally.
Assumption two: pseudo-wireheading has a fixed cost. -- It is not trivially true, because the model’s reaction may depend on a problem that is being solved. A specific strategy of pseudo-wireheading may feel natural when solving one type of problem, but may feel suspicious when solving another type of problem. (For example if the problem is “show me a blueprint for a perfect society”, love-bombing by simulated people may feel like a valid part of the answer; but it would be suspicious if the problem is “show me the best intepretation of quantum physics”.) But I think there is a cost such that for any problem, there exists a pseudo-wireheading strategy with at most this cost.
Assumption three: The more complex problem, the greater chance that the cost of pseudo-wireheading plus the cost of simplest “unfriendly” solution will be less than the cost of simplest “friendly” solution. In other words, for more complex problems, the additional cost of “friendliness” is higher, and at some point it will be greater than the cost of pseudo-wireheading.
Based on these three assumptions, I think that as the complexity of the problem increases, the chance of an “unfriendly” solution increases towards 1.
Okay, after some thought (and final exams), here’s what I think:
If our plan generator returns a set of plans ranked by overall expected utility, the question is basically whether the hit to expected utility provided by Friendliness moves you further down the list than the average joint improbability of whatever sequence of extraneous events would have to happen to break the models in undesirable ways. A second question is how those figures will scale with the complexity of the problem being undertaken. The answer is a firm ‘I don’t know.’ We’d need to run some extensive experiments to find out.
That said, I can think of several ways to address this problem, and, if I were the designer, I would probably take all of them. You could, for example, start the system on easier problems, and have it learn heuristics for things that the models will veto, since you can’t mindhack simple heuristics, and have it apply those rules during the plan generation stage, thus reducing the average distance to Friendliness. You could also have the models audit themselves after each scenario, and determine if they’ve been compromised. Still doesn’t make mindhacking impossible, but it means it has to be a lot more complicated, which non-linearly increases the average distance to mindhacking. There are other strategies to pursue as well. Basically, given that we know the hardest problem we’re planning on throwing this AI at, I would bet there’s a way to make it safe enough to tackle that problem without going apocalyptic.
We could start from simple problems, progress to more complex ones, and gradually learn about the shortcuts AI could make, and develop related heuristics. This seems like a good plan, though there are a few things that could make it difficult:
Maybe we can’t develop a smooth problem complexity scale with meaningful problems along it. (If the problem is not meaningful, it will be diffiult to tell the difference between reasonable and unreasonable solution. Or a heuristics developed on artificial problems may fail on an important problem.) The set of necessary heuristics may grow too fast; maybe for twice more complex problem we will need ten times more safeguards, which will make progress beyond some boundary extremely slow; and some important tasks may be far beyond that boundary. Or at some moment we may be unable to produce the necessary heuristic, because the heuristic itself would be too complex.
But generally, the first good use of the superhuman AI is to learn more about the superhuman AI.
I think I would be surprised to see average distance to Friendliness grow exponentially faster than average distance to mindhacking as the scale of the problem increased. That does not seem likely to me, but I wouldn’t make that claim too confidently without some experimentation. Also, for clarity, I was thinking that the heuristics for weeding out bad plans would be developed automatically. Adding them by hand would not be a good position to be in.
I see the fundamental advantage of the bootstrap idea as being able to worry about a much more manageable subset of the problem: can we keep it on task and not-killing-us long enough to build something safer?
I was thinking about something like this:
A simple problem P1 has a Friendly solution with cost 12, and three Unfriendly solutions with costs 11. We either have to add three simple heuristics, one to block each Unfriendly solution, or maybe one more complex heuristics to block them all—but the latter option assumes that the three Unfriendly solutions have some similar essence, which can be identified and blocked.
A complex problem P9 has a Friendly solution with cost 8000, and thousand Unfriendly solutions with costs between 7996 and 7999. Three hundred of them are blocked by heuristics already developed for problems P1-P8, but there are seven hundred new ones. -- The problem is not that the distance between 7996 and 8000 is greater than between 12 and 11, but rather that within that distance the number of “creatively different” Unfriendly solutions is growing too fast. We have to find a ton of heuristics before moving on to P10.
This all is just some imaginary numbers, but my intuition is that the more complex problem may provide not only a few much cheaper Unfriendly solutions, but also extremely many little cheaper Unfriendly solutions, whose diversity may be difficult to cover by a small set of heuristics.
On the other hand, having developed enough heuristics, maybe we will see a pattern emerging, and make a better description of human utility functions. Specifying what we want, even if it proves very difficult, may still be more simple than adding all the “do not want” exceptions to the model. Maybe having a decent set of realistic “do not want” exceptions will help us discover what we really want. (By realistic I mean: really generated by AI’s attempts for a simple solution, simple as in Occam’s razor; not just pseudosolutions generated by an armchair philosopher.)
My intuitions for how to frame the problem run a little differently.
The way I see it, there is no possible way to block all unFriendly or model-breaking solutions, and it’s foolish to try. Try framing it this way: any given solution has some chance of breaking the models, probably pretty low. Call that probability P (god I miss Latex) The goal is to get Friendliness close enough to the top of the list that P (which ought to be constant) times the distance D from the top of the list is still below whatever threshold we set as an acceptable risk to life on Earth / in our future light cone. In other words, we want to minimize the distance of the Friendly solutions from the top of the list, since each solution considered before finding the Friendly one brings a risk of breaking the models, and returning a false ‘no-veto’.
Let’s go back to the grandmother-in-the-burning-building problem: if we assume that this problem has mind-hacking solutions closer to the top of the list than the Friendly solutions (which I kind of doubt, really), then we’ve got a problem. Let’s say, however, that our plan-generating system has a new module which generates simple heuristics for predicting how the models will vote (yes this is getting a bit recursive for my taste). These heuristics would be simple rules of thumb like ‘models veto scenarios involving human extinction’ ‘models tend to veto scenarios involving screaming and blood’ ‘models veto scenarios that involve wires in their brains’.
When parsing potential answers, scenarios that violate these heuristics are discarded from the dataset early. In the case of the grandmother problem, a lot of the unFriendly solutions can be discarded using a relatively small set of such heuristics. It doesn’t have to catch all of them, but it isn’t meant to—simply by disproportionately eliminating unFriendly solutions before the models see them, you shrink the net distance to Friendliness. I see that as a much more robust way of approaching the problem.
I understand your point, and it’s a good one. Let me think about it and get back to you.