Coarse-grained impact measures end with the AI deploying massive-scale nanotech in order to try to cancel out butterfly effects and force the world onto a coarse-grained path as close as possible to the one it would have had if the AI “hadn’t existed”, however that counterfactual was defined.
Yes, if none of our various coarse-grainings catch the nanotech, and if we use some sort of averaging of deviations. It’s a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well). Then if the AI is capable of unraveling the butterfly effect for one of these measures, it will simply do nothing.
Doesn’t protect from some types of miracle science, I’m aware of that.
What? Can you give me a specific example and walk through this?
The idea of coarse-graining (the “20 billion question approach”) is to take a collection of attributes about the world, and use these to group the possible worlds into equivalence classes. Then the penalty function measures the difference between the probabilities of these equivalence classes, depending on whether X=1 or X=0.
What do we want from such a collection of attributes? Well, first we would hope that at least one of them would detect an AI behaving badly. So we could use the quantity of iron in the upper atmosphere, the Dow Jones average, air pressure, and a million more such attributes. The more attributes we shove in, the more we can hope to catch the AI, without having to know exactly how the AI would misbehave.
But it doesn’t help if one attribute detects the misbehaviour and the penalty function fails to register it. So we can never have a situation where “most of the attributes are fine, so that’s all ok”. Using the complete set of attributes to partition worlds into equivalence classes, and treating those equivalence classes as entirely distinct, ensures this. If two worlds have the same expected Dow Jones value, the same expected air pressure, and the same expected sewage volume in London, but different levels of metal in the upper atmosphere, we want them to count as being as different from each other as if all the attributes were different: “all the cameras report no intruder, sir. Oh, except for camera 5, which is broken for some reason” is not an acceptable attitude.
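To make this concrete, here is a minimal sketch of how such a penalty could be computed; the attribute functions, the sampled worlds, and the choice of total variation distance are illustrative assumptions of mine, not something the proposal fixes:

```python
from collections import Counter

def equivalence_class(world, attributes):
    # The class is the full tuple of attribute values, so differing in even one
    # attribute (the "camera 5" case) puts a world in a different class.
    return tuple(attr(world) for attr in attributes)

def class_distribution(worlds, attributes):
    # Empirical distribution over equivalence classes for a sample of worlds.
    counts = Counter(equivalence_class(w, attributes) for w in worlds)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def penalty(worlds_x1, worlds_x0, attributes):
    # worlds_x1 / worlds_x0: sampled worlds conditional on X=1 / X=0.
    p1 = class_distribution(worlds_x1, attributes)
    p0 = class_distribution(worlds_x0, attributes)
    # Total variation distance between the two class distributions; classes are
    # treated as atoms, with no notion of two classes being "nearly the same".
    return 0.5 * sum(abs(p1.get(c, 0.0) - p0.get(c, 0.0)) for c in set(p1) | set(p0))
```

Sampling here is just a stand-in for whatever probability estimates are actually used; the point is that a world differing in any single attribute gets a different class tuple, so the deviation registers in the distance.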
We also have to be wary of changes across time. We wouldn’t want the AI to grab the world with nanotech, and afterwards rearrange the world to minimize the penalty function: we want the initial grab to be penalised, whatever happens after. So we want to penalise any deviation at any time, irrespective of what happens at other times. This is an L-infinity norm over time (the penalty over time is the maximum of the penalty at any given moment).
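In symbols, one way to write that down (P_t being the distribution over equivalence classes at time t, and d whatever single-moment distance we choose):

$$\text{Penalty} \;=\; \sup_t \; d\big(P_t(\,\cdot \mid X{=}1),\; P_t(\,\cdot \mid X{=}0)\big),$$

so a large deviation at the moment of the grab keeps the penalty high no matter how well things are tidied up afterwards.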
To get back to your example, what happens when the AI can unravel the butterfly effect? Take the Dow Jones, for instance: maybe the AI can accurately predict its value depending on X=1 versus X=0 (“the safe-AI company is going IPO tomorrow, with the results of their first test to be published just before”). In that case, the AI will detect a large deviation in the X=1 versus X=0 world—except if it outputs null, which is thus what it will do.
How to avoid this (we want to have an AI that’s useful, not just safe)? Well, the more chaos we put into the attributes, the more confident we are the AI won’t be able to unravel the butterfly effect (the Dow Jones average might be predictable from a successful AI project, but what about the volatility?). So to effectively coarse-grain, we need:
A lot of different attributes
Which are all heavily laced with chaos for small variations, so unpredictable
But are somewhat predictable for large variations
And which will likely catch the disciple AI misbehaving, if it chooses to do so
A tall order! But I don’t think it’s that hard to at least get to the point where we feel that they are actually making things safer. My main worry is that we might not find enough desirable attributes to cover the space of possible misbehaviours.
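As a toy illustration of the second and third requirements (heavy chaos for small variations, predictability for large ones), here is a sketch using the logistic map as a stand-in attribute; the map, the clamp standing in for an intervention, and the numbers are purely illustrative:

```python
from statistics import mean

def trajectory(x0, steps=500, r=3.9, intervene=False):
    # Chaotic logistic map standing in for a "natural" attribute; the clamp is a
    # crude stand-in for a large, sustained intervention (an AI grabbing the world).
    xs, x = [], x0
    for _ in range(steps):
        x = r * x * (1 - x)
        if intervene:
            x = min(x, 0.3)
        xs.append(x)
    return xs

# Small variations are unpredictable: a 1e-9 change in the initial condition gets
# amplified until the two trajectories are essentially uncorrelated.
a = trajectory(0.500000000)
b = trajectory(0.500000001)
print(abs(a[60] - b[60]))   # typically of order 0.1, not 1e-9

# Large variations are predictable in the coarse statistic: the intervention shows
# up clearly in the long-run mean (about 0.3, versus roughly 0.5-0.6 without it).
c = trajectory(0.5, intervene=True)
print(mean(a), mean(c))
```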
Upvoted for a relatively high-quality response, regardless of whether you’re correct.
What, you prefer that to “It’s a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well).”? :-)
I call bullshit. This isn’t even magical thinking, it’s buzzwords.
I hope http://lesswrong.com/lw/a39/the_mathematics_of_reduced_impact_help_needed/5x19 has upgraded your impression so that it at least reaches magical thinking level :-)
It had precisely that effect on me. I retract the claim of “bullshit”, but it does indeed seem like magical thinking on the level of the Open Source Wish Project.
Furthermore, if you can get an AI to keep “the concentration of iron in the Earth’s atmosphere” as a goal rather than “the reading of this sensor which currently reports the concentration of iron in the Earth’s atmosphere” or “the AI’s estimate of the concentration of iron in the Earth’s atmosphere”… it seems to me you’ve done much of the work necessary to safely point the AI at human preference.
Ah, now we’re getting somewhere.
I disagree. With the most basic ontology—say, standard quantum mechanics with some model of decoherence—you could define pretty clearly what “iron” is (given a few weeks, I could probably do that myself). You’d need a bit more ontology—specifically, a sensible definition of position—to get “Earth’s atmosphere”. But all these are strictly much easier than defining what “love” is.
Also, in this model, it doesn’t matter much if your definitions aren’t perfect. If “iron” isn’t exactly what we thought it was, as long as it measures something present in the atmosphere that could diverge given a bad AI, we’ve got something.
Structurally the two are distinct. The Open Source Wish Project fails because it tries to define a goal that we “know” but are unable to precisely “define”. All the terms are questionable, and the definition gets longer and longer as they fail to nail down the terms.
In coarse-graining, instead, we start with lots of measures that are much more precisely defined, and just pile on more of them in the hope of constraining the AI, without understanding exactly how the constraints work. We have two extra things going for us: first, the AI can always output NULL, and do nothing. Secondly, the goal we have set up for the AI (in terms of its utility function) is one that is easy for it to achieve, so it can only squeeze a little bit more out by taking over everything; even small deviations in the penalty function are therefore enough to catch that.
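To put illustrative numbers on this (the penalty weight mu, the 0.95, and the “utility minus weighted penalty” form are all just for illustration, not part of the setup): suppose the easy goal already secures expected utility 0.95 out of a maximum of 1, essentially penalty-free, while a world-grabbing plan would secure the full 1. The grab is then only preferred if

$$U_{\text{grab}} - \mu\,\text{Penalty}_{\text{grab}} \;>\; U_{\text{easy}} \quad\Longleftrightarrow\quad \mu\,\text{Penalty}_{\text{grab}} \;<\; 0.05,$$

so any sizeable weight on even a small registered deviation rules the grab out, and NULL remains available if everything else looks penalised.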
Personally, I am certain that I could find a loophole in any “wish for immortality”, but given a few million coarse-grained constraints ranging across all types of natural and artificial processes, across all niches of the Earth, nearby space or the internet… I wouldn’t know where to begin. And this isn’t an unfair comparison, because coming up with thousands of these constraints is very easy, while spelling out what we mean by “life” is very hard.
What Vladimir said. The actual variable in the AI’s programming can’t be magically linked directly to the number of iron atoms in the atmosphere; it’s linked to the output of a sensor, or many sensors. There are always at least two possible failure modes: either the AI could suborn the sensor itself, or wirehead itself to believe the sensor has the correct value. These are not trivial failure modes; they’re some of the largest hurdles that Eliezer sees as integral to the development of FAI.
Yes, if the AI doesn’t have a decent ontology or image of the world, this method likely fails.
But again, this seems strictly easier than FAI: we need to define physics and position, not human beings, and not human values.
You’re missing the point: the distinction between the thing itself and various indicators of what it is.
I thought I was pretty clear on the distinction: traditional wishes are clear on the thing itself (eg immortality) but hopeless at the indicators; this approach is clear on the indicators, and more nebulous on how they achieve the thing (reduced impact).
By piling on indicators, we are, with high probability, making it harder for the AI to misbehave, closing out more and more avenues for it to do so, and pushing it to use methods that are more likely to fail. We only need the penalty to outweigh the difference between “expected utility for minimised impact (given an easy-to-max utility function)” and “unrestricted expected utility for that easy-to-max utility function”, which is a small number, to accomplish our goals.
Will the method accomplish this? Will improved versions of the method accomplish this? Nobody knows yet, but given what’s at stake, it’s certainly worth looking into.
“There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.” - C.A.R. Hoare
Which is to say, not knowing where to begin looking for exception cases is not, to my mind, a point in favor of a proposed design.
Good point. But Bayesianly, it has to be an advantage that there is no obvious flaw (conservation of probability and so on). And because of the failure modes available (getting nearly all the utility with a reduced-impact disciple, or outputting NULL), it is more likely to succeed.
This should be testable: try out some very toy problems, and see whether an AI, with as many resources as we can give it, is capable of finding a way around them.
I certainly agree that a too-complicated-to-understand system with no obvious flaw is more likely to do what I expect it to than a too-complicated-to-understand system with an obvious flaw. This is true not only Bayesianly, but common-sensically. But if you’re arguing that it’s more likely to do what I expect than a less complicated system, I fail to follow your reasoning.
Agreed that this should be empirically explorable, if not strictly testable. Personally, I’m more inclined to explore it using NIs (natural intelligences), though. That is, if a group of relatively dumb but highly motivated NIs construct the kind of complex framework of independent interlocking constraints we’re discussing, how likely is it that a relatively intelligent and equally motivated NI can find an unexpected loophole?
My own expectation is that while it’s not terribly likely on any given trial, it’s pretty likely given enough trials.
Just to make that a little more concrete… if N is a standard deviation of intelligence within the current human population, and we define a trial as a single iteration of the loophole-seeker analyzing the network of constraints, coming up with a strategy that has a chance of achieving its goals despite those constraints, and implementing that strategy, then if I assume a 3N delta between the constraint-defining agents and the loophole-seeking agent and enough of a resource discrepancy benefiting the constraint definers that they get to put their entire network in place before the loophole-seeker gets started, I’d estimate something on the order of a 1% chance of success for the loophole-seeker. So if a trial takes a few days and the constraint definers don’t modify their network after each trial, then I expect no more than a year before the loophole-seeker succeeds.
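Spelling out the arithmetic behind that “no more than a year” (taking “a few days” to mean roughly three days per trial, an assumption on my part): with a 1% chance of success per trial,

$$E[\text{trials to first success}] = \frac{1}{0.01} = 100 \;\approx\; 300 \text{ days}, \qquad P(\text{success within a year}) \;\approx\; 1 - 0.99^{120} \;\approx\; 0.7.$$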
What results would you expect from such a scenario?
I really don’t know. I would expect the loophole-seeker to be much more successful if partial success was possible.
Agreed.
The future 10^8 years later is going to look very different, even if things go right (FAI style or whatever), simply because we’ll have used the AI for something. This is going to push your L-infinity norm very high, regardless of its actions now, which is obviously very bad. As such, I think you want to weight it by e^-t or something.
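One way to write that suggestion down, with lambda a decay rate that would have to be chosen and D_t the deviation measured at time t:

$$\text{Penalty} \;=\; \sup_t \; e^{-\lambda t}\, D_t,$$

so deviations in the far future, which will be large even under good outcomes, are discounted away while deviations now still count in full.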
My other concern is that the AI will note that dedicating lots of resources to learning how to obey (game) the system will result in a really low score.