It’s a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well).
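(A minimal way of writing down what such a penalty might look like, purely as my own sketch: the coarse-graining measures $M_i$ and the “AI outputs NULL” baseline are assumptions I’m supplying, not anything specified above.)

$$\mathrm{Penalty} \;=\; \bigl\| M^{\mathrm{AI}} - M^{\mathrm{null}} \bigr\|_{\infty} \;=\; \max_{i,\,t}\, \bigl| M_i^{\mathrm{AI}}(t) - M_i^{\mathrm{null}}(t) \bigr|$$

where $M_i^{\mathrm{AI}}(t)$ is the value of coarse-grained measure $i$ at time $t$ if the AI acts, and $M_i^{\mathrm{null}}(t)$ is its value in the counterfactual where the AI does nothing. With an L-infinity norm the penalty is set by the single worst deviation across all measures and all moments of time, so a large excursion in any one measure at any one moment is enough to trigger it.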
I call bullshit. This isn’t even magical thinking, it’s buzzwords.
I hope http://lesswrong.com/lw/a39/the_mathematics_of_reduced_impact_help_needed/5x19 has upgraded your impression so that it at least reaches magical thinking level :-)
It had precisely that effect on me. I retract the claim of “bullshit”, but it does indeed seem like magical thinking on the level of the Open Source Wish Project.
Furthermore, if you can get an AI to keep “the concentration of iron in the Earth’s atmosphere” as a goal rather than “the reading of this sensor which currently reports the concentration of iron in the Earth’s atmosphere” or “the AI’s estimate of the concentration of iron in the Earth’s atmosphere”… it seems to me you’ve done much of the work necessary to safely point the AI at human preference.
Ah, now we’re getting somewhere.
I disagree. With the most basic ontology—say, standard quantum mechanics with some model of decoherence—you could define pretty clearly what “iron” is (given a few weeks, I could probably do that myself). You’d need a bit more ontology—specifically, a sensible definition of position—to get “Earth’s atmosphere”. But all these are strictly much easier than defining what “love” is.
Also, in this model, it doesn’t matter much if your definitions aren’t perfect. If “iron” isn’t exactly what we thought it was, as long as it measures something present in the atmosphere that could diverge given a bad AI, we’ve got something.
Structurally the two are distinct. The Open Source Wish Project fails because it tries to define a goal that we “know” but are unable to precisely “define”. All the terms are questionable, and the definition gets longer and longer as they fail to nail down the terms.
In coarse graining, instead, we start with lots of measures that are much more precisely defined, and just pile on more of them in the hope of constraining the AI, without understanding exactly how the constraints work. We have two extra things going for us: first, the AI can always output NULL, and do nothing. Secondly, the goal we have set up for the AI (in terms of its utility function) is one that is easy for it to achieve, so it can only squeeze a little bit more out by taking over everything; even small deviations in the penalty function are enough to catch that.
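(A toy sketch, on my own assumptions, of the decision rule being described; the measure functions, baseline, penalty weight and action set are placeholders, not anything from the post.)

```python
# Toy sketch (my assumptions, not a spec): the AI picks whatever maximises an
# easy-to-satisfy utility minus a penalty built from many coarse-grained
# measures, with NULL (do nothing) always available at zero penalty.

def penalty(action, measures, baseline, weight):
    """Worst deviation of any coarse-grained measure from its value in the
    'AI outputs NULL' baseline world, scaled by a penalty weight."""
    return weight * max(abs(m(action) - baseline[i]) for i, m in enumerate(measures))

def choose(actions, utility, measures, baseline, weight):
    """Return the best action, or None (standing in for NULL) if nothing
    beats doing nothing."""
    best_action, best_score = None, 0.0  # NULL: zero utility, zero penalty
    for a in actions:
        score = utility(a) - penalty(a, measures, baseline, weight)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

The hope expressed above is that, because the utility function is easy to max, a “take over everything” plan only buys a sliver of extra utility, so even a crude penalty of this shape should make NULL or a genuinely low-impact plan come out on top.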
Personally, I am certain that I could find a loophole in any “wish for immortality”, but given a few million coarse-grained constraints ranging across all types of natural and artificial processes, across all niches of the Earth, nearby space or the internet… I wouldn’t know where to begin. And this isn’t an unfair comparison, because coming up with thousands of these constraints is very easy, while spelling out what we mean by “life” is very hard.
What Vladimir said. The actual variable in the AI’s programming can’t be magically linked directly to the number of iron atoms in the atmosphere; it’s linked to the output of a sensor, or many sensors. There are always at least two possible failure modes: either the AI could suborn the sensor itself, or wirehead itself into believing the sensor has the correct value. These are not trivial failure modes; they’re some of the largest hurdles that Eliezer sees as integral to the development of FAI.
Yes, if the AI doesn’t have a decent ontology or image of the world, this method likely fails.
But again, this seems strictly easier than FAI: we need to define physics and position, not human beings, and not human values.
You’re missing the point: the distinction between the thing itself and various indicators of what it is.
I thought I was pretty clear on the distinction: traditional wishes are clear on the thing itself (e.g. immortality) but hopeless at the indicators; this approach is clear on the indicators, and more nebulous on how they achieve the thing (reduced impact).
By piling on indicators, we are, with high probability, making it harder for the AI to misbehave, closing off more and more avenues for it to do so, and pushing it towards methods that are more likely to fail. To accomplish our goals, we only need the penalty to outweigh the difference between “unrestricted expected utility for the easy-to-max utility function” and “expected utility for minimised impact (given that same easy-to-max utility function)”, which is a small number.
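(In symbols, my notation rather than anything from the thread: write $U^*$ for the unrestricted maximum of the easy-to-max utility function, $U_r$ for the best expected utility achievable while genuinely keeping impact low, and $P$ for the penalty that any high-impact plan picks up from the coarse-grained constraints.)

$$U^* - U_r \;<\; P$$

The scheme works whenever that inequality holds: the left-hand side is small by construction, so the penalty only needs to be crudely right, not exactly right, to tip the balance back towards low-impact behaviour or NULL.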
Will the method accomplish this? Will improved versions of the method accomplish this? Nobody knows yet, but given what’s at stake, it’s certainly worth looking into.
“There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.” - C.A.R. Hoare
Which is to say, not knowing where to begin looking for exception cases is not, to my mind, a point in favor of a proposed design.
Good point. But Bayesianly, it has to be an advantage that there is no obvious flaw (conservation of probability and so on). And because of the failure modes available (get nearly all the utility with a reduced-impact disciple, or output NULL), it is more likely to succeed.
This should be testable: try it out on some very toy problems, and see whether an AI, given as many resources as we can give it, is capable of finding a way around it.
I certainly agree that a too-complicated-to-understand system with no obvious flaw is more likely to do what I expect it to than a too-complicated-to-understand system with an obvious flaw. This is true not only Bayesianly, but common-sensically. But if you’re arguing that it’s more likely to do what I expect than a less complicated system, I fail to follow your reasoning.
Agreed that this should be empirically explorable, if not strictly testable. Personally, I’m more inclined to explore it using NIs (natural intelligences), though. That is, if a group of relatively dumb but highly motivated NIs construct the kind of complex framework of independent interlocking constraints we’re discussing, how likely is it that a relatively intelligent and equally motivated NI can find an unexpected loophole?
My own expectation is that while it’s not terribly likely on any given trial, it’s pretty likely given enough trials.
Just to make that a little more concrete… let N be a standard deviation of intelligence within the current human population, and define a trial as a single iteration of the loophole-seeker analyzing the network of constraints, coming up with a strategy that has a chance of achieving its goals despite those constraints, and implementing that strategy. If I assume a 3N delta between the constraint-defining agents and the loophole-seeking agent, and enough of a resource discrepancy benefiting the constraint definers that they get to put their entire network in place before the loophole-seeker gets started, I’d estimate something on the order of a 1% chance of success for the loophole-seeker per trial. So if a trial takes a few days and the constraint definers don’t modify their network after each trial, then I expect no more than a year before the loophole-seeker succeeds.
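(Spelling out the arithmetic behind that, using only the figures above: a 1% chance per trial and a few days per trial; treating trials as independent is my own simplifying assumption.)

```python
# Back-of-envelope check of the "no more than a year" estimate above.
p_success = 0.01       # assumed chance the loophole-seeker wins any single trial
days_per_trial = 3     # "a few days" per trial

expected_trials = 1 / p_success                    # geometric mean: 100 trials
expected_days = expected_trials * days_per_trial   # ~300 days, i.e. under a year

# Chance of at least one success within a year (~120 trials):
p_within_year = 1 - (1 - p_success) ** 120         # roughly 0.70
```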
What results would you expect from such a scenario?
I really don’t know. I would expect the loophole-seeker to be much more successful if partial success was possible.
Agreed.