Which is to say, not knowing where to begin looking for exception cases is not, to my mind, a point in favor of a proposed design.
Good point. But Bayesianly, it has to be an advantage that there is no obvious flaw (conservation of probability and so on). And given the failure modes available (getting nearly all the utility with a reduced-impact disciple, or outputting NULL), it is more likely to succeed.
This should be testable: try out some very toy problems and see whether an AI, given as many resources as we can give it, is capable of finding a way around them.
I certainly agree that a too-complicated-to-understand system with no obvious flaw is more likely to do what I expect it to than a too-complicated-to-understand system with an obvious flaw. This is true not only Bayesianly, but common-sensically. But if you’re arguing that it’s more likely to do what I expect than a less complicated system, I fail to follow your reasoning.
Agreed that this should be empirically explorable, if not strictly testable. Personally, I’m more inclined to explore it using NIs (natural intelligences), though. That is, if a group of relatively dumb but highly motivated NIs construct the kind of complex framework of independent interlocking constraints we’re discussing, how likely is it that a relatively intelligent and equally motivated NI can find an unexpected loophole?
My own expectation is that while it’s not terribly likely on any given trial, it’s pretty likely given enough trials.
Just to make that a little more concrete: let N be a standard deviation of intelligence within the current human population, and define a trial as a single iteration of the loophole-seeker analyzing the network of constraints, coming up with a strategy that has a chance of achieving its goals despite those constraints, and implementing that strategy. If I assume a 3N delta between the constraint-defining agents and the loophole-seeking agent, and enough of a resource discrepancy favoring the constraint definers that they get to put their entire network in place before the loophole-seeker gets started, I’d estimate something on the order of a 1% chance of success per trial for the loophole-seeker. So if a trial takes a few days and the constraint definers don’t modify their network after each trial, then I expect no more than a year before the loophole-seeker succeeds.
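(Spelling out the arithmetic behind that, with my 1%-per-trial estimate and a roughly three-day trial as the assumed inputs: with per-trial success probability p = 0.01, the expected number of trials before the first success is 1/p = 100, which at about three days per trial comes to roughly 300 days. Equivalently, the chance of at least one success within n trials is 1 − (1 − p)^n, which passes 50% around n = 69 and reaches about 63% by n = 100, so "no more than a year" is the right order of magnitude.)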
What results would you expect from such a scenario?
I really don’t know. I would expect the loophole-seeker to be much more successful if partial success were possible.
Agreed.