The idea for coarse graining (the “20 billion question approach”) is to take a collection of attributes about the world, and use these to group the possible worlds into equivalence classes. Then the penalty function measures the difference between the probabilities of these equivalence classes, based on whether X=1 or X=0.
What do we want from such a collection of attributes? Well, first we would hope that at least one of them would detect an AI behaving badly. So we could use the quantity of iron in the upper atmosphere, the Dow Jones average, air pressure, and a million more such attributes. The more attributes we shove in, the more we can hope to catch the AI, without having to know exactly how the AI would misbehave.
But it doesn’t help if one attribute detects the misbehaviour and the penalty function doesn’t register this. So we can never allow a situation where “most of the attributes are fine, so that’s all ok”. Partitioning on the complete set of attributes, and treating the resulting equivalence classes as entirely distinct, achieves this. If two worlds have the same expected Dow Jones value, the same expected air pressure, and the same expected sewage volume in London, but different levels of metal in the upper atmosphere, we want them to count as being as different from each other as if all the attributes were different: “all the cameras report no intruder, sir. Oh, except for camera 5, which is broken for some reason” is not an acceptable attitude.
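To make that concrete, here’s a minimal sketch of the kind of thing I have in mind (everything here is hypothetical: “worlds” are just samples, the attributes are arbitrary functions of a world, and total variation distance is only one possible way of comparing the class distributions):

```python
from collections import Counter

def coarse_grain(worlds, attributes):
    """Group sampled worlds into equivalence classes keyed by the FULL tuple of
    attribute values: two worlds that differ on even one attribute land in
    different classes."""
    counts = Counter(tuple(attr(w) for attr in attributes) for w in worlds)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def penalty(worlds_x1, worlds_x0, attributes):
    """One possible penalty: total variation distance between the class
    distributions of the X=1 and X=0 samples. A misbehaviour that shows up in
    any single attribute moves probability mass between classes, so it
    registers in full rather than being averaged away."""
    p1 = coarse_grain(worlds_x1, attributes)
    p0 = coarse_grain(worlds_x0, attributes)
    return 0.5 * sum(abs(p1.get(c, 0.0) - p0.get(c, 0.0)) for c in set(p1) | set(p0))
```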
We also have to be wary of changes across time. We wouldn’t want the AI to grab the world with nanotech, and afterwards rearrange the world to minimize the penalty function: we want the initial grab to be penalised, whatever happens after. So we want to penalise any deviation at any time, irrespective of what happens at other times. This is an L-infinity norm over time (the penalty over time is the maximum of the penalty at any given moment).
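Continuing the hypothetical sketch above, the L-infinity norm over time would look something like this (per-moment samples, with the overall penalty being the worst per-moment penalty):

```python
def penalty_over_time(samples_x1_by_t, samples_x0_by_t, attributes):
    """L-infinity norm over time: the overall penalty is the maximum of the
    per-moment penalties, so a nanotech grab at any single moment is penalised
    in full, no matter how tidy the world looks at every later moment."""
    return max(penalty(w1, w0, attributes)
               for w1, w0 in zip(samples_x1_by_t, samples_x0_by_t))
```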
To get back to your example, what happens when the AI can unravel the butterfly effect? Take the Dow Jones, for instance: maybe the AI can accurately predict its value depending on X=1 versus X=0 (“the safe-AI company holds its IPO tomorrow, with the results of their first test to be published just before”). In that case, the AI will detect a large deviation between the X=1 and X=0 worlds, except if it outputs null, which is therefore what it will do.
How to avoid this (we want an AI that’s useful, not just safe)? Well, the more chaos we put into the attributes, the more confident we can be that the AI won’t be able to unravel the butterfly effect (the Dow Jones average might be predictable from a successful AI project, but what about the volatility?). So to coarse-grain effectively, we need:
A lot of different attributes
Which are all heavily laced with chaos for small variations, and so are unpredictable
But which are somewhat predictable for large variations (a toy illustration of this contrast follows the list)
And which will likely catch the disciple AI misbehaving, if it chooses to do so
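Here is that toy illustration of the second and third requirements (everything in it is made up: the logistic map stands in for a chaos-laced attribute like volatility, a tiny perturbation of the initial condition stands in for butterfly-effect noise, and a change of regime stands in for a large-scale intervention):

```python
import random

def chaotic_attribute(x0, r=3.99, steps=200):
    """Logistic map. With r = 3.99 it is chaotic: tiny differences in x0 get
    scrambled beyond prediction. With r = 2.5 it settles to a fixed point."""
    x = x0
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

random.seed(0)
base = 0.3141592

# Small variations: the attribute is effectively unpredictable (values scatter over (0, 1)).
print([round(chaotic_attribute(base + random.uniform(-1e-9, 1e-9)), 3) for _ in range(5)])

# Large variation (the system is pushed into a different regime): the attribute's
# distribution shifts in a stable, detectable way (everything lands near 0.6).
print([round(chaotic_attribute(base + random.uniform(-1e-9, 1e-9), r=2.5), 3) for _ in range(5)])
```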
A tall order! But I don’t think it’s that hard to at least get to the point where we feel that they are actually making things safer. My main worry is that we might not find enough desirable attributes to cover the space of possible misbehaviours.
Upvoted for a relatively high-quality response, regardless of whether you’re correct.
What, you prefer that to “It’s a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well).”? :-)