I would like to argue that installing values in the first place can also be robust, if done the right way.
Fragility is not intrinsic to value.
Value isn’t fragile because value isn’t a process. Only processes can be fragile or robust.
Winning the lottery is a fragile process, because it has to be done all in one go. Consider the process of writing down a 12-digit phone number: if you try to memorise the whole number, and then write it down, you are likely to make a mistake, because of Miller's law, the one that says you can only hold five to nine items in short-term memory. Writing the digits down one at a time, as you hear them, is more robust. Being able to ask for corrections, or having errors pointed out to you, is more robust still.
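To put rough numbers on that (a minimal sketch with invented per-digit error rates, not measured ones): even a small chance of mis-recall compounds over twelve digits, so the one-shot approach is surprisingly unreliable, while the incremental, correctable version stays close to certain.

```python
# Rough illustration of one-shot recall versus incremental transcription.
# The per-digit error rates below are assumptions for the sake of example.

ONE_SHOT_ERROR = 0.05       # assumed chance of mis-recalling any digit from memory
INCREMENTAL_ERROR = 0.005   # assumed residual error per digit when you can ask for a repeat
DIGITS = 12

p_one_shot = (1 - ONE_SHOT_ERROR) ** DIGITS
p_incremental = (1 - INCREMENTAL_ERROR) ** DIGITS

print(f"P(all 12 digits correct, one shot):    {p_one_shot:.2f}")
print(f"P(all 12 digits correct, incremental): {p_incremental:.2f}")
```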
Processes that are incremental and involve error correction are robust, and can handle large volumes of data. The data aren't the problem: any volume of data can be communicated, so long as there is enough error correction. Trying to preload an AI with the totality of human value is the problem, because it is the most fragile way of instilling human value.
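To illustrate the communication claim concretely (a toy sketch, with an assumed channel error rate, not anything specific to AI): a simple repetition code over a noisy channel shows how adding error correction lets a long message get through with only a handful of residual mistakes.

```python
import random

# Toy error correction: send each bit several times over a noisy channel and
# take a majority vote at the other end. The flip probability and message
# length are assumptions for the sake of example.

random.seed(0)
FLIP_PROB = 0.05  # assumed chance the channel flips any single transmitted bit

def noisy_send(bit: int) -> int:
    """Simulate a channel that flips a bit with probability FLIP_PROB."""
    return bit ^ 1 if random.random() < FLIP_PROB else bit

def send_with_repetition(bits: list[int], copies: int = 3) -> list[int]:
    """Send each bit `copies` times and decode by majority vote."""
    received = []
    for bit in bits:
        votes = sum(noisy_send(bit) for _ in range(copies))
        received.append(1 if votes > copies // 2 else 0)
    return received

message = [random.randint(0, 1) for _ in range(10_000)]
decoded = send_with_repetition(message, copies=3)
errors = sum(a != b for a, b in zip(message, decoded))
print(f"residual errors: {errors} out of {len(message)} bits")
```

Sending more copies per bit pushes the residual error rate down further, which is the sense in which enough error correction can carry any volume of data.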
MIRI favours the preloading approach because it allows provable correctness, and provable correctness is an established technique for achieving reliable software in other areas: it's often used with embedded systems, critical systems, and so on.
But choosing that approach isn't a net gain, because it entails fragility, and loss of corrigibility, and because it is less applicable to current real-world AI. Current real-world AI systems are trained rather than programmed: to be precise, they are programmed to be trainable.
Training is a process that involves error correction. So training implies robustness. It also implies corrigibility, because corrigibility just is error correction. Furthermore, we know training is capable of instilling at least a good-enough level of ethics into an entity of at least human intelligence, because training instils good-enough ethics into most human children.
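As a purely illustrative sketch of training-as-error-correction (a toy perceptron on an assumed target rule, not a model of how real value learning works): the learner never has to get anything right first time; every mistake simply produces a small correction, and the corrections accumulate into the desired behaviour.

```python
import random

# Minimal sketch of training as repeated error correction: a perceptron
# learning a simple rule by nudging its weights whenever it makes a mistake.
# The target rule, learning rate and step count are assumptions for illustration.

random.seed(1)

def target_rule(x1: float, x2: float) -> int:
    """The behaviour we want the learner to pick up (assumed for illustration)."""
    return 1 if x1 + x2 > 1.0 else 0

weights = [0.0, 0.0]
bias = 0.0
LEARNING_RATE = 0.1

for step in range(1000):
    x1, x2 = random.random(), random.random()
    prediction = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
    error = target_rule(x1, x2) - prediction   # the correction signal
    # Each mistake nudges the learner a little; nothing has to be right first time.
    weights[0] += LEARNING_RATE * error * x1
    weights[1] += LEARNING_RATE * error * x2
    bias += LEARNING_RATE * error

test = [(random.random(), random.random()) for _ in range(1000)]
accuracy = sum(
    (1 if weights[0] * a + weights[1] * b + bias > 0 else 0) == target_rule(a, b)
    for a, b in test
) / len(test)
print(f"accuracy after incremental correction: {accuracy:.2f}")
```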
However, that approach isn't a net gain either. Trainable systems lack explicitness, in the sense that their goals are not coded in, but emerge from training, and are virtually impossible to determine by inspecting the source code. Without the ability to determine behaviour from source code, they lack provable correctness. On the other hand, their likely failure modes are more familiar to us: they are biomorphic, even if not anthropomorphic. They are less likely to present us with an inhuman threat, like the notorious paperclipper.
But surely a provably safe system is better? Only if provable really means provable, that is, if it implies 100% correctness. But a proof is a process, and one that can go wrong. A human can make a mistake. A proof assistant is a piece of software, and software is not magically immune to bugs. The difference between the two approaches is not that one is certain and the other is not. One approach requires you to get something right first time, which is very difficult, and which software engineers try to avoid where possible; the other lets errors be found and corrected as you go.
It is now beginning to look as though there never was a value fragility problem, aside from the decision to adopt the one-shot, preprogrammed strategy. Is that right?
One of the things the value fragility argument was supposed to establish was that a “miss is as good as a mile”. But humans, despite not sharing precise values, regularly achieve a typical level of ethical behaviour with an imperfect grasp of each other's values. Human value may be a small volume of valuespace, but it is still a fuzzy blob, not a mathematical point. (And feedback, error correction, is part of how humans muddle along: “would you mind not doing that, it annoys me”; “sorry, I didn't realise”.)
Another concern is that an AI would need to understand the whole of human value in order to create a better world, a better future: it would need to be friendly in the sense of adding value, not just refraining from subtracting value.
“Unfriendliness” is ambiguous: an unfriendly AI may be downright dangerous; or it might have enough grasp of ethics to be safe, but not enough to be able to make the world a much more fun place for humans. Unfriendliness in the second sense is not, strictly speaking, a safety issue. One of the disadvantages of the Friendliness approach is that it makes it difficult to discuss the strategy of forgoing fun in order to achieve safety, of building boringly safe AIs.