Q: Is it important to figure out how to make AI provably friendly to us and our values (non-dangerous), before attempting to solve artificial general intelligence?
Stan Franklin: Proofs occur only in mathematics.
This seems like a good point, and something that’s been kind of bugging me for a while. It seems like “proving” an AI design will be friendly is like proving a system of government won’t lead to the economy going bad. I don’t understand how it’s supposed to be possible.
I can understand how you can prove a hello world program will print “hello world”, but friendly AI designs are based around heavy interaction WITH the messy outside world: not just saying hello to it, but learning all but its most primitive values from it.
How can the AI be developing 99% of its utility function by stealing it from the outside world, where we can’t even “prove” that the shop won’t be out of shampoo, and yet we simultaneously have a “proof” that this will all work out? Even if we’re not proving “friendliness” per se, but just that the AI has “consistent goals under self-modification”, consistent with WHAT? If you’re not programming in an opinion about abortion and gun control to start with, how can any value it comes to hold on those questions be “consistent” OR “inconsistent”?
“Friendliness” itself may well be impossible to define precisely enough to prove, but there are narrower desirable properties that can be proven. You can prove that the AI optimizes correctly on special cases that are simpler than the world as a whole; you can prove that it doesn’t have certain classes of security holes; you can prove that it’s resilient against single-bit errors. With a more detailed understanding of metaethics, we might prove that it aggregates values in a way that’s stable in spite of outliers, and that its debug output about the values it has discovered is accurate. Basically, we should prove as much as we can, even if some parts aren’t amenable to formal proof.
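To make the “narrower properties” idea concrete, here is a toy sketch of my own (not anything from the thread, and nobody’s actual proof framework): on a finite special case, “proving the optimizer is correct” can be done by exhaustive enumeration, and “resilience against single-bit errors” can be checked by flipping every bit of a redundantly stored goal. The names, the triple-redundancy scheme, and the toy utility function are all invented for illustration.

```python
# Illustrative toy example only: machine-checkable "narrow properties"
# of a trivially small agent, checked by brute force.

ACTIONS = range(8)            # a finite "special case" of a decision problem

def utility(action):          # hypothetical utility function for the toy case
    return -(action - 5) ** 2

def choose(actions):          # the "optimizer" whose correctness we want to check
    return max(actions, key=utility)

# Property 1: on this finite special case, the chosen action is never beaten.
assert all(utility(choose(ACTIONS)) >= utility(a) for a in ACTIONS)

# Property 2: a triple-redundant goal encoding survives any single-bit error.
def store(bits):              # store each goal bit three times
    return [b for b in bits for _ in range(3)]

def recover(stored):          # majority vote over each triple recovers the goal
    return [int(sum(stored[i:i + 3]) >= 2) for i in range(0, len(stored), 3)]

goal = [1, 0, 1, 1]
for i in range(len(store(goal))):      # flip every single bit in turn
    corrupted = store(goal)
    corrupted[i] ^= 1
    assert recover(corrupted) == goal  # the stored goal always survives one flip

print("all narrow properties checked")
```

Real proofs would of course be over unbounded state spaces and done in a proof assistant rather than by enumeration, but the flavor is the same: pick a property far narrower than “friendly”, and verify that.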
I’ve been under the impression that “Friendliness proofs” aren’t about proving Friendliness as such. Rather, they’re proofs that whatever is set as the AI’s goal function will always be preserved by the AI as its goal function, no matter how much self-improvement it goes through.
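A minimal sketch of that goal-preservation idea, in my own framing (this is not an actual proof technique from SIAI or anyone else, and a finite spot-check stands in for what would really need to be a proof): a self-modifying agent only adopts a rewrite of itself if the successor demonstrably keeps the current goal function.

```python
# Toy sketch: an agent that refuses self-modifications which change its goal.

def goal(outcome):                      # the goal function we want preserved
    return outcome.get("paperclips", 0)

class Agent:
    def __init__(self, goal_fn, policy):
        self.goal_fn = goal_fn
        self.policy = policy            # maps a situation to an outcome

    def consider_rewrite(self, successor, test_situations):
        # Adopt the successor only if it keeps the *current* goal function and
        # does at least as well by that goal on every checked situation.
        same_goal = successor.goal_fn is self.goal_fn
        no_regression = all(
            self.goal_fn(successor.policy(s)) >= self.goal_fn(self.policy(s))
            for s in test_situations
        )
        return successor if (same_goal and no_regression) else self

# Usage: a smarter successor with the same goal gets adopted; a successor with a
# swapped-in goal is rejected, however well it scores by its own lights.
base = Agent(goal, lambda s: {"paperclips": 1})
better = Agent(goal, lambda s: {"paperclips": 2})
hijacked = Agent(lambda o: o.get("staples", 0), lambda s: {"staples": 99})

assert base.consider_rewrite(better, [{}]) is better
assert base.consider_rewrite(hijacked, [{}]) is base
```

The hard part, which this sketch waves away, is making that acceptance check a genuine proof that holds for every situation the successor could ever face, not a comparison on a handful of test cases.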
It seems like “proving” an AI design will be friendly is like proving a system of government won’t lead to the economy going bad.
That doesn’t sound impossible. Consider that in the case of a seed AI, the “government” only has to deal with a single perfectly rational, textbook game-theoretic agent. The only reason economists fail to predict how certain policies will affect the economy is that their models often have to deal with a lot of unknown or unpredictable factors. In the case of an AI, the policy is applied to the model itself, which is a well-defined mathematical entity.
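(My gloss, not the commenter’s: the “textbook agent” being gestured at here is usually the expected-utility maximizer, which is what makes it a well-defined mathematical object rather than an economy full of people:

$$ a^{*} \;=\; \operatorname*{arg\,max}_{a \in A} \; \sum_{o \in O} P(o \mid a)\, U(o) $$

where $A$ is the set of available actions, $O$ the set of outcomes, $U$ the agent’s utility function, and $P(o \mid a)$ its probabilistic model of what each action leads to.)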