It seems to me that proving honesty would be considerably easier. If you can prove that an AI is honest, and you've got a strong box for it, then you can modify its source code until it says the kind of thing you want in your conversations with it.
These notions of honesty and conversation seem kind of anthropomorphic. I agree that we want a transparent AI, one where we know what it's doing and why, but the means by which such transparency could be achieved would probably not be very similar to what a human does when she's trying to tell you the truth. I would actually guess that AI transparency would be much simpler than human honesty in some relevant sense. When you ask a human, "Well, why did you do this-and-such?" she'll use introspection to magically come up with a loose tangle of thoughts which sort-of kind-of resembles a high-level explanation, which is probably wrong, and then try to tell you about the experience using mere words which you probably won't understand. Whereas one might imagine asking a cleanly-designed AI the same question, and getting back a detailed printout of what actually happened.
Traceback (most recent call last):
  File "si-ai.py", line 5945434, in <module>
    if neuron_0xa8f3 == True: a8prop(0xa8f3, 234)
  File "si-ai.py", line 5945989, in a8prop
    ...
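To make the "detailed printout" idea a bit more concrete, here is a toy sketch (every name in it is made up for illustration; nobody is proposing this as an actual AI design): a decision procedure that records each rule it consults as it runs, so that asking "why did you do that?" returns the literal execution trace rather than an introspective reconstruction.

```python
# Toy sketch of a "transparent" decision procedure: it logs every step
# it actually takes, so its explanation is a record of what happened,
# not a story invented after the fact. All rule and action names are
# hypothetical, chosen only for illustration.

class TransparentAgent:
    def __init__(self, rules):
        # rules: list of (name, predicate, action) triples,
        # checked in order until one predicate fires.
        self.rules = rules
        self.trace = []

    def decide(self, observation):
        for name, predicate, action in self.rules:
            fired = predicate(observation)
            # Record the step whether or not the rule fired.
            self.trace.append((name, observation, fired))
            if fired:
                return action
        self.trace.append(("default", observation, True))
        return "no-op"

    def explain(self):
        # Return the literal record of execution -- no introspection,
        # just the steps that were actually taken.
        return [
            f"rule {name!r} on {obs!r}: {'fired' if fired else 'skipped'}"
            for name, obs, fired in self.trace
        ]

agent = TransparentAgent([
    ("greet-if-hello", lambda obs: obs == "hello", "say-hello-back"),
    ("flee-if-danger", lambda obs: obs == "danger", "run-away"),
])
print(agent.decide("danger"))   # the action chosen
for line in agent.explain():
    print(line)                 # the trace of how it was chosen
```

The point of the toy is only the contrast with human introspection: the human's "explanation" is generated separately from the process being explained, while here the explanation *is* the process's own log.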