It would be very hard, yes. I never tried to deny that. But I don’t think it’s hard enough to justify not trying to catch it.
Also, with that example you’re essentially only viewing the AI’s “output”. If you could model the cognitive processes of the authors of secretly malicious code, it would be much more obvious that some of their (instrumental) goals didn’t correspond to the goals you wanted them to achieve. The only way an AI could deceive us would be to deceive itself, and I’m not confident that an AI could do that.
That’s not the same as “I’m confident that an AI couldn’t do that”, is it?
At the time, it wasn’t the same.
Since then, I’ve thought about it more and gained a lot of confidence on this issue. Firstly, any decision by the AI to deceive us about its thought processes would logically precede anything that actually deceives us, so if we’re modeling those processes we would see the decision to be devious before the AI had any chance to hide it. Secondly, if the AI divides its own mind into sections, some filled with false beliefs and some with true ones, it seems like it would render itself impotent in proportion to the extent that it filled itself with false beliefs. Thirdly, I don’t think a mechanism that allowed for total self-deception would even be compatible with rationality.