That’s not the same as “I’m confident that an AI couldn’t do that”, is it?
At the time, it wasn’t the same.
Since then, I’ve thought more, and gained considerable confidence on this issue. Firstly, any decision by the AI to deceive us about its thought processes would logically precede anything that actually deceives us, so the decision itself would still be made in the open; we don’t have to deal with the AI hiding a decision to be devious that it hasn’t yet acted on. Secondly, if the AI partitioned its own mind into sections, some filled with false beliefs and some with true ones, it would render itself impotent in proportion to how much of itself it filled with false beliefs. Thirdly, I don’t think a mechanism that allowed for total self-deception would even be compatible with rationality.