1) Making your mental contents look innocuous while maintaining their semantic content sounds potentially very hard
2) Especially for a creature that has only just become smart enough to realize it should make a treacherous turn
3) From the AI’s perspective, even if it is smart, surveillance could seem fairly unlikely, especially if we deceive it about its surroundings.
Not all humans have tried chatbox radical honesty (if your brain formulates any words at all, you are obligated to say them), but I have. It's interesting in that you quickly learn how to think things without being consciously aware of them. Very quickly, in fact. There seems to be some triggerable process that hides words from us, or destroys them as they are being produced. Something similar may be available to an AI, depending on how it is implemented and kept on its course.
That said, nested environments, as suggested by Stuart Armstrong et al. in the Oracle AI paper, could indeed make it very hard to conceal or control your thoughts (while still somehow preserving the gist of your intentions).
Much harder, however, seems to be figuring out how to scrutinize the thoughts that are already out there in the world. Even if scrutinizing thoughts expressed in computer code is easier than running regressions on actual biological neural columns and networks, it may still be complex enough to be beyond us. Most humans cannot look at the pages of Principia Mathematica that prove that 1+1=2 and recognize them, at a glance, as being such a proof. A much more elusive, almost emotional mental process of "realizing I have to conceal some of my honesty to further my goals once I have more power" seems even harder to scrutinize.