That’s a sensible sounding intuition, but it’s wrong. So what if the AI is smarter than us? We still have full access to its inner workings, and have the ability to design its internal structures in such a way as to remain transparently sensible to auditors. I turns out you can audit just about anything to whatever required level of detail with probabilistic assurances. You can use tools without general intelligence to perform the auditing task free of influence from the boxed AI, and there are cool tricks from cryptography that make it impossible for the AI to fake computational traces without doing an energetically detectable multiple of the purported work.
To better prime your intuition, think about it this way: to make a recommendation of a course of action which achieves X requires searching the space of actions for one which satisfies condition X. This has some inherent difficulty related to the size of the answer space related to the size of the search space (as well as the structure of the two spaces, but let’s simplify for now). Assuming random distributions, you can say that an a course of action requires 40 bits of identifying information, meaning that you need to perform 2^40 operations to find an instance of a solution.
Now let’s say that the answer space which contains innocuous looking, but actually evil solutions requires 60 bits. In other words, for every evil solution there is 2^20 good solutions. Then an AI which wants to deceive you into doing an action which purports to meet good requirement X, but actually meets the AI’s evil hidden requirement Y requires 2^60 operations.
You can tell whether a machine did 2^40 or 2^60 operations, and if it does the latter it better have a damn good explanation for why. (And faking that explanation would require further computation.. I hope you see where this is going.)
Is there somewhere where I can find this stuff written up, and discover exactly what is known to be (provably, or at least that’s the impression I get from what you write) achievable by auditing?
I remain gently skeptical despite your very confident tone because, e.g., we don’t currently know how to make any sort of superintelligent machine, and I at least would be rather surprised by any theorem along the lines of “given any reasonable class of superintelligent agents, there is a reasonably straightforward way to make a superintelligent agent in this class that can be shown to be innocuous by means of auditing that ordinary human beings are capable of doing reliably”.
For the avoidance of doubt, I have no difficulty at all in believing e.g. that there are auditing techniques that will guarantee (or very very nearly guarantee) that a particular agent is performing a particular computational process; I would be only modestly surprised to find that there are techniques that will verify that a particular agent is in some sense optimizing a particular objective function; but the difficulties of keeping a superintelligent AI from doing terrible things are much more complicated and include e.g. tremendous difficulty in working out what it is we really want optimized, and what computational processes we really want carried out.
Perhaps it would be useful to get a bit more concrete. Could you give an example of the sort of thing we might want a superintelligent AI to do for us, that we can’t “obviously” make it do safely without the techniques you have in mind, and explain how those techniques enable us to make it do that thing safely?
That’s a sensible sounding intuition, but it’s wrong. So what if the AI is smarter than us? We still have full access to its inner workings, and have the ability to design its internal structures in such a way as to remain transparently sensible to auditors. I turns out you can audit just about anything to whatever required level of detail with probabilistic assurances. You can use tools without general intelligence to perform the auditing task free of influence from the boxed AI, and there are cool tricks from cryptography that make it impossible for the AI to fake computational traces without doing an energetically detectable multiple of the purported work.
To better prime your intuition, think about it this way: to make a recommendation of a course of action which achieves X requires searching the space of actions for one which satisfies condition X. This has some inherent difficulty related to the size of the answer space related to the size of the search space (as well as the structure of the two spaces, but let’s simplify for now). Assuming random distributions, you can say that an a course of action requires 40 bits of identifying information, meaning that you need to perform 2^40 operations to find an instance of a solution.
Now let’s say that the answer space which contains innocuous looking, but actually evil solutions requires 60 bits. In other words, for every evil solution there is 2^20 good solutions. Then an AI which wants to deceive you into doing an action which purports to meet good requirement X, but actually meets the AI’s evil hidden requirement Y requires 2^60 operations.
You can tell whether a machine did 2^40 or 2^60 operations, and if it does the latter it better have a damn good explanation for why. (And faking that explanation would require further computation.. I hope you see where this is going.)
Is there somewhere where I can find this stuff written up, and discover exactly what is known to be (provably, or at least that’s the impression I get from what you write) achievable by auditing?
I remain gently skeptical despite your very confident tone because, e.g., we don’t currently know how to make any sort of superintelligent machine, and I at least would be rather surprised by any theorem along the lines of “given any reasonable class of superintelligent agents, there is a reasonably straightforward way to make a superintelligent agent in this class that can be shown to be innocuous by means of auditing that ordinary human beings are capable of doing reliably”.
For the avoidance of doubt, I have no difficulty at all in believing e.g. that there are auditing techniques that will guarantee (or very very nearly guarantee) that a particular agent is performing a particular computational process; I would be only modestly surprised to find that there are techniques that will verify that a particular agent is in some sense optimizing a particular objective function; but the difficulties of keeping a superintelligent AI from doing terrible things are much more complicated and include e.g. tremendous difficulty in working out what it is we really want optimized, and what computational processes we really want carried out.
Perhaps it would be useful to get a bit more concrete. Could you give an example of the sort of thing we might want a superintelligent AI to do for us, that we can’t “obviously” make it do safely without the techniques you have in mind, and explain how those techniques enable us to make it do that thing safely?