there is nothing of value to be learnt from the AI box “experiments”.
I would disagree: even if the experiments do not deter us from running boxed AI, they might allow us to know what arguments to prepare for.
Do we know whether anyone has let the AI out of the box in experiments without the AI using simulation warfare? Because AFAICT there is a considerable amount of processing power & information required in this case, and limiting these factors while precommiting not to let the AI out could minimise the risk.
even if the experiments do not deter us from running boxed AI, they might allow us to know what arguments to prepare for.
You’re missing the point. If you ever find yourself in a situation where a human operator with the ability to let the AI out is listening to arguments from the AI, or even in a situation where the AI is doing unaudited computation, you are doing it wrong and already failed. The entire setup of the AI box game is bogus to begin with, in that it doesn’t reflect in any way a realistic oracle AI box construction or breakout scenario.
AFAICT there is a considerable amount of processing power & information required in this case, and limiting these factors while precommiting not to let the AI out could minimize the risk.
Yes, that is basically the pathway towards a realistic AI box. UFAI need not be treated as black boxes: they are performing computation, and that computation is energetically bounded and can be audited.
If the AI is meant to be much smarter than us, what auditing can we usefully do? I wouldn’t think much of our chances of getting any useful information from auditing the computation done by an ordinary human brain, never mind a possibly superhuman AI. (I can imagine that auditing might enable to tell it hadn’t been corrupted by viruses or something, but surely that’s far from the biggest concern.)
That’s a sensible sounding intuition, but it’s wrong. So what if the AI is smarter than us? We still have full access to its inner workings, and have the ability to design its internal structures in such a way as to remain transparently sensible to auditors. I turns out you can audit just about anything to whatever required level of detail with probabilistic assurances. You can use tools without general intelligence to perform the auditing task free of influence from the boxed AI, and there are cool tricks from cryptography that make it impossible for the AI to fake computational traces without doing an energetically detectable multiple of the purported work.
To better prime your intuition, think about it this way: to make a recommendation of a course of action which achieves X requires searching the space of actions for one which satisfies condition X. This has some inherent difficulty related to the size of the answer space related to the size of the search space (as well as the structure of the two spaces, but let’s simplify for now). Assuming random distributions, you can say that an a course of action requires 40 bits of identifying information, meaning that you need to perform 2^40 operations to find an instance of a solution.
Now let’s say that the answer space which contains innocuous looking, but actually evil solutions requires 60 bits. In other words, for every evil solution there is 2^20 good solutions. Then an AI which wants to deceive you into doing an action which purports to meet good requirement X, but actually meets the AI’s evil hidden requirement Y requires 2^60 operations.
You can tell whether a machine did 2^40 or 2^60 operations, and if it does the latter it better have a damn good explanation for why. (And faking that explanation would require further computation.. I hope you see where this is going.)
Is there somewhere where I can find this stuff written up, and discover exactly what is known to be (provably, or at least that’s the impression I get from what you write) achievable by auditing?
I remain gently skeptical despite your very confident tone because, e.g., we don’t currently know how to make any sort of superintelligent machine, and I at least would be rather surprised by any theorem along the lines of “given any reasonable class of superintelligent agents, there is a reasonably straightforward way to make a superintelligent agent in this class that can be shown to be innocuous by means of auditing that ordinary human beings are capable of doing reliably”.
For the avoidance of doubt, I have no difficulty at all in believing e.g. that there are auditing techniques that will guarantee (or very very nearly guarantee) that a particular agent is performing a particular computational process; I would be only modestly surprised to find that there are techniques that will verify that a particular agent is in some sense optimizing a particular objective function; but the difficulties of keeping a superintelligent AI from doing terrible things are much more complicated and include e.g. tremendous difficulty in working out what it is we really want optimized, and what computational processes we really want carried out.
Perhaps it would be useful to get a bit more concrete. Could you give an example of the sort of thing we might want a superintelligent AI to do for us, that we can’t “obviously” make it do safely without the techniques you have in mind, and explain how those techniques enable us to make it do that thing safely?
I would disagree: even if the experiments do not deter us from running boxed AI, they might allow us to know what arguments to prepare for.
Do we know whether anyone has let the AI out of the box in experiments without the AI using simulation warfare? Because AFAICT there is a considerable amount of processing power & information required in this case, and limiting these factors while precommiting not to let the AI out could minimise the risk.
… if the people conducting the experiments ever revealed enough about what happened in them to be useful for anyone else preparing. Which they don’t.
You’re missing the point. If you ever find yourself in a situation where a human operator with the ability to let the AI out is listening to arguments from the AI, or even in a situation where the AI is doing unaudited computation, you are doing it wrong and already failed. The entire setup of the AI box game is bogus to begin with, in that it doesn’t reflect in any way a realistic oracle AI box construction or breakout scenario.
Yes, that is basically the pathway towards a realistic AI box. UFAI need not be treated as black boxes: they are performing computation, and that computation is energetically bounded and can be audited.
If the AI is meant to be much smarter than us, what auditing can we usefully do? I wouldn’t think much of our chances of getting any useful information from auditing the computation done by an ordinary human brain, never mind a possibly superhuman AI. (I can imagine that auditing might enable to tell it hadn’t been corrupted by viruses or something, but surely that’s far from the biggest concern.)
That’s a sensible sounding intuition, but it’s wrong. So what if the AI is smarter than us? We still have full access to its inner workings, and have the ability to design its internal structures in such a way as to remain transparently sensible to auditors. I turns out you can audit just about anything to whatever required level of detail with probabilistic assurances. You can use tools without general intelligence to perform the auditing task free of influence from the boxed AI, and there are cool tricks from cryptography that make it impossible for the AI to fake computational traces without doing an energetically detectable multiple of the purported work.
To better prime your intuition, think about it this way: to make a recommendation of a course of action which achieves X requires searching the space of actions for one which satisfies condition X. This has some inherent difficulty related to the size of the answer space related to the size of the search space (as well as the structure of the two spaces, but let’s simplify for now). Assuming random distributions, you can say that an a course of action requires 40 bits of identifying information, meaning that you need to perform 2^40 operations to find an instance of a solution.
Now let’s say that the answer space which contains innocuous looking, but actually evil solutions requires 60 bits. In other words, for every evil solution there is 2^20 good solutions. Then an AI which wants to deceive you into doing an action which purports to meet good requirement X, but actually meets the AI’s evil hidden requirement Y requires 2^60 operations.
You can tell whether a machine did 2^40 or 2^60 operations, and if it does the latter it better have a damn good explanation for why. (And faking that explanation would require further computation.. I hope you see where this is going.)
Is there somewhere where I can find this stuff written up, and discover exactly what is known to be (provably, or at least that’s the impression I get from what you write) achievable by auditing?
I remain gently skeptical despite your very confident tone because, e.g., we don’t currently know how to make any sort of superintelligent machine, and I at least would be rather surprised by any theorem along the lines of “given any reasonable class of superintelligent agents, there is a reasonably straightforward way to make a superintelligent agent in this class that can be shown to be innocuous by means of auditing that ordinary human beings are capable of doing reliably”.
For the avoidance of doubt, I have no difficulty at all in believing e.g. that there are auditing techniques that will guarantee (or very very nearly guarantee) that a particular agent is performing a particular computational process; I would be only modestly surprised to find that there are techniques that will verify that a particular agent is in some sense optimizing a particular objective function; but the difficulties of keeping a superintelligent AI from doing terrible things are much more complicated and include e.g. tremendous difficulty in working out what it is we really want optimized, and what computational processes we really want carried out.
Perhaps it would be useful to get a bit more concrete. Could you give an example of the sort of thing we might want a superintelligent AI to do for us, that we can’t “obviously” make it do safely without the techniques you have in mind, and explain how those techniques enable us to make it do that thing safely?