Is there somewhere where I can find this stuff written up, and discover exactly what is known to be (provably, or at least that’s the impression I get from what you write) achievable by auditing?
I remain gently skeptical despite your very confident tone because, e.g., we don’t currently know how to make any sort of superintelligent machine, and I at least would be rather surprised by any theorem along the lines of “given any reasonable class of superintelligent agents, there is a reasonably straightforward way to make a superintelligent agent in this class that can be shown to be innocuous by means of auditing that ordinary human beings are capable of doing reliably”.
For the avoidance of doubt, I have no difficulty at all in believing e.g. that there are auditing techniques that will guarantee (or very very nearly guarantee) that a particular agent is performing a particular computational process; I would be only modestly surprised to find that there are techniques that will verify that a particular agent is in some sense optimizing a particular objective function; but the difficulties of keeping a superintelligent AI from doing terrible things are much more complicated and include e.g. tremendous difficulty in working out what it is we really want optimized, and what computational processes we really want carried out.
Perhaps it would be useful to get a bit more concrete. Could you give an example of the sort of thing we might want a superintelligent AI to do for us, that we can’t “obviously” make it do safely without the techniques you have in mind, and explain how those techniques enable us to make it do that thing safely?