The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.
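To make the halting half of this analogy concrete, here is a minimal sketch of a loop that terminates by construction, even though nobody has proved that the unbounded version halts for every input. The function name and the `fuel` parameter are hypothetical, chosen only for illustration.

```python
from typing import Optional


def collatz_steps_bounded(n: int, fuel: int = 10_000) -> Optional[int]:
    """Count Collatz steps from n down to 1, giving up after `fuel` iterations.

    The loop body runs at most `fuel` times, so termination is guaranteed by
    the structure of the code, not by any general halting analysis.
    """
    for steps in range(fuel):
        if n == 1:
            return steps
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    return None  # out of fuel: we report "unknown" instead of looping forever


print(collatz_steps_bounded(6))          # 8
print(collatz_steps_bounded(6, fuel=3))  # None
```

The design choice is to restrict ourselves to programs whose termination is obvious, at the cost of sometimes answering "unknown".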
Let us make a distinction here between two cases:
Observing the input and output of a blackbox X, and checking a property thereof.
Whitebox knowledge of X, and checking a property thereof.
In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that’s about the best we can say[2].
And yes, checking whether two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking whether two specific Turing-complete whiteboxes are equivalent may be decidable.
It is not exactly the same way, due to the above.
[1] Namely, ‘the laws of physics’.
[2] (And worse, it often doesn’t exactly match the observations thus far, or results in contradictions.)
[3] Trivially, due to indistinguishability issues. For any finite sequence of inputs and outputs, there are multiple machines X and X’ which produce that sequence of outputs given the input, but whose later outputs diverge. This is not a problem in the whitebox case because said machines are distinguishable.
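The indistinguishability point can be sketched in a few lines. The two "machines" below are hypothetical toy functions, not models of any real system: they agree on every input in the finite sample we observed, and differ on an input we never tried.

```python
from typing import Callable, Iterable


def machine_x(n: int) -> int:
    """Hypothetical machine X: doubles its input."""
    return 2 * n


def machine_x_prime(n: int) -> int:
    """Hypothetical machine X': matches X below 1000, then diverges."""
    return 2 * n if n < 1000 else 2 * n + 1


def agree_on(f: Callable[[int], int], g: Callable[[int], int],
             inputs: Iterable[int]) -> bool:
    """Blackbox check: do f and g give the same output on these inputs?"""
    return all(f(i) == g(i) for i in inputs)


observed_inputs = range(100)  # the finite sample we happened to collect
print(agree_on(machine_x, machine_x_prime, observed_inputs))  # True
print(machine_x(1000) == machine_x_prime(1000))               # False
```

Reading the source of `machine_x_prime` (the whitebox case) reveals the divergence immediately; no finite set of blackbox observations below 1000 ever would.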
You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So “zero” is best understood as “zero within some tolerance”. So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently.
This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable tolerances, even if we do not yet know how to do such a thing in the face of the immense optimization pressure which superhuman AGI would bring to bear on a problem.
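As a rough sketch of what “zero within some tolerance” could look like in practice, the snippet below estimates mutual information from samples and compares it against a threshold. Everything here is an assumption made for illustration: shielding is modelled as a probability of flipping a leaked bit, the estimator is the simple plug-in estimator, and `TOLERANCE_BITS` is an arbitrary number.

```python
import math
import random
from collections import Counter

TOLERANCE_BITS = 0.01  # hypothetical engineering tolerance


def empirical_mi_bits(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written in terms of raw counts
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def leak_samples(flip_prob, n=20_000, seed=0):
    """Toy model: a secret fair bit leaks through a channel that flips it
    with probability flip_prob (more shielding = closer to 0.5)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        secret = rng.randrange(2)
        flipped = rng.random() < flip_prob
        pairs.append((secret, secret ^ flipped))
    return pairs


for flip_prob in (0.0, 0.25, 0.5):
    mi = empirical_mi_bits(leak_samples(flip_prob))
    print(flip_prob, round(mi, 4), "within tolerance:", mi < TOLERANCE_BITS)
```

Note that a small estimate from a finite sample is evidence about the channel, not a proof; the estimator only sees the pairs it was given.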
We need plans to have achievable tolerances. For example, we need to assume a realistic amount of hardware failure. We can’t treat the hardware as blackboxes; we know how it operates, and we have to make use of that knowledge. But we can’t pretend perfect mathematical knowledge of it, either; we have error tolerances.
So your blackbox/whitebox dichotomy doesn’t fit the situation very well.
But do you really buy the whole analogy with mutual information, i.e., buy the claim that we can judge the viability of escaping Goodhart from this one example, and only object that the judgement with respect to this example was incorrect?
Perhaps we should really look at a range of examples, not just one? And judge John’s point as reasonable if and only if we can find some cases where effectively perfect proxies were found?
Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed. So John’s plan sounds doomed to failure, because it relies on finding an actually-perfect proxy, when all realistic proxies are imprecise at least in their physical tolerances.
In which case, I would reply that the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course. So there is hope that we will not end up in a situation where every tiny flaw is exploited. What we are looking for is plans which robustly get us to that point.
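“Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed.”
My objection is actually mostly to the example itself.
As you mention:
“the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.”
Compare with the example:
“Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel.
[...]
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.”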
This is analogous to the case of… trying to contain a malign AI which is already not on our side.
Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by “guessing true names”. I think the approach makes sense, but my argument for why this is the case does differ from John’s arguments here.
I think TLW’s criticism is important, and I don’t think your responses are sufficient. I also think the original example is confusing; I’ve met several people who, after reading OP, seemed to me confused about how engineers could use the concept of mutual information.
Here is my attempt to expand your argument.
We’re trying to design some secure electronic equipment. We want the internal state and some of the outputs to be secret. Maybe we want all of the outputs to be secret, but we’ve given up on that (for example, radio shielding might not be practical or reliable enough). When we’re trying to design things so that the internal state and outputs are secret, there are a couple of sources of failure.
One source of failure is failing to model the interactions between the components of our systems. Maybe there is an output we don’t know about (like the vibrations the electronics make while operating), or maybe there is an interaction we’re not aware of (like magnetic coupling between two components we’re treating as independent).
Another source of failure is that we failed to consider all the ways that an adversary could exploit the interactions we do know about. In your example, we fail to consider how an adversary could exploit higher-order correlations between emitted radio waves and the state of the electronic internals.
A true name, in principle, allows us to avoid the second kind of failure. In high-dimensional state spaces, we might need to get kind of clever to prove the lack of mutual information. But it’s a fairly delimited analytic problem, and we at least know what a good answer would look like.
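A small worked example of why a correlation-style proxy can miss what mutual information catches. The toy distribution is an assumption chosen for illustration: the internal state X is uniform on {-1, 0, 1} and the leaked signal is Y = X². The covariance is exactly zero, yet Y still carries most of a bit of information about X.

```python
import math
from fractions import Fraction

# Hypothetical toy model: X uniform on {-1, 0, 1}, leaked signal Y = X**2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def marginal(joint, index):
    """Marginal distribution of coordinate `index` (0 for X, 1 for Y)."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[index]] = out.get(outcome[index], 0) + p
    return out


def covariance(joint):
    """Exact Cov(X, Y) = E[XY] - E[X]E[Y]."""
    ex = sum(p * x for (x, _), p in joint.items())
    ey = sum(p * y for (_, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return exy - ex * ey


def exact_mi_bits(joint):
    """Exact I(X;Y) in bits for a finite joint distribution."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(float(p) * math.log2(float(p / (px[x] * py[y])))
               for (x, y), p in joint.items() if p > 0)


print(covariance(joint))               # 0
print(round(exact_mi_bits(joint), 3))  # 0.918
```

A defender who only checked linear correlation would sign off on this channel; one who checked mutual information would not.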
The true name could also guide our investigations into our system, to help us avoid the first kind of failure. “Huh, we just made the adder have a more complicated behaviour as an optimisation. Could the unevenness of that optimisation over the input distribution leak information about the adder’s inputs to another part of the system?”
Now, reader, you might worry that the chosen example of a True Name leaves an implementation gap wide enough for a human adversary to drive an exploit through. And I think that’s a pretty good complaint. The best defence I can muster is that it guides and organises the defender’s thinking. You get to do proofs-given-assumptions, and you get more clarity about how to think if your assumptions are wrong.
To the extent that the idea is that True Names are part of a strategy to come up with approaches that are unbounded-optimisation-proof, I think that defence doesn’t work and the strategy is kind of sunk.
On the other hand, here is an argument that I find plausible. In the end, we’ve got to make some argument that when we flick some switch or continue down some road, things will be OK. And there’s a big messy space of considerations to navigate to that end. True Names are necessary to have any hope of compressing the domain enough that you can make arguments that stand up.
I think that’s basically right, and good job explaining it clearly and compactly.
I would also highlight that it’s not just about adversaries. One of the main powers of proof-given-assumptions is that it allows us to rule out large classes of unknown unknowns in one go. And, insofar as the things-proven-given-assumptions turn out to be false, it allows us to detect previously-unknown unknowns.