But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system?
You fundamentally cannot, so it’s a moot point. There is no way to confirm zero mutual information[1], and even if there was there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.
I do not follow your seeming dismissal of this. You acknowledge it, and then… assert it’s not a problem?
An analogy: solving the Halting problem is impossible[3]. It is sometimes useful to handwave a Halting oracle as a component of proofs regardless—but at the end of the day saying ‘solving the Halting problem is easy, just use a Halting oracle’ is not a solution.
Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong.
“Many people have an intuition like “everything is an imperfect halting-problem solver; we can never avoid Turing”. The point of the Halting oracle example is that this is basically wrong.”
Hopefully this illustrates my point.
In particular, it requires calculating the distributions to infinite accuracy, which in turn requires an infinite sample. (Consider if I have two independent perfectly fair coins. I flip each of them 3x and get HHT/HHT. The empirical mutual information is non-zero!)
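The coin example in this footnote is easy to check directly: the plug-in mutual information computed from the empirical distribution of a finite sample is generically non-zero even when the variables are truly independent. A minimal sketch (the helper name is mine, not from the thread):

```python
import math
from collections import Counter

def plugin_mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Two independent fair coins, each flipped 3x: HHT / HHT.
print(plugin_mutual_information(["H", "H", "T"], ["H", "H", "T"]))  # ≈ 0.918 bits
```

The true mutual information is exactly zero, but the estimate from three flips is about 0.918 bits; only an infinite sample would drive the plug-in estimate to zero for independent variables.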
For a sufficient example: gravity causes any[4] two things in the universe[5] to correlate[6].
At least assuming the Church-Turing hypothesis is correct.
Except potentially if there’s an event horizon, although even that’s an open question, and in that case it’s a moot point because an AI behind an event horizon is indistinguishable from no AI.
Strictly speaking, within each other’s lightcone.
And as soon as you have anything causing a correlation, the probability that other factors exactly cancel said correlation is zero.
Thanks for bringing this up; it raises a technical point which didn’t make sense to include in the post but which I was hoping someone would raise in the comments.
The key point: Goodhart problems are about generalization, not approximation.
Suppose I have a proxy u′ for a true utility function u, and u′ is always within ϵ of u (i.e. |u′−u|<ϵ). I maximize u′. Then the true utility u achieved will be within 2ϵ of the maximum achievable utility. Reasoning: in the worst case, u′ is ϵ lower than u at the u-maximizing point, and ϵ higher than u at the u′-maximizing point.
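The 2ϵ bound is easy to check numerically. A sketch (the function name and random test setup are mine; `eps` plays the role of ϵ):

```python
import random

random.seed(0)

def proxy_gap(n_points=1000, eps=0.1):
    """Maximize a proxy within eps of the true utility; return the shortfall."""
    u = [random.uniform(0, 1) for _ in range(n_points)]       # true utility
    u_proxy = [ui + random.uniform(-eps, eps) for ui in u]    # |u_proxy - u| <= eps
    best = max(u)
    # Maximize the proxy, then evaluate the *true* utility at that point.
    achieved = u[max(range(n_points), key=lambda i: u_proxy[i])]
    return best - achieved

# The shortfall never exceeds 2*eps, exactly as the worst-case argument says.
print(all(proxy_gap(eps=0.1) <= 0.2 for _ in range(200)))  # True
```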
Point is: if a proxy is close to the true utility function everywhere, then we will indeed achieve close-to-maximal utility upon maximizing the proxy. Goodhart problems require the proxy to not even be approximately close, in at least some places.
When we look at real-world Goodhart problems, they indeed involve situations where some approximation only works well within some region, and ceases to even be a good approximation once we move well outside that region. That’s a generalization problem, not an approximation problem.
So approximations are fine, so long as they generalize well.
This is an interesting observation; I don’t see how it addresses my point.
There is no exact solution to mutual information from two finite samples. There is no ϵ-approximation of mutual information from two finite samples, either.
=====
On the topic of said observation: beware that ϵ-approximations of many things are proven difficult to compute, and in some cases even are uncomputable. (The classic being Chaitin’s Constant[1].)
In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable of unbounded computation, and even approximations thereof.
Unfortunately, ‘value function of a powerful AI’ tends to fall into that category[2].
Which isn’t “a” constant, but that’s another matter.
Well, as closely as anything in the physical world does, anyway.
The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.
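The halting half of this analogy can be made concrete: whether an arbitrary program halts is undecidable, yet it is routine to write software that provably avoids infinite loops, simply by restricting to constructs that terminate by construction. A toy sketch (the function is hypothetical, for illustration):

```python
# Termination of arbitrary code is undecidable, but code written in a
# restricted style -- every loop ranges over an explicit finite set --
# terminates by construction, no halting oracle required.

def bounded_search(predicate, limit):
    """Return the first n < limit satisfying predicate, else None.

    Unlike an unbounded while-loop search, this provably halts:
    the loop variable ranges over a finite set.
    """
    for n in range(limit):
        if predicate(n):
            return n
    return None

print(bounded_search(lambda n: n * n > 50, limit=100))  # 8
```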
Let us make a distinction here between two cases:
Observing the input and output of a blackbox X, and checking a property thereof.
Whitebox knowledge of X, and checking a property thereof.
In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that’s about the best we can say[2].
And yes, checking if two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking whether two specific Turing-complete whiteboxes are equivalent may be decidable.
It is not exactly the same way, due to the above.
Namely, ‘the laws of physics’.
(And worse, it often doesn’t exactly match the observations thus far, or results in contradictions.)
Trivially, due to indistinguishability issues. For any finite sequence of inputs and outputs, there are multiple machines X and X’ which produce that sequence of outputs given the input, but which have later output that diverges. This is not a problem in the whitebox case because said machines are distinguishable.
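This footnote’s claim is easy to exhibit concretely: here is a hypothetical pair of machines that agree on every observation made so far, then diverge.

```python
def machine_x(n):
    """One candidate model of the blackbox."""
    return n % 2

def machine_x_prime(n):
    """A rival model: identical on inputs 0..99, divergent from 100 onward."""
    return n % 2 if n < 100 else 1 - (n % 2)

observed = range(100)  # the finite sample of blackbox behaviour we happen to have
assert all(machine_x(n) == machine_x_prime(n) for n in observed)
print(machine_x(100), machine_x_prime(100))  # 0 1 -- the later outputs diverge
```

No finite sample distinguishes the two; with whitebox access, the difference is visible immediately in the source.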
You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So “zero” is best understood as “zero within some tolerance”. So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently.
This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable tolerances, even if we do not yet know how to do such a thing in the face of the immense optimization pressure which superhuman AGI would bring to bear on a problem.
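The “zero within some tolerance” framing can be made quantitative. A sketch under a deliberately made-up assumption: that leaked signal power decays exponentially with wall thickness, with a placeholder attenuation length (a real shielding calculation would use measured material properties and the signal’s frequency):

```python
import math

# Illustrative only: assume leaked power decays exponentially with shielding
# thickness. The attenuation length is a made-up placeholder, not a measured
# property of lead.
ATTENUATION_LENGTH_M = 0.01   # hypothetical: power falls by 1/e per cm

def leaked_power(p0_watts, thickness_m):
    """Leaked signal power remaining after passing through the wall."""
    return p0_watts * math.exp(-thickness_m / ATTENUATION_LENGTH_M)

def thickness_for_tolerance(p0_watts, tolerance_watts):
    """Wall thickness needed to bring the leak below a given tolerance."""
    return ATTENUATION_LENGTH_M * math.log(p0_watts / tolerance_watts)

# "Zero within tolerance": drive a 1 W leak below a detector's 1e-18 W floor.
print(round(thickness_for_tolerance(1.0, 1e-18), 3))  # ≈ 0.414 m
```

The point is not the specific numbers: for any fixed detection floor, a finite (human-achievable) thickness suffices, which is exactly the engineering sense of “zero”.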
We need plans to have achievable tolerances. For example, we need to assume a realistic amount of hardware failure. We can’t treat the hardware as blackboxes; we know how it operates, and we have to make use of that knowledge. But we can’t pretend perfect mathematical knowledge of it, either; we have error tolerances.
So your blackbox/whitebox dichotomy doesn’t fit the situation very well.
But do you really buy the whole analogy with mutual information, i.e. buy the claim that we can judge the viability of escaping Goodhart from this one example, and only object that the judgement with respect to this example was incorrect?
Perhaps we should really look at a range of examples, not just one? And judge John’s point as reasonable if and only if we can find some cases where effectively perfect proxies were found?
Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed. So John’s plan sounds doomed to failure, because it relies on finding an actually-perfect proxy, when all realistic proxies are imprecise at least in their physical tolerances.
In which case, I would reply that the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course. So there is hope that we will not end up in a situation where every tiny flaw is exploited. What we are looking for is plans which robustly get us to that point.
My objection is actually mostly to the example itself.
As you mention:
the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.
Compare with the example:
Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel.
[...]
But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.
This is analogous to the case of… trying to contain a malign AI which is already not on our side.
Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by “guessing true names”. I think the approach makes sense, but my argument for why this is the case does differ from John’s arguments here.
The fact that the mutual information cannot be zero is a good and interesting point. But, as I understand it, this is not fundamentally a barrier to it being a good “true name”. It’s the right target; the impossibility of hitting it exactly doesn’t change that.
This is the part I was disagreeing with, to be clear.