If human simulation is much simpler and faster than direct translation, then an obfuscated human-simulator would also be simpler and faster than a direct translator.
That is not obvious at all; indistinguishability obfuscation is a problem that seems to inherently require enormous amounts of computation. From the Wikipedia page you linked:
There have been attempts to implement and benchmark IO candidates. For example, as of 2017, an obfuscation of the function x₁ ∧ x₂ ∧ … ∧ x₃₂ at a security level of 80 bits took 23.5 minutes to produce and measured 11.6 GB, with an evaluation time of 77 ms. Additionally, an obfuscation of the Advanced Encryption Standard hash function at a security level of 128 bits would measure 18 PB and have an evaluation time of about 272 years.
And the line below that links to a paper about a method whose cost is exponential in the size of the circuit being obfuscated:
To give an idea about the practicality of this construction, consider a 2-bit multiplication circuit. It requires 4 inputs and between 1 and 8 AND gates for each of its 4 output bits. An obfuscation would be generated in about 10²⁷ years on a 2.6 GHz CPU and would require 20 zettabytes of memory for m = 1 and p = 1049. Executing this circuit on the same CPU would take 1.3 × 10⁸ years.
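To put those quoted figures in perspective, here is a quick back-of-the-envelope calculation in plain Python using only the numbers above; the baselines for an unobfuscated implementation (a few dozen bytes for the AND circuit, about a microsecond per AES block) are rough assumptions of mine, not figures from the quotes:

```python
# Back-of-the-envelope arithmetic on the IO benchmarks quoted above.
# Only the obfuscated-side numbers come from the quotes; the plain-side
# baselines are rough assumptions for the sake of comparison.
GB = 10**9
PB = 10**15

# 32-input AND (80-bit security): 11.6 GB obfuscated, vs. a function
# that a few dozen bytes of code could express directly.
and32_obf_size = 11.6 * GB
and32_plain_size = 32  # bytes (assumed)
print(f"AND size blowup: ~{and32_obf_size / and32_plain_size:.1e}x")

# AES (128-bit security): 18 PB and ~272 years per evaluation, vs. an
# assumed ~1 microsecond per block for an ordinary implementation.
aes_obf_eval_s = 272 * 365 * 24 * 3600
print(f"AES eval slowdown: ~{aes_obf_eval_s / 1e-6:.1e}x")
```

Even granting large error bars on the assumed baselines, the blowups are many orders of magnitude, which is the point being made here.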
I was unable to find anything better, and I doubt this could be improved to the point where an obfuscated human simulator would be hard to distinguish, in size and speed, from a simple direct translator.
My thinking is: if you can do IO at all, then you can do it with a constant-factor blowup (depending only on the security parameter), so if the gap between the human simulator and the direct translator can be arbitrarily large, then obfuscation only closes a constant amount of that gap. And requiring the breaker to use the full power of IO is relatively optimistic: the counterexample probably doesn't need anything as strong (or as expensive) as full IO.
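To spell that argument out in symbols (ad-hoc notation of mine, not from the report): write C(P) for the complexity of a program P, S for the human simulator, T for the direct translator, and c(λ) for the IO blowup at security parameter λ. Then:

```latex
% Sketch of the constant-factor argument (ad-hoc notation, not a theorem).
% C(P): complexity of program P; S: human simulator; T: direct translator;
% c(\lambda): IO blowup factor at security parameter \lambda.
C\big(\mathrm{iO}(S)\big) \;\le\; c(\lambda)\, C(S) \;<\; C(T)
\qquad \text{whenever} \quad \frac{C(T)}{C(S)} > c(\lambda).
```

Since c(λ) is fixed once the security parameter is chosen, while the gap C(T)/C(S) can grow without bound as the predictor grows, the obfuscated simulator eventually undercuts the direct translator.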
ETA: looking at this in a more zoomed-out way, I'd say the current state of affairs is that it's often very hard to say much about "what a model is doing" given its weights, and if you specifically train it to be incomprehensible then it feels like it could quickly get much worse. You might wonder what the theoretical situation is (in the limit, does the situation favor the obfuscator or the interpreter?), and IO is mostly relevant because it shows that the obfuscator will win in the limit (for the particular quantitative question that matters for evaluating these proposals). This isn't supposed to be a bizarre claim supported only by appeal to IO (although that alone would suffice for a counterexample in our methodology); it is intended to be consistent with the intuitive sense of despair that many people (e.g. me) get when they stare at some giant weight matrices, already find them fairly hard to make sense of, and then imagine them being optimized to be incomprehensible.
Since the point of the contest is to discover a method for which there is no conceivable way it could fail, I grant you that, given the uncertainty about what would really happen, that counterexample suffices. But I still believe that in practice the interpreter will win. I think you are overextending your intuition: neural networks are incomprehensible to you because you are human, but activations in a network are what they are because they have been optimized so that the rest of the network can process ("understand") them. So if a value can be written and read by one network, another one could do the same, since they are both being optimized to do it. It is only through complex cryptographic magic that we can avoid that (and maybe not even then, if it turns out that IO can only create programs that are exponentially large).
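As a toy illustration of the claim that a value one network writes into its activations can be read out by another optimizer (a minimal linear-probe sketch in PyTorch; the synthetic activations, the dimensions, and the adequacy of a linear probe are all illustrative assumptions of mine):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a trained network's hidden activations: a latent value z
# is embedded linearly (plus noise) into a 64-dim hidden state h, the way
# activations get shaped so that downstream layers can consume them.
z = torch.rand(4096, 1)                    # latent value the network "knows"
W = torch.randn(1, 64)                     # fixed embedding of z into h
h = z @ W + 0.05 * torch.randn(4096, 64)   # hidden activations

# A separate "reader" network, optimized to recover z from h alone.
probe = nn.Linear(64, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(500):
    loss = nn.functional.mse_loss(probe(h), z)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"probe MSE: {loss.item():.5f}")  # small => the value was readable
```

Here the probe recovers the value easily, precisely because the activations were shaped to be consumed; the dispute is over whether optimizing a network to be incomprehensible destroys exactly this property.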
I also believe that gradient descent can learn to use the activations of an obfuscated reporter (and indeed I frequently rely on some currently-plausible intuitive assumption like "gradient descent can't obfuscate something from itself"). But this isn't enough for these proposals to work: they need the relationship between the reporter and its obfuscated version to be quantitatively "simpler" than the relationship between the direct translator and the human simulator.