I also believe that gradient descent can learn to use the activations of an obfuscated reporter (and indeed I frequently rely on some currently-plausible intuitive assumption like “gradient descent can’t obfuscate something from itself”). But this isn’t enough for these proposals to work, they need the relationship between the reporter and its obfuscated version to be quantitatively “simpler” than the relationship between the direct translator and human simulator.
I also believe that gradient descent can learn to use the activations of an obfuscated reporter (and indeed I frequently rely on some currently-plausible intuitive assumption like “gradient descent can’t obfuscate something from itself”). But this isn’t enough for these proposals to work, they need the relationship between the reporter and its obfuscated version to be quantitatively “simpler” than the relationship between the direct translator and human simulator.