I think I agree that the incentive points in that direction, though I’m not sure how strongly. My general intuition is that if certain wires in a circuit are always activated across the training distribution then something has gone wrong. Maybe this doesn’t translate as well to neural networks (where there is more information conveyed than just ‘True/False’)? Does that suggest that there’s a better way to implement this in the case of neural networks (maybe we should be talking about distributions of activations, and requesting that these be broad?).
On the specifics, I think I’m confused as to what your dog classifier is. What work is it doing, if it always outputs “this is a dog”? More generally, if a subcircuit always produces the same output I would rather have it replaced with constant wires.
What work is it doing, if it always outputs “this is a dog”?
My point is that, like in the AI koan, a random circuit, or a random NN, still does something. Like, if you feed in your dog photos, it’ll start off predicting 1% for this one, 25.78% for that one, 99.76% for this other one… This is just because it is filled with random parameters at initialization and when you feed in your photos, each neuron computes something. Something totally nonsensical, but something nonetheless, and during that something, each neuron will have a distribution of activations which will almost surely not exactly equal 50% and not be independent of every other neuron. Thus, your NN is born steeped deep in sin from the perspective of the regularizer. Of course it could be replaced by a single wire, but ‘replace all the parameters of a large complex NN with a single constant wire in a single step’ is not an operation that SGD can do, so it won’t. (What will it compute after it finally beats the regularizer and finds a set of parameters which will let SGD reduce its loss while still satisfying the regularization constraints? I’m not sure, but I bet it’d look like a nasty hash-like mess, which simply happens to be independent of its input on average.)
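To make that concrete, here’s a rough numpy sketch (toy sizes, every name made up for illustration): even at initialization the gates have firing rates scattered all over the place, so an entropy-style “50/50” regularizer already charges the network a positive penalty before it has learned anything.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny random one-hidden-layer net with threshold ("gate") activations.
W = rng.normal(size=(64, 32))

# Stand-in "dog photos": random inputs with a non-zero mean, the way real
# pixel data would have.
X = rng.normal(size=(1000, 64)) + 1.0

# Boolean trace of the hidden gates over the whole batch.
H = X @ W > 0.0

# Fraction of the time each gate evaluated to True at initialization.
p = H.mean(axis=0)

# Entropy-style penalty per gate: zero only if a gate fires exactly 50% of
# the time, which essentially never happens at a random initialization.
eps = 1e-9
entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
penalty = (np.log(2) - entropy).sum()

print(np.round(p[:5], 3))  # firing rates scattered all over (0, 1)
print(penalty)             # strictly positive before any training
```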
Ok, I see. Thanks for explaining!
One thing to note, which might be a technical quibble, is that I don’t endorse the entropy version of this prior (the one that wants 50/50 activations). I started off with it because it’s simpler, but I think it breaks for exactly the reasons you say, which is why I prefer the version that wants to see “Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N.” This is very specifically so that there isn’t a drive to unnaturally force the percentages towards 50% when the true input distribution is different from that.
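Concretely, all I mean by that version is something like this rough sketch (made-up names, numpy only): a hard count over a window of recent evaluations, not a smooth penalty.

```python
import numpy as np

def satisfies_trace_prior(trace, q):
    """trace: (N, num_gates) boolean array of the last N evaluations."""
    n = trace.shape[0]
    true_counts = trace.sum(axis=0)
    false_counts = n - true_counts
    return bool(np.all(true_counts >= q) and np.all(false_counts >= q))

rng = np.random.default_rng(1)

# 10,000 evaluations of 32 gates, each outcome required at least q = 10 times.
trace = rng.random((10_000, 32)) < 0.5
print(satisfies_trace_prior(trace, q=10))   # True: every gate sees both outcomes

trace[:, 0] = True                          # one gate that always evaluates to T
print(satisfies_trace_prior(trace, q=10))   # False: it never evaluates to F
```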
Setting that aside: I think what this highlights is that the translation from “a prior over circuits” to “a regularizer for NN’s” is pretty nontrivial, and things that are reasonably behaved in one space can be very bad in the other. If I’m sampling boolean circuits from a one-gate trace prior I just immediately find the solution of ‘they’re all dogs, so put a constant wire in’. Whereas with neural networks we can’t jump straight to that solution and may end up doing more contrived things along the way.
which is why I prefer the version that wants to see “Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N.”
Yeah, I skipped over that because I don’t see how one would implement that. That doesn’t sound very differentiable? Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function? Even if you have some differentiable trick for that, it’s easier to explain my objections concretely with 50%. But I don’t have anything further to say about that at the moment.
Setting that aside: I think what this highlights is that the translation from “a prior over circuits” to “a regularizer for NN’s” is pretty nontrivial, and things that are reasonably behaved in one space can be very bad in the other
Absolutely. You are messing around with weird machines and layers of interpreters, and simple security properties or simple translations go right out the window as soon as you have anything adversarial or optimization-related involved.
Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function?
That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!
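Roughly the kind of thing I had in mind, as a toy sketch (everything here is made up, and the batch stands in for the “last N evaluations” window): mutate the weights, and only keep a mutant if it lowers the loss while the gate traces still show both outcomes often enough. One could just as well fold the trace condition into the fitness score as a penalty rather than a hard accept/reject.

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.normal(size=(512, 16))           # stand-in inputs
y = (X[:, 0] > 0).astype(float)          # stand-in binary labels

def evaluate(W):
    h = X @ W > 0.0                      # boolean trace of the gates
    pred = h.mean(axis=1)                # crude readout, just for the sketch
    return np.mean((pred - y) ** 2), h

def trace_ok(h, q=10):
    true_counts = h.sum(axis=0)
    return bool(np.all(true_counts >= q) and np.all(h.shape[0] - true_counts >= q))

W = rng.normal(size=(16, 32))
best_loss, _ = evaluate(W)

for step in range(2000):
    candidate = W + 0.05 * rng.normal(size=W.shape)   # mutate the weights
    loss, h = evaluate(candidate)
    if loss < best_loss and trace_ok(h):              # fitness + prior as a filter
        W, best_loss = candidate, loss

print(best_loss)
```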