Indeed. One would need to have some whole theory of what kinds of relevant structures are present in the environment, and how to represent them.
And that’s why I started with abstraction.
Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: ConceptNet; less crude: human-collected similarity judgments over visual features and classes) is a good enough stand-in for the “relevant structures” (or at least that these human representations capture the natural abstractions more faithfully than the best machines do, e.g. in vision tasks where human performance is the benchmark), right? (A rough sketch of the kind of comparison this suggests appears below, after this comment.)
I had a similar idea about ontology-mismatch identification via checking for isomorphic structures, and also realized I had no idea how to realize it. Through discussions with Stephen Casper and Ilia Sucholutsky, we pivoted that idea toward interpretability/adversarial robustness: hunting for interesting properties, given that we can identify the biggest ways humans and machines represent things differently (and that humans, for now, are doing it “better”: more efficiently, and more in line with the natural abstraction structures that exist).
I think I am working in the same building this summer (caught a split-second glance at you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial-robustness project I have been thinking about.
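To make the proposed experiment concrete: one minimal version is to correlate a human-derived similarity matrix (e.g. from pairwise similarity judgments) with a model-derived similarity matrix over the same items, RSA-style. A sketch under that assumption; the function names and toy data are illustrative, not from an actual experiment:

```python
# Sketch: compare a human-derived similarity matrix with a model-derived one
# over the same items, via Spearman correlation of the off-diagonal structure.
import numpy as np
from scipy.stats import spearmanr

def upper_triangle(sim):
    """Flatten the strict upper triangle of a square similarity matrix."""
    i, j = np.triu_indices_from(sim, k=1)
    return sim[i, j]

def representational_alignment(human_sim, model_sim):
    """Spearman rho between the two similarity structures (higher = more aligned)."""
    rho, p_value = spearmanr(upper_triangle(human_sim), upper_triangle(model_sim))
    return rho, p_value

# Toy usage with random symmetric matrices standing in for real judgment data.
rng = np.random.default_rng(0)
n_items = 12
h = rng.random((n_items, n_items)); human_sim = (h + h.T) / 2
m = rng.random((n_items, n_items)); model_sim = (m + m.T) / 2
print(representational_alignment(human_sim, model_sim))
```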
Those structures would likely also be represented with neural nets, though, right? So in practice it seems like it would end up quite similar to looking for isomorphic structures between neural networks, except you specifically want to design a highly interpretable kind of neural network and then look for isomorphisms between this interpretable neural network and other neural networks.
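One common way to operationalize “looking for isomorphic structures between neural networks” is to compare the two networks’ activations on the same inputs, for example with linear centered kernel alignment (CKA). A minimal sketch, assuming the activation matrices have already been extracted (names are illustrative):

```python
# Sketch: linear CKA between activation matrices from two networks,
# evaluated on the same set of inputs.
import numpy as np

def linear_cka(acts_a, acts_b):
    """Linear CKA between (n_samples x d_a) and (n_samples x d_b) activations."""
    # Center each representation across samples.
    acts_a = acts_a - acts_a.mean(axis=0, keepdims=True)
    acts_b = acts_b - acts_b.mean(axis=0, keepdims=True)
    # ||A^T B||_F^2 / (||A^T A||_F * ||B^T B||_F)
    cross = np.linalg.norm(acts_a.T @ acts_b, ord="fro") ** 2
    norm_a = np.linalg.norm(acts_a.T @ acts_a, ord="fro")
    norm_b = np.linalg.norm(acts_b.T @ acts_b, ord="fro")
    return cross / (norm_a * norm_b)

# Toy usage: two random "layers" evaluated on the same 100 inputs.
rng = np.random.default_rng(0)
layer_a = rng.normal(size=(100, 32))
layer_b = rng.normal(size=(100, 64))
print(linear_cka(layer_a, layer_b))  # low for unrelated random activations
```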
They would not necessarily be represented with neural nets, unless you’re using “neural nets” to refer to circuits in general.
I think by “neural nets” I mean “circuits that get optimized through GD-like optimization techniques and where the vast majority of degrees of freedom for the optimization process come from big matrix multiplications”.
Yeah, I definitely don’t expect that to be the typical representation; I expect neither an optimization-based training process nor lots of big matrix multiplications.
Interesting and exciting. Can you reveal more about how you expect it to work?
Based on the forms in Maxent and Abstraction, I expect (possibly nested) sums of local functions to be the main feature-representation. Figuring out which local functions to sum might be done, in practice, by backing equivalent sums out of a trained neural net, but the net wouldn’t be part of the representation.
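Roughly, under the standard maximum-entropy setup with constraints on the expectations of local functions $f_i$, each depending only on a small subset of variables $X_{S_i}$ (notation schematic), the distribution takes the form

$$\log P[X] = \sum_i \lambda_i \, f_i\!\left(X_{S_i}\right) - \log Z,$$

so the associated features are themselves sums of local terms, and nesting such sums gives the “(possibly nested) sums of local functions” above, with no big matrix multiplications required.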