Thane Ruthenis comments on New OpenAI Paper—Language models can explain neurons in language models

Thane Ruthenis 10 May 2023 23:12 UTC
6 points
2
You need a larger model to interpret your model
Inasmuch as this shtick works at all, that doesn’t seem necessarily true to me? You need a model above some threshold of capability at which it can provide useful interpretations, yes, but I don’t see any obvious reason why that threshold would move up with the size of the model under interpretation. The number of neurons/circuits to be interpreted will increase, but the complexity of any single interpretation? At the very least, that’s a non-trivial claim in need of support.
you want to make a model understand a model
I don’t think that’s particularly risky at all. A model that wasn’t dangerous before you fed it data about some other model (or, indeed, about itself) isn’t going to become dangerous after it understands. In turn, a model that is dangerous after you let it do science, has been dangerous from the get-go.
We probably shouldn’t have trained GPT-4 to begin with; but given that we have, and didn’t die, the least we can do is fully utilize the resultant tool.
- evand 11 May 2023 2:53 UTC
  3 points
  0
  Parent
  This feels reminiscent of:
  If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.
  And while it’s a well-constructed pithy quote, I don’t think it’s true. Can a system understand itself? Can a quining computer program exist? Where is the line between being able to recite itself and being able to understand itself?
  You need a model above some threshold of capability at which it can provide useful interpretations, yes, but I don’t see any obvious reason why that threshold would move up with the size of the model under interpretation.
  Agreed. A quine needs some minimum complexity and/or language / environment support, but once you have one it’s usually easy to expand it. Things could go either way, and the question is an interesting one needing investigation, not bare assertion.
  And the answer might depend fairly strongly on whether you take steps to make the model interpretable or leave it a spaghetti-code Turing-tarpit mess.
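To make the quine point above concrete, here is a minimal Python sketch. The helper names `make_quine` and `run_source` are made up for illustration. It builds a standard `%r`-style quine and then "expands" it by carrying extra statement lines, which shows that once the self-reproducing trick is in place, adding payload is cheap. It only demonstrates reciting, and says nothing about understanding.

```python
import contextlib
import io


def make_quine(extra_lines=()):
    """Return source code for a program that prints its own source.

    extra_lines are additional statements carried along by the quine
    (hypothetical payload, used only to show that expansion is easy).
    """
    body = "".join(line + "\n" for line in extra_lines)
    # The template describes the whole program; %r is where the template's
    # own repr gets re-inserted, and literal % signs are escaped as %%.
    template = "s = %r\n" + body.replace("%", "%%") + "print(s %% s)"
    return "s = %r\n" % template + body + "print(s % s)\n"


def run_source(source):
    """Execute source in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


base = make_quine()
bigger = make_quine(["x = 6 * 7"])
print(run_source(base) == base)      # the program reproduces itself
print(run_source(bigger) == bigger)  # so does the expanded version
```

The expanded program is longer than the base one, but the self-reference mechanism is unchanged; only the payload grew.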
- rotatingpaguro 11 May 2023 0:09 UTC
  1 point
  −1
  Parent
  
  At the very least, that’s a non-trivial claim in need of support.
  
  From my point of view, I could say the opposite is rather a “non-trivial claim in need of support”. My (not particularly motivated) intuition is that a larger, smarter mind employs more sophisticated cognitive algorithms, and so analyzing its workings requires proportionally more intelligence.
  
  Example: in my experience, if I argue with someone who is less intelligent and less used to debate than I am, they are likely to perceive what I say in pieces instead of looking at the whole reasoning tree, and it is very difficult to get them to understand the “big picture”. For example, if I say “A then B”, they might understand “A and B”, or “A or B”, or “A”, or “B”. In the domain of argument, they’re not able to understand how I put all the pieces together by looking at the pieces in isolation, and it is difficult for them to even contemplate the rules I use.
  
  What are the intuitions that you use to feel the default case is the other way around?
  
  I don’t think that’s particularly risky at all. A model that wasn’t dangerous before you fed it data about some other model (or, indeed, about itself) isn’t going to become dangerous after it understands. In turn, a model that is dangerous after you let it do science, has been dangerous from the get-go.
  
  We probably shouldn’t have trained GPT-4 to begin with; but given that we have, and didn’t die, the least we can do is fully utilize the resultant tool.
  
  Ok, I think I lacked clarity. I did not mean that doing this particular research bit was not safe. I meant that the kind of paradigm that I see here, as I extrapolate it, is not safe.
  - Thane Ruthenis 11 May 2023 2:02 UTC
    3 points
    0
    Parent
    What are the intuitions that you use to feel the default case is the other way around?
    You can always factorize the problem into smaller pieces. If the interlocutor doesn’t understand “A then B” but can understand “A”, “B”, “or”, and “not” individually, you can introduce them to “not(A)”, let them get used to it until they can think of not(A) as a simple assertion C, then introduce them to or(C;B) (which implements the implication “if A then B”). It can be exhausting, but it works.
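The factorization described above can be checked mechanically. This is a minimal sketch, assuming “A then B” is read as material implication; the function names are made up for illustration. It builds the implication in the two steps described (first C = not(A), then or(C; B)) and compares the result with the implication's truth table.

```python
from itertools import product


def not_(a):
    return not a


def or_(a, b):
    return a or b


def implies(a, b):
    """Material implication "if A then B", given directly as a truth table."""
    table = {
        (False, False): True,
        (False, True): True,
        (True, False): False,
        (True, True): True,
    }
    return table[(a, b)]


def implies_factored(a, b):
    """The same implication, built only from pieces understood individually."""
    c = not_(a)       # step 1: get used to not(A) as a single assertion C
    return or_(c, b)  # step 2: or(C; B)


def truth_table(f):
    """Outputs of f over (A, B) in the order FF, FT, TF, TT."""
    return [f(a, b) for a, b in product([False, True], repeat=2)]


print(truth_table(implies) == truth_table(implies_factored))
```

Each step involves only one elementary operation over things already understood, which is the point of the factorization.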
    And in the case of larger AI models, it seems like this sort of factorization would be automatic. Their sophistication grows with the number of parameters, which means the complexity of interactions within individual fixed-size groups of parameters can be constant, or even decrease with the model’s size.
    Sure, the functions that, e.g., parameters at late layers implement may be more complex in an absolute sense; but not more complex relative to lower-layer functions.
    Toy example: if every neuron at the nth layer implements an elementary operation over two lower-layer neurons, the function at the 32nd layer would be “more complex” than any function at the 6th layer when considered from scratch, but not more complex if, by the time you get to the nth layer, you already understand everything at every preceding layer.
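The toy example can be put in numbers with a rough sketch under stated assumptions: layer 0 holds raw inputs, every neuron above it applies one elementary operation to exactly two lower-layer neurons, and the "from scratch" cost counts the fully expanded expression with no sharing of sub-expressions. The function names are hypothetical.

```python
def ops_from_scratch(layer):
    """Elementary operations in the fully expanded expression for one neuron.

    Assumes layer 0 is raw input (zero operations) and that the two
    sub-expressions feeding a neuron are expanded independently.
    """
    if layer == 0:
        return 0
    return 1 + 2 * ops_from_scratch(layer - 1)


def ops_given_lower_layers(layer):
    """Operations left to interpret once every lower layer is understood."""
    return 0 if layer == 0 else 1


for n in (6, 32):
    print(n, ops_from_scratch(n), ops_given_lower_layers(n))
```

Under these assumptions the from-scratch count is 2^n - 1, so it grows exponentially with depth, while the incremental count stays at one operation per neuron regardless of layer.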