At the very least, that’s a non-trivial claim in need of support.
From my point of view, I could say the opposite is rather a “non-trivial claim in need of support”. My (not particularly motivated) intuition is that a larger, smarter mind employs more sophisticated cognitive algorithms, and so analyzing its workings requires proportionally more intelligence.
Example: In my experience, if I argue with someone less intelligent and less practiced at debate than me, it is likely that they’ll perceive what I say in pieces instead of looking at the whole reasoning tree, and it is very difficult to get them to understand the “big picture”. For example, if I say “A then B”, they might understand “A and B”, or “A or B”, or “A”, or “B”. In the domain of argument, by looking at the pieces in isolation, they are not able to understand how I put all the pieces together, and it is difficult for them to even contemplate the rules I use.
What are the intuitions that make you feel the default case is the other way around?
I don’t think that’s particularly risky at all. A model that wasn’t dangerous before you fed it data about some other model (or, indeed, about itself) isn’t going to become dangerous after it understands that data. In turn, a model that is dangerous after you let it do science has been dangerous from the get-go.
We probably shouldn’t have trained GPT-4 to begin with; but given that we have, and didn’t die, the least we can do is fully utilize the resultant tool.
Ok, I think I lacked clarity. I did not mean that doing this particular bit of research was unsafe. I meant that the kind of paradigm I see here, as I extrapolate it, is not safe.
What are the intuitions that make you feel the default case is the other way around?
You can always factorize the problem into smaller pieces. If the interlocutor doesn’t understand “A then B” but can understand “A”, “B”, “or”, and “not” individually, you can introduce them to “not(A)”, let them get used to it until they can think of not(A) as a simple assertion C, then introduce them to or(C;B) (which implements the implication “if A then B”). It can be exhausting, but it works.
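A minimal sketch of that factorization in Python (the function names are mine, purely for illustration): “if A then B” reduces to the single connective or(not(A); B), and the compound and factorized formulations coincide on every truth assignment:

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """The compound claim 'if A then B' (material implication)."""
    return (not a) or b

def factorized(a: bool, b: bool) -> bool:
    """The same claim built from pieces the interlocutor already understands."""
    c = not a       # step 1: introduce C = not(A) as its own simple assertion
    return c or b   # step 2: or(C; B) implements "if A then B"

# The two formulations agree on every truth assignment.
for a, b in product([False, True], repeat=2):
    assert implies(a, b) == factorized(a, b)
print("'A then B' and 'not(A) or B' coincide on all four cases")
```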
And in the case of larger AI models, it seems like this sort of factorization would be automatic. Their sophistication grows with the number of parameters, which means the complexity of interactions within individual fixed-size groups of parameters can stay constant, or even decrease with the model’s size.
Sure, the functions that e.g. parameters at late layers implement may be more complex in an absolute sense; but not more complex relative to lower-layer functions.
Toy example: if every neuron at the nth layer implements an elementary operation over two lower-layer neurons, the function at the 32nd layer would be “more complex” than any function at the 6th layer when considered from scratch, but not more complex if, by the time you get to the nth layer, you already understand everything at every preceding layer.
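To make that toy example concrete, here is a hypothetical Python sketch (the operation set and network shape are invented for illustration, not taken from any real architecture): each “neuron” applies one elementary operation to two neurons from the layer below, so interpreting layer n costs a constant amount of extra work once layer n-1 is understood, even though the fully unrolled expression at layer 32 dwarfs anything at layer 6.

```python
import random

# Elementary two-argument operations a "neuron" might implement.
# Each maps [0, 1] x [0, 1] back into [0, 1], so values stay bounded.
ELEMENTARY_OPS = {
    "and": lambda a, b: a * b,
    "or":  lambda a, b: a + b - a * b,
    "max": max,
    "min": min,
}

def build_toy_network(n_layers, width, seed=0):
    """Each neuron = (operation name, left parent index, right parent index)."""
    rng = random.Random(seed)
    return [
        [(rng.choice(list(ELEMENTARY_OPS)), rng.randrange(width), rng.randrange(width))
         for _ in range(width)]
        for _ in range(n_layers)
    ]

def run(network, inputs):
    """Interpret the network one layer at a time.

    Understanding layer n only requires the already-understood values of
    layer n-1 plus one elementary operation per neuron: constant work per
    layer, no matter how deep the network gets.
    """
    values = list(inputs)
    for layer in network:
        values = [ELEMENTARY_OPS[op](values[i], values[j]) for op, i, j in layer]
    return values

net = build_toy_network(n_layers=32, width=4)
print(run(net, [0.1, 0.5, 0.9, 0.3]))
```

The point of the sketch is only that the interpreter’s per-layer effort is independent of depth, which is the sense in which a 32nd-layer function isn’t more complex relative to the lower-layer functions you’ve already worked through.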