For white box vs black box, after further discussion I wound up feeling like people just use the term “black box” differently in different fields, and in practice maybe I’ll just taboo “black box” and “white box” going forward. Hopefully we can all agree on:
If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer.
And likewise we can surely all agree that future AI programmers will be able to see the weights and perform SGD.
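(To make the "see the weights and perform SGD" point concrete, here is a minimal Python sketch, assuming an open-weights model such as GPT-2 loaded through Hugging Face transformers; the model choice and the toy next-token objective are illustrative assumptions. Being able to run this is, of course, not the same as having an intuitive answer to why the model outputs A rather than B.)

```python
# Minimal sketch: "seeing the weights and performing SGD" in the literal sense,
# assuming an open-weights model (GPT-2 via Hugging Face) purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

# We can inspect any parameter tensor directly -- every number is visible.
w = model.transformer.h[0].attn.c_attn.weight
print(w.shape, w[:3, :3])

# ...and we can nudge all of the parameters with a gradient step toward
# whatever objective we pick (here, a toy next-token prediction loss).
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
batch = tok("The quick brown fox", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
opt.step()
```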
If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer.
I don’t think any complete description of the LLM is going to be intuitive to a human, because it’s just too complex to fit in your head all at once. The best we can do is to come up with interpretations for selected components of the network. Just like a book or a poem, there’s not going to be a unique correct interpretation: different interpretations are useful for different purposes.
There's also no guarantee that any of these mechanistic interpretations will be the most useful tool for what you're trying to do (e.g. make sure the model doesn't kill you, or whatever). The track record of mech interp for alignment is quite poor, especially compared to gradient based methods like RLHF. We should accept the Bitter Lesson: SGD is better than you at alignment.
The track record of mech interp for alignment is quite poor, especially compared to gradient based methods like RLHF.

I would definitely like to see that argument made, as I suspect that a lot of LWers might disagree with this statement.
I think this is essentially what people mean when they say “LLMs are a black box” and since you seem to be agreeing, I find myself very confused that you’ve been pushing a “white box” talking point.
It seems that all parties including Nora agree with “If a LLM outputs A rather than B, and you ask me why, then it might take me decades of work to give you a reasonable & intuitive answer”. The disagreements are (1) whether we should care—i.e., whether this fact is important and worrisome in the context of safe & beneficial AGI, (2) what the terms “black box” and “white box” mean.
I think Nora’s comment here was taking an opportunity to argue her side of (1).
In Nora’s recent post, to her credit, she defined exactly what she meant by “white box” the first time she used the term, and her discussion was valid given that definition.
I think her recent post (and ditto the OP here) would have been clearer if she had (A) noted that people in the AGI safety space sometimes use “black box” to say something like the “decades of work” claim above, (B) explicitly said that the “decades of work” claim is obviously true and totally uncontroversial, (C) clarified that this popular definition of “black box / white box” is not the definition she’s using in this post.
(A similar suggestion also applies to the other side of the debate including me, i.e. in the unlikely event that I use the term “black box” to mean the “decades of work” thing, in my future writing, I plan to immediately define it and also explicitly say that I’m not using the term to discuss whether or not you can see the weights and perform SGD.)
Hmm, I guess the point of using the term "white box" is to emphasize that it is not a literal black box, while the point of the term "black box" is that even though the system is literally transparent, we still don't understand it in the ways that matter. There's something that feels really off about the dynamic of term use here, but I can't quite articulate it.
The terms “white box” and “black box”, like pretty much all terms, are more than just their literal definitions, they are also trojan horses full of connotations and vibes. So of course it’s natural (albeit unfortunate and annoying) for people on both sides of a debate to try to get those connotations and vibes to work in service of their side. :-P
I'll edit the post soon to make clear that the white-box definition I'm using is not the standard one, and instead refers to the computer analysis/security sense of the term.
I definitely agree that tabooing "white box" vs "black box" is good. One point, though: the innate reward system does targeted updates to neural circuits using simple learning rules, which means we can probably use SGD as our own version of an innate reward system, combined with a weak prior, and get good results.
Admittedly, I do think the pathway isn't as complete as I'd like, but I do consider the ability to see the weights, check the Hessians, etc. to be an extremely powerful set of alignment tools, more powerful than is generally appreciated.
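(For concreteness, here is a minimal toy sketch of one thing "checking the Hessians" can mean in practice: a Hessian-vector product computed by double backpropagation. The tiny linear model and the v^T H v curvature probe are illustrative assumptions, not a description of any particular alignment workflow.)

```python
# Minimal sketch of a Hessian-vector product via double backprop
# (toy model; a curvature probe like this is one possible white-box tool,
#  not a claim about any specific alignment method).
import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

params = list(model.parameters())
grads = torch.autograd.grad(loss, params, create_graph=True)

# Hessian-vector product: differentiate (grad . v) again w.r.t. the parameters.
v = [torch.randn_like(p) for p in params]
dot = sum((g * vi).sum() for g, vi in zip(grads, v))
hvp = torch.autograd.grad(dot, params)

# Rough curvature along direction v (a crude proxy, not a full eigen-analysis): v^T H v
curvature = sum((h * vi).sum() for h, vi in zip(hvp, v))
print(curvature.item())
```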