(Léo Dana) French master's student in applied mathematics (probability & statistics), soon starting a PhD in mathematics in Paris
WCargo
Visualizing small Attention-only Transformers
Results from the Turing Seminar hackathon
On Interpretability’s Robustness
Thanks for the post Ellena!
I was wondering if the finding “words are clustered by vocal and semantic similarity” also exists in traditional LLMs? I don’t remember seeing that, so could it mean that this modularity could also make interpretability easier?
It seems logical: we have more structure in the data, so a better way to cluster the text, but I’m curious about your opinion.
Hi, interesting experiments. What were you trying to find, and how would you measure that the content is correctly mixed instead of just having “unrelated concepts juxtaposed”?
Also, how did you choose the layer at which to merge your streams?
Hi, thank you for the sequence. Do you know if there is any way to access Watanabe’s book for free?
In an MLP, nodes in different layers are in series (you go through the first, then the second), while nodes inside the same layer are in parallel (you go through one or the other).
The analogy is with electrical circuits, but I was mostly thinking in terms of LLM components: the MLPs and attention layers are in series (you go through the first and then the second), but inside one component, its parts are in parallel.
My guess is that inside a component there is then less superposition (this post is evidence for that), while between components there is redundancy (so if a computation fails somewhere, it is also done somewhere else).
In general, my intuition about dropout is that, because some parts of the network are not going to work, the network has to implement “independent” components in order to compute things properly.
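To make the redundancy intuition a bit more concrete, here is a minimal toy sketch. It assumes something much cruder than real dropout: each of `n_copies` redundant components is switched off as a whole, independently, with probability `p_drop` (real dropout zeroes individual activations, and the function names here are hypothetical). Under that assumption, the computation survives as long as at least one copy is still active.

```python
import random


def survival_probability(p_drop: float, n_copies: int) -> float:
    """Chance that at least one of n_copies independent redundant copies is kept."""
    return 1.0 - p_drop ** n_copies


def simulate(p_drop: float, n_copies: int, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (toy model, whole components dropped)."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # A copy is kept when its random draw is at least p_drop.
        if any(rng.random() >= p_drop for _ in range(n_copies)):
            survived += 1
    return survived / trials


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, survival_probability(0.5, k), simulate(0.5, k))
```

In this toy picture, adding redundant copies across components quickly makes the computation robust to dropout, which is the sense of “redundancy between components” above; it says nothing by itself about superposition inside a component.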
Introducing EffiSciences’ AI Safety Unit
One thing I just thought about: I would predict that dropout reduces superposition in parallel and increases superposition in series (because, to make sure that the function is computed, you can have redundancy).
Thank you! I don’t know why, but earlier I ended up on a different page with broken links (maybe some problem on my end).
Hey, almost all the links are dead. Would it be possible to update them? Otherwise the post is pretty useless, and I am interested in them ^^
Indeed. D4 is better than D5 if we had to choose, but D4 is harder to formalize. I think that having a theory of corrigibility without D4 is already a good step, as D4 seems like “asking to create corrigible agents”. So maybe the way to do it is: 1. have a theory of corrigible agents (D1, D2, D3, D5), and 2. have a theory of agents that ensures D4 by applying the previous theory to all agents and subagents.
Thank you! I haven’t read Armstrong’s work in detail myself, but I think that one key difference is that classical indifference methods all try to make the agent “act as if the button could not be pressed”, which causes the big gamble problem.
By the way, do you have any idea why almost all the links on the page you linked are dead, or how to find the articles mentioned?
Improvement on MIRI’s Corrigibility
Great post! I was wondering whether the conclusion to draw is really that “dropout inhibits superposition”. My prior was that it should increase it (so this post proved me wrong on that part), mainly because in a model with MLPs in parallel (like a transformer), dropout would force redundancy of circuits, not inside one MLP but across different MLPs.
I’d like to see more on that: it would be super useful to know whether or not dropout helps interpretability, in order to decide whether to use it during training.
A Corrigibility Metaphore—Big Gambles
Here you present the link between two models using the fact that their centroid tokens are the same.
Do you know of any other correlations of this type? Maybe by finding other links between a model and its predecessors, you could combine them and get a more reliable tool for predicting whether Model A and Model B share a training history.
In particular, I found that there seems to be a correlation between the size of a model and the prompt that gives the best accuracy [https://arxiv.org/abs/2105.11447, Figure 5]. The link there is only the size of the models, but I thought that size was a weird explanation, which made me think of your article.
Hope this may somehow help :)
Thanks for this nice post!
You said that the objective was to “find the type of strategies the model currently learning before it becomes performant, and stop it if this isn’t the one we want”. But how would you define which attractors are good ones? How can we identify the properties of an attractor if no dangerous model that has this attractor has been trained? And what if the number of attractors is huge and we can’t test them all beforehand? It doesn’t seem obvious that the number of attractors wouldn’t grow as the network does.
Hello, I have an issue with the epistemology of the problem: even if the training process gave the behavior we want, we would have no way to check that the AI is working properly in practice.
Let me give more details: in the vault problem, given the same information, let’s think of an AI that just has to answer the question “Is the diamond still in the vault?”.
One thing we can suppose is that the set Y, from which we draw the labeled examples used to train the AI (a set of techniques for the thief), is not what matters: increasing its size isn’t a solution, because there is always something beyond what we can imagine. We can instead try to solve the problem relative to Y. Consider X, the set of scenarios that the AI can understand given that it was trained on Y. The only way to act on X\Y is then to train on Y in a specific way (I think), so we need a link between X and Y that we can exploit. For that we would need to know what X looks like, but we can’t, since finding that out is the goal. The only thing we could know is X’, the set of scenarios that could be imagined or understood by a human, even if that human could not label such a scenario. Since by definition we don’t know whether X = X’, there may always be cases in which the AI understood what the thief did but we don’t.
To me, the problem here is to get the AI to give us the information it has when the thief uses a technique that is in X’, not only in X. In X\X’, nothing we know can help us guide the AI toward good behavior on this set. But it seems possible in X’, because we can imagine scenarios and therefore ways of guiding the AI.
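In symbols, the sets used above can be summarized as follows. The inclusion on the first line is one reading of the setup (an assumption: that the training scenarios Y are both understood by the AI and imaginable by a human), not something stated explicitly.

```latex
% Assumption: training scenarios are understood by the AI and imaginable by humans.
\[
  Y \subseteq X \cap X'
\]
% X splits into the part humans can imagine and the part they cannot.
\[
  X = (X \cap X') \,\cup\, (X \setminus X'),
  \qquad (X \cap X') \cap (X \setminus X') = \emptyset
\]
```

The hope is that training on Y can shape the AI’s behavior on X ∩ X’, while on X \ X’ we have nothing to guide or check it with.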
So what I don’t understand is why the counterexample in which the thief uses a secret property of transistors is a good counterexample. To me, this is a case where, because the method is out of reach for humans, we have no idea whether the AI tells the truth: we can’t be sure of training an AI to have a specific behavior on examples we could not imagine. Moreover, we can’t check whether it tells the truth, so how would we trust it?
Thank you for reading
Quick question: you say that MLPs 2-6 gradually improve the representation of the athlete’s sport, and that no single MLP does it in one go. Would you consider that the reason could be something like what this post describes? https://www.lesswrong.com/posts/8ms977XZ2uJ4LnwSR/decomposing-independent-generalizations-in-neural-networks
So MLPs 2-6 basically do the same computation, but in different superposition bases, so that after several MLPs the model is pretty confident about the answer? Then would you think there is something more to say about the way the “bases are arranged”, e.g. which concepts interfere with which? (I guess this could help answer questions like “how to change the lookup table name-surname-sport”, which we are currently not able to do.)
Thanks