scasper
This is exciting to see. I think this solution is impressive, and the case for the structure you find is compelling. It’s also nice that this solution goes a little further in one respect than the previous one. The analysis with bars gets a little closer to a question I have had since the last solution:
My one critique of this solution is that I would have liked to see an understanding of why the transformer only seems to make mistakes near the parts of the domain where there are curved boundaries between regimes (see fig above with the colored curves). Meanwhile, the network did a great job of learning the periodic part of the solution that led to irregularly-spaced horizontal bars. Understanding why this is the case seems interesting but remains unsolved.
I think this work gives a bit more of a granular idea of what might be happening. And I think it’s an interesting foil to the other one. Both came up with some fairly different pictures for the same process. The differences between these two projects seem like an interesting case study in MI. I’ll probably refer to this a lot in the future.
Overall, I think this is great, and although the challenge is over, I’m adding this to the GitHub README. And if you let me know a high-impact charity you’d like to support, I’ll send $500 to it as a similar prize for the challenge :)
A Short Memo on AI Interpretability Rainbows
Examples of Prompts that Make GPT-4 Output Falsehoods
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human’s comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless.
Thanks—I agree that this seems like an approach worth doing. I think that at CHAI and/or Redwood there is a little bit of work at least related to this, but don’t quote me on that. In general, it seems like if you have a model and then a smaller distilled/otherwise-compressed version of it, there is a lot you can do with them from an alignment perspective. I am not sure how much work has been done in the anomaly detection literature that involves distillation/compression.
I don’t work on this, so grain of salt.
But wouldn’t this take the formal out of formal verification? If so, I am inclined to think about this as a form of ambitious mechanistic interpretability.
I think this is a good point, thanks.
Eight Strategies for Tackling the Hard Part of the Alignment Problem
There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don’t expect one.
Thoughts of mine on this are here. In short, I have argued that toy problems, cherry-picking models/tasks, and a lack of scalability have contributed to mechanistic interpretability being relatively unproductive.
Takeaways from the Mechanistic Interpretability Challenges
I think not. Maybe circuits-style mechanistic interpretability is though. I generally wouldn’t try dissuading people from getting involved in research on most AIS things.
Advice for Entering AI Safety Research
We talked about this over DMs, but I’ll post a quick reply for the rest of the world. Thanks for the comment.
A lot of how this is interpreted depends on which exact definition of superposition one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain fixed amount of information about an example, then the layer must carry more information per neuron if it’s thin than if it’s wide. And that, I think, is the point the Huang paper helps to make. The fact that deep and thin networks tend to be more robust suggests that representing information more densely w.r.t. neurons in a layer does not make these networks less robust than wide, shallow nets.
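The arithmetic behind the per-neuron point is simple; here is a minimal sketch (the bit counts are made-up illustrative numbers, not measurements from any network):

```python
# Hypothetical illustration: if a layer as a whole must represent a fixed
# amount of information about each example, then the average information
# per neuron scales inversely with the layer's width.

def info_per_neuron(total_bits: float, width: int) -> float:
    """Average bits each neuron carries if the layer represents total_bits."""
    return total_bits / width

total_bits = 256.0  # assumed fixed information content of the representation

wide_layer = info_per_neuron(total_bits, width=512)  # 0.5 bits/neuron
thin_layer = info_per_neuron(total_bits, width=64)   # 4.0 bits/neuron

# The thin layer packs 8x more information into each neuron,
# i.e., its representation is denser w.r.t. neurons.
assert thin_layer / wide_layer == 8.0
```

This is just the density claim in the comment above, not a model of superposition itself; the interesting empirical question is whether that denser packing costs robustness, and the Huang result suggests it does not.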
Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this.
Thanks for the comment.
In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.
This seems completely plausible to me. But I think that it’s a little hand-wavy. In general, I perceive the interpretability agendas that don’t involve applied work to be this way. Also, few people would argue that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that they would be differentially useful for safety.
there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.
No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine claiming that having humans write and study simple algorithms for search, modular addition, etc. is part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. I think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.
I can kinda see the intuition here, but could you explain why we shouldn’t expect this to generalize?
Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. Ramanujan et al. weren’t interpreting networks; they were just training them. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than gradient-based training of a sparse network is.
I just went by what it said. But I agree with your point. It’s probably best modeled as a predictor in this case—not an agent.
GPT-4 is easily controlled/exploited with tricky decision theoretic dilemmas.
In general, I think not. The agent could only make this happen to the extent that its internal activations were known to it and could be actively manipulated by it. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms, such as ERM or solving MDPs, the model’s internals are not modeled as part of the actual algorithm. They’re just implementational substrate.