Progress Report 5: tying it together
Previous: Progress report 4
First, an example of what not to do if you want humanity to survive: make an even foom-ier and less interpretable version of neural nets. On the spectrum from good idea to bad idea, this one lands well past neuromorphic computing. The authors even include a paragraph contrasting their method with Hebbian learning, showing that theirs is more volatile and unpredictable. Great. https://arxiv.org/abs/2202.05780
On the flip side of the coin, here’s some good stuff that I’m happy to see being worked on.
Some other work that has been done with nostalgebraist’s logit-lens: ‘Looking for Grammar in All the Right Places’ by Alethea Power.
Tom Frederik is working on a project called ‘Unseal’ https://github.com/TomFrederik/unseal (to unseal the mystery of transformers), expanding on the PySvelte library to include aspects of the logit-lens idea.
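As a reminder of what the logit-lens actually does, here’s a minimal sketch (my own illustration on HuggingFace’s GPT-2, not Unseal’s API): decode each layer’s intermediate hidden state through the model’s final layer norm and unembedding, as if that layer were the last one, to see the model’s ‘running guess’ at each depth.

```python
# Minimal logit-lens sketch (my illustration using HuggingFace GPT-2,
# not Unseal's API): decode every layer's residual stream as if it were
# the final layer, to see the model's "running guess" at each depth.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors, [batch, seq, d_model]
for layer, hidden in enumerate(out.hidden_states):
    # Apply the final layer norm and unembedding to the intermediate state.
    logits = model.lm_head(model.transformer.ln_f(hidden))
    top = logits[0, -1].argmax().item()
    print(f"layer {layer:2d} predicts: {tokenizer.decode(top)!r}")
```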
Inspired by Unseal, I started my own fork of the library to work on an interactive visualization that creates an open-ended framework for tying together multiple ways of interpreting models. I’m hoping to use it to bring together the things I’ve found or made so far, as well as whatever I find or make next. It seems generally useful to have a way to put a bunch of different ‘views’ of a model side by side, so you can scan across all of them and get a more complete picture.
Once I get this tool a bit more fleshed out, I plan to start trying to play the audit game with it, using toy models that have been edited.
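To make ‘edited’ concrete, here’s a purely hypothetical example of the kind of planted change I have in mind; the auditor’s job would be to find it using interpretability tools alone.

```python
# Toy illustration of an "edited" model for the audit game (hypothetical:
# the model and the choice of edit are mine, not an established protocol).
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

with torch.no_grad():
    # Plant a hidden edit: quietly amplify one hidden neuron's influence.
    toy_model[2].weight[:, 7] *= 5.0
# The auditor must find neuron 7 without being told which weights changed.
```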
Here’s a presentation of my ideas that I made for the conclusion of the AGI Safety Fundamentals course. https://youtu.be/FarRgPwBpGU
Among the things I’m thinking of putting in are ideas related to the following papers.
One: a paper mentioned by jsteinhardt in his early-2022 paper round-up https://www.lesswrong.com/posts/qAhT2qvKXboXqLk4e/early-2022-paper-round-up : ‘Summarizing Differences between Text Distributions with Natural Language’ (with Ruiqi Zhong, Charlie Snell, and Dan Klein).
This paper presents a technique for summarizing language data, which I think will be cool to use someday. Along the way to building it, the authors also had to do some clustering, which sounds like a useful thing to include in my visualization and contrast with my neuron-importance topic clustering. I also hope to revisit and improve the neuron-importance clustering itself: if I repeat the neuron-importance sampling on topic-labeled samples, I should be able to tag the resulting clusters with the topics they most strongly relate to, which would make them more useful for interpretation.
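As a sketch of what I mean by tagging clusters with topics (made-up data, and plain k-means standing in for whatever clustering method I end up using):

```python
# Hypothetical sketch of topic-tagging neuron-importance clusters: cluster
# neurons by their importance profiles, then label each cluster with the
# topic whose samples drive it hardest. Data and cluster count are made up.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_samples, n_neurons, n_topics, n_clusters = 200, 64, 5, 8
importances = rng.random((n_samples, n_neurons))    # importance of neuron j on sample i
topics = rng.integers(0, n_topics, size=n_samples)  # topic label per sample

# Cluster neurons (columns) by how their importance varies across samples.
neuron_cluster = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(importances.T)

for c in range(n_clusters):
    cluster_mean = importances[:, neuron_cluster == c].mean(axis=1)
    per_topic = [cluster_mean[topics == t].mean() for t in range(n_topics)]
    print(f"cluster {c}: most related topic = {int(np.argmax(per_topic))}")
```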
Two: ‘Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space’ by Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. https://arxiv.org/abs/2203.14680
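The core observation, as I read it, is that each feed-forward ‘value vector’ can be interpreted by projecting it into vocabulary space and seeing which tokens it promotes. A rough sketch of that projection on GPT-2 (my illustration, not the authors’ code):

```python
# Rough sketch (my illustration, not the authors' code): read one of GPT-2's
# feed-forward value vectors in vocabulary space by projecting it through
# the unembedding, to see which tokens it promotes. Indices are arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

layer, key = 10, 42
with torch.no_grad():
    # In GPT-2's Conv1D convention, mlp.c_proj.weight is [d_ff, d_model],
    # so row `key` is the value vector written by feed-forward key `key`.
    value_vec = model.transformer.h[layer].mlp.c_proj.weight[key]
    logits = model.lm_head(value_vec)  # project into vocabulary space
    print([tokenizer.decode(int(t)) for t in logits.topk(10).indices])
```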
Three: ‘What does BERT dream of? Deep dream with text’ https://www.gwern.net/docs/www/pair-code.github.io/c331351a690011a2a37f7ee1c75bf771f01df3a3.html
Seems neat and sorta related. I can probably figure out some way to add a version of this text-deep-dreaming to the laundry list of ‘windows into interpretability’ I’m accumulating.
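If I had to guess at the recipe, it would be something like the following rough sketch (made-up hyperparameters, and GPT-2 standing in for BERT): do gradient ascent on soft input embeddings to maximize a chosen activation, then snap the result back to the nearest real tokens.

```python
# Rough text-deep-dreaming sketch in the spirit of the BERT-dreaming post
# (my guess at the recipe, with made-up hyperparameters): gradient-ascend
# soft input embeddings to maximize one residual-stream activation, then
# snap them back to the nearest real tokens.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)                 # only the input embeddings move

embed = model.wte.weight                    # [vocab, d_model]
seq_len, layer, dim = 8, 6, 300             # arbitrary target activation
start = embed[torch.randint(embed.shape[0], (1, seq_len))].detach().clone()
soft = torch.nn.Parameter(start)
opt = torch.optim.Adam([soft], lr=0.1)

for step in range(100):
    opt.zero_grad()
    out = model(inputs_embeds=soft, output_hidden_states=True)
    act = out.hidden_states[layer][0, :, dim].mean()
    (-act).backward()                       # ascend on the activation
    opt.step()

# Snap the optimized embeddings back to their nearest vocabulary tokens.
tokens = torch.cdist(soft.detach()[0], embed).argmin(dim=-1)
print(tokenizer.decode(tokens.tolist()))
```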