Thoughts on self-inspecting neural networks.
While I have been contemplating this subject for quite some time, this is my first attempt at communicating it publicly. I have been thinking about AI implementation as well as safety, looking at ideas from various experts in the field and considering how they might be combined and integrated to improve performance or safety (preferably both). One important aspect of that is the legibility of how neural networks work. A YouTube video I watched last night helped me crystallize how it might be implemented: an interview with the authors of this paper on arXiv, in which they show how they were able to isolate a concept in GPT-3 to a particular set of parameters. Their demonstration makes what I’ve been thinking of seem that much more plausible.
So, my concept is for an ensemble of specialized neural networks that contribute to a greater whole. Specifically, in this case, I’m considering a network that is trained by watching, and eventually directing, the training of another network such as an LLM. My thought is to export an “image” of a neural network, with the pixels representing the parameters in the network. It would be a multidimensional image; standard RGB would probably not have enough channels to represent everything I’m considering. Alternatively, multiple “views” might be used so that we could maintain compatibility with standard image formats, take advantage of existing tools, and display them for some level of human consumption. It might be best to have a way to transition between multidimensional matrices and images so that each representation can be used where it fits best. Because of the size of the networks, it will be necessary not only to have a base representation of the network but also to represent changes and activations within the network as “diffs” from the baseline (or at least from the previous state resulting from the baseline plus earlier diffs), something like the way Git works. We could also use various lossless and lossy image compression techniques to make these representations more tractable to work with. A rough sketch of this snapshot-and-diff idea follows.
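To make that a bit more concrete, here is a minimal sketch of the snapshot-and-diff idea, assuming a PyTorch-style model; the helper names (snapshot_params, param_diff) are placeholders I’m making up purely for illustration, not an existing API:

```python
import numpy as np
import torch
import torch.nn as nn

def snapshot_params(model: nn.Module) -> dict[str, np.ndarray]:
    """Export every parameter tensor as a plain array -- the 'base image'."""
    return {name: p.detach().cpu().numpy().copy()
            for name, p in model.named_parameters()}

def param_diff(base: dict[str, np.ndarray],
               current: dict[str, np.ndarray],
               threshold: float = 1e-6) -> dict[str, np.ndarray]:
    """Keep only parameters that changed appreciably -- a Git-like 'diff'.

    Values below `threshold` are zeroed so the result compresses well
    (sparse arrays, run-length encoding, or standard image codecs).
    """
    diff = {}
    for name, cur in current.items():
        delta = cur - base[name]
        delta[np.abs(delta) < threshold] = 0.0
        if np.any(delta):
            diff[name] = delta
    return diff

# Usage: snapshot before an update, then keep only the delta.
model = nn.Linear(8, 4)                      # stand-in for a real network
base = snapshot_params(model)
with torch.no_grad():
    model.weight.add_(0.01 * torch.randn_like(model.weight))  # pretend update
delta = param_diff(base, snapshot_params(model))
```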
Within these artificial neural images would be contained the various weights and biases of the MLPs and attention heads; that forms the base layer we work from. During training (which could conceivably be ongoing), we would record a combination of the external input signal, the activation states within the network (favoring only the portions that appreciably activate in producing the output), the output of the network, the evaluation of that output, and the parameter updates made as a result. All of that information would be fed into a supervisory neural network that functions in a manner not too dissimilar from current image recognition systems; I’m thinking of inspiration from systems such as AlphaGo and DALL-E. The supervisory network would analyze the changes in the target network and, with enough training, should be able to predict exactly which portions of the network would be updated to learn from any given input. Once it can accurately predict these updates, the process of updating the network could be significantly cheaper than the usual methods, since it could be directed to just the nodes that need the update. It might even aggregate multiple updates, predicting not only the next state of the network needed to achieve the desired output but the final state required to do so. Such directed updates may also limit the risk of continuous training eventually corrupting the network. If we feed the output of this supervisory network into the ensemble, we could reach the point where an AI could legibly and verifiably describe its “thinking” process in concrete, human-understandable terms. It could also describe that process by referencing the particular neural pathways involved and the actual “why” of how it comes to various conclusions. If the images included the supervisory network as well as the other portions of the overall system, then the network might be able to describe every part of itself in terms humans could understand. We could potentially interact with the system in the following ways (a sketch of the per-step data collection this would require follows the example queries):
“Which parts of your network are involved in evaluating prompt X?”
“Which parts of your network will be changed when learning X?”
“What tertiary effects are predicted to occur as a result of learning X?”
“Where is the concept of X → Y stored?”
“Change relationship X → Y to X → Z.”
“Summarize the knowledge contained in portion X in your network.”
“Appropriately label the various portions of your network.”
This could be done at increasingly fine-grained levels of hierarchy.
“Produce a human comprehensible corpus of knowledge (perhaps in database form) that represents the entirety of the knowledge and rules contained in your network.”
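Here is the rough sketch I mentioned of the per-step record that could serve as training data for the supervisory network. It reuses the hypothetical snapshot_params/param_diff helpers from the earlier sketch and assumes a simple supervised loss as the “evaluation”; it is a sketch of the idea, not a definitive implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def record_training_step(model: nn.Module,
                         optimizer: torch.optim.Optimizer,
                         x: torch.Tensor,
                         y: torch.Tensor) -> dict:
    """One target-network training step, logged for the supervisory network."""
    before = snapshot_params(model)          # helper from the earlier sketch

    # Capture activations of each leaf module as the input flows through.
    activations = {}
    hooks = [m.register_forward_hook(
                 lambda mod, inp, out, name=name: activations.update({name: out.detach()}))
             for name, m in model.named_modules()
             if len(list(m.children())) == 0]

    output = model(x)
    loss = F.mse_loss(output, y)             # the "evaluation" of the output
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    for h in hooks:
        h.remove()

    # The supervisory network would train on (input, activations, output,
    # evaluation, parameter delta) tuples like this one.
    return {
        "input": x.detach(),
        "activations": activations,
        "output": output.detach(),
        "evaluation": loss.item(),
        "param_delta": param_diff(before, snapshot_params(model)),
    }
```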
If those kinds of queries were possible, especially the last few, we could end up with something that is understandable and even searchable. We could deeply inspect the network to see what it has learned and whether there are things within it that we don’t like. Obviously, the output would be tremendously large, but if it were organized in the manner I’m thinking of, it could be significantly more compact than the body of information used to create that knowledge. That output might also be useful in significantly optimizing the functionality of the network and even offloading some of its complexity. It may turn out to let us significantly compress the knowledge by eliminating duplication contained within it. The database might be searchable and queryable so that we could get summaries of the knowledge from various perspectives.
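As a purely hypothetical illustration of what that “database form” might look like, a schema along these lines would make the extracted knowledge searchable while keeping a pointer back to the region of the network that encodes each entry:

```python
import sqlite3

conn = sqlite3.connect("network_knowledge.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS concepts (
    id        INTEGER PRIMARY KEY,
    label     TEXT,      -- human-readable name the network assigns
    summary   TEXT,      -- plain-language description of the knowledge
    region    TEXT,      -- which layers / parameter blocks encode it
    parent_id INTEGER REFERENCES concepts(id)  -- coarse-to-fine hierarchy
);
CREATE TABLE IF NOT EXISTS relations (
    source_id INTEGER REFERENCES concepts(id),
    target_id INTEGER REFERENCES concepts(id),
    kind      TEXT       -- e.g. "X implies Y", "X is-a Y"
);
""")

# Example query: everything the network claims to know about some topic.
rows = conn.execute(
    "SELECT label, summary, region FROM concepts WHERE summary LIKE ?",
    ("%chess%",)).fetchall()
```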
This kind of organization may allow insights heretofore inaccessible to us because of how disparate and compartmentalized knowledge is in our culture. We might even be able to train a subsequent network on that output, or to interact with it, producing a much smaller network with similar capabilities that works with the corpus far more efficiently than traditional neural networks allow. It could also let us separate empirical knowledge from the more creative portions of the knowledge. That way we could actually distinguish “hallucinations” from factual knowledge, and allow the network to explicitly do so as well, better grounding it in reality and hopefully producing a more trustworthy and deterministic system.
We could actually check for inner alignment and be more confident that what we produced was what we intended to produce. We could check that what it produced actually worked the way we intended and that undesirable goals weren’t being surreptitiously hidden from us. We could do this by exporting the analysis systems out of the network before it gained enough capability to formulate plans we would find objectionable, and we could constantly analyze such a system for signs of it heading in directions we didn’t like. If this were public, the system could be searched at large scale by interested third parties to help verify that it has been thoroughly inspected from every possible angle; it’s amazing what the internet can find when you let people dig into something. With the stakes we are facing, we are going to want that level of transparency.
We could test whether the insight we think we have gained into the system is genuine by using it to predict the system’s output for a particular input. If we find suspected flaws, we could probe them explicitly and repair or excise them.
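One way that check could look, reusing the hypothetical param_diff output from the earlier sketches: have the supervisory model name the parameter blocks it expects a training example to touch, then score that prediction against the delta that actually occurs.

```python
import numpy as np

def validate_prediction(predicted_regions: set[str],
                        actual_delta: dict[str, np.ndarray]) -> float:
    """Fraction of actually-changed parameter blocks the supervisory model anticipated.

    `predicted_regions` is the supervisory network's guess (names of parameter
    blocks it expects to change); `actual_delta` comes from param_diff() after
    the real training step. A score well below 1.0 suggests our "insight" into
    the system is suspect.
    """
    changed = {name for name, d in actual_delta.items() if np.any(d)}
    if not changed:
        return 1.0
    return len(changed & predicted_regions) / len(changed)
```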
All of this is separate from other ideas that I have about minimizing the impact of agentic systems in carrying out directions of a user. I intend to go into that at a later date.
Well, that’s it. Let me know what you think. I’m interested to hear others’ thoughts on this, both positive and negative (though I prefer constructive criticism over mere negativity).
I think there’s a lot to like here. The entire thing about images reveals that you’re not quite “thinking with portals,” but I agree that automated interpretability would be powerful, and could do things like you talk about.
An important thing to consider is: what would be the training data of a supervisory network?
Maybe another important thing to consider is: would this actually help alignment more than it helps capabilities?
Thanks for your comment!
As for the image thing, it’s more of a metaphor than a literal description of what I’m talking about. I’m thinking of a multidimensional matrix representation; you can think of that a bit like an image (RGB values on pixels) and use techniques similar to those used by actual image software, but it’s not a literal JPEG or BMP or whatever. The idea is to be able to take advantage of compression algorithms, etc., to make the process more efficient.
The training data for the supervisory network is the inputs, outputs, and parameter deltas of the target network. The idea is to learn which parameters change for which input/output pair, thereby eventually localizing concepts within the target network and hopefully labeling/explaining them in human-understandable form. This should be possible since the inputs and outputs are human-readable text.
The reason to treat it a bit like an image, and to label parameters and sections of the target network, is to take advantage of the same kind of technology used in the AlphaGo/AlphaZero/MuZero AIs, which applied image-analysis-style techniques, to predict the deltas in the target network. If you could do this, then you should be able to “aim” the target network in a direction that you want it to go; basically, tell it what you want it to learn.
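A minimal sketch of what “aiming” the target network might look like, assuming the supervisory network can emit a sparse predicted delta for the thing we want it to learn (predict_delta below is entirely hypothetical):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_directed_update(model: nn.Module,
                          predicted_delta: dict[str, torch.Tensor],
                          scale: float = 1.0) -> None:
    """Apply the supervisory network's predicted delta only to the parameters
    it names, leaving the rest of the target network untouched."""
    params = dict(model.named_parameters())
    for name, delta in predicted_delta.items():
        if name in params:
            params[name].add_(scale * delta)

# Hypothetical usage: supervisory_net.predict_delta("X -> Z") would return a
# sparse {parameter name: delta tensor} mapping for the desired edit.
# apply_directed_update(target_model, supervisory_net.predict_delta("X -> Z"))
```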
All of this may have several benefits: it could allow us to analyze the target network, understand what is in there, and check whether it contains anything we don’t like (misalignment, for example). It could also allow us to directly give the network a chosen alignment instead of the more nebulous feedback we currently use to train networks. Right now, it’s a bit like a teacher who only tells their student that they are wrong, not why they are wrong; the AI has to guess what it did wrong. That makes it far more likely to end up learning a policy that has the outward behavior we want but an inner meaning that doesn’t actually line up with what we’re trying to teach it.
You are correct that it may accelerate capabilities as well as safety; unfortunately, most of my ideas seem to be capabilities ideas. However, I am trying to focus more on safety, and I think we may have to accept that alignment research will sometimes advance capabilities research too. The more we restrict the kinds of safety research we are willing to try, the more likely it is that we don’t find a solution at all. It’s entirely possible that the hypothetical “perfect solution” would greatly aid capabilities at the same time that it solved alignment/safety issues; I don’t think we should avoid it for that reason. I tend to think that safety research should be mostly open source while capabilities research should be mostly closed. If somebody somehow manages to build real AGI, we definitely want them to have the best safety research in the world available to them, no matter who it is that does it.
Anyway, thanks again for your input, I really appreciate it. Let me know if I successfully answered your questions/suggestions or not. :-)