Would it be possible to use a huge model (e.g. an LLM) to interpret smaller networks, and output human-readable explanations? Is anyone working on something along these lines?
I’m aware Kayla Lewis is working on something similar (but not quite the same thing) on a small scale. In my understanding, from reading her tweets, she’s using a network to predict the outputs of another network by reading its activations.
Would it be possible to use a huge model (e.g. an LLM) to interpret smaller networks, and output human-readable explanations? Is anyone working on something along these lines?
I’m aware Kayla Lewis is working on something similar (but not quite the same thing) on a small scale. In my understanding, from reading her tweets, she’s using a network to predict the outputs of another network by reading its activations.