I see two major challenges (one of which leans heavily on progress in linguistics). I can see there being mathematical theory to guide candidate model decompositions (Challenge 1), but I imagine that linking up a potential model decomposition to a theory of ‘semantic interpretability’ (Challenge 2) is equally hard, if not harder.
Any ideas on how you plan to address Challenge 2? Maybe the most robust approach would involve active learning of the pseudocode, where a human guides the algorithm in its decomposition and labeling of each abstract computation.
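To make that suggestion concrete, here is a hypothetical sketch of such a human-in-the-loop labelling loop: a decomposition method proposes abstract computations with candidate labels, and a human accepts or corrects each one. Every name and structure below is invented for illustration and doesn't correspond to any existing tool.

```python
# Hypothetical sketch: an algorithm proposes a decomposition into abstract
# computations with guessed labels; a human confirms or relabels each one.
from dataclasses import dataclass

@dataclass
class AbstractComputation:
    subgraph_id: str           # which piece of the model this covers
    proposed_label: str        # the algorithm's guess at what it computes
    confirmed_label: str = ""  # filled in by the human

def human_labelling_loop(computations):
    """Ask a human to accept or correct each proposed label."""
    for c in computations:
        answer = input(
            f"{c.subgraph_id}: is '{c.proposed_label}' right? "
            "(enter to accept, or type a correction) "
        )
        c.confirmed_label = answer or c.proposed_label
    return computations

if __name__ == "__main__":
    # Placeholder proposals of the kind a decomposition method might emit.
    proposals = [
        AbstractComputation("layers.3.mlp", "detects negation"),
        AbstractComputation("heads.5.1", "copies previous token"),
    ]
    human_labelling_loop(proposals)
```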
Thoughts on Challenge 2:
‘Smaller’ functions will probably be more human-interpretable, just because they do less, are easier to analyse, and have less weird stuff going on. I think this implies that as you ‘double-click’ on high-level primitives and drill down into the smaller functions they’re built from, things get more and more interpretable.
It’s plausible to me that there’s some mathematical theory of how to get decompositions that are human-interpretable enough for our purposes.
It’s also plausible to me that by trying enough things you find a method that seems sort of human-interpretable, work out what properties it actually has, and check whether you can use those.
There might be synergies with interpretability techniques like neuron visualisation that give you a sense of the input-output behaviour without telling you much about the internal mechanisms.
If a neural network is well-trained, it’s easier to visualise what each neuron does, because intuitively the neurons need to be doing sensible things for the outputs to be sensible. You could hope that a similar property obtains for high-level primitives, if those primitives are constructed sensibly out of neurons.
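For concreteness, here is a minimal sketch of one common form of neuron visualisation (activation maximisation), assuming a PyTorch model; the model, layer, and neuron index are placeholders. Note that this only probes input-output behaviour — it shows what the neuron responds to, not the internal mechanism that computes it.

```python
# Minimal activation-maximisation sketch, assuming a PyTorch vision model.
import torch

def visualise_neuron(model, layer, neuron_idx, input_shape=(1, 3, 64, 64),
                     steps=200, lr=0.05):
    """Gradient-ascend an input to maximise one neuron/channel's activation."""
    activation = {}

    def hook(_module, _inp, out):
        activation["value"] = out

    handle = layer.register_forward_hook(hook)
    x = torch.randn(input_shape, requires_grad=True)
    optimiser = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        optimiser.zero_grad()
        model(x)
        # Assumes the hooked layer's output is (batch, channels, ...):
        # maximise the mean activation of the chosen channel.
        loss = -activation["value"][:, neuron_idx].mean()
        loss.backward()
        optimiser.step()

    handle.remove()
    return x.detach()
```

Fancier versions add regularisation and input transformations so the optimised input looks more natural, but the basic idea is just gradient ascent on the input.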