Note: I work in Clarity at OpenAI. Chris and I have discussed this response (though I cannot claim to represent him).
Does “faithful” mean “100% identical in terms of I/O”, or more like “captures all of the important elements of”?
I’d say faithfulness lies on a spectrum. Bit-for-bit identical I/O is nearly impossible for a neural network (given the vagaries of floating-point arithmetic), but what really interests us is “effectively identical I/O”. A working definition: an interpretable network is one that can act as a drop-in replacement for the original network with no impact on final accuracy.
This allows some wiggle room in the weights: they can be rounded up or down, and weights that have no impact on final accuracy can be ablated. We are not, however, interested in creating interpretable approximations of the original network.
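To make that wiggle room concrete, here is a minimal sketch (my own illustration, not our actual tooling; all names are hypothetical) of the acceptance test this definition implies: round and ablate a weight matrix, then check that the edited network still matches the original’s held-out accuracy.

```python
import numpy as np

def simplify_weights(W, decimals=2, ablate_below=1e-3):
    """Round weights and ablate those too small to matter.

    Both thresholds are illustrative; in practice they would be
    chosen so that the drop-in check below still passes.
    """
    W = np.round(W, decimals=decimals)
    W = np.where(np.abs(W) < ablate_below, 0.0, W)
    return W

def accuracy(predict, X, y):
    return np.mean(predict(X) == y)

def is_drop_in_replacement(predict_original, predict_edited, X_test, y_test):
    """'Effectively identical I/O': the edited network matches the
    original's final accuracy, not its bit-level outputs."""
    return np.isclose(accuracy(predict_original, X_test, y_test),
                      accuracy(predict_edited, X_test, y_test))
```

Note that the comparison is on final accuracy rather than on raw outputs: the definition tolerates floating-point drift so long as task performance is untouched.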
My understanding is that neural networks are continuous, whereas human-understandable code like the Linux kernel is discrete, so it seemingly just can’t work in the former case, and I’m not sure how it can work in the latter case either.
We are reasoning explicitly about numerical code, but I would argue this isn’t that alien to human comprehension! Discrete code may be more intuitive (sometimes), but human cognition is certainly capable of understanding numerical algorithms (think of, say, the SIFT algorithm)!
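As an illustration, here is a simplified fragment of the kind of numerical code SIFT-style pipelines are built from: the per-pixel gradient magnitude and orientation that feed SIFT’s orientation histograms. This is my own toy rendering, not a reference implementation, but I’d argue it is perfectly human-readable:

```python
import numpy as np

def gradient_magnitude_orientation(img):
    """Per-pixel gradient magnitude and orientation (radians),
    the quantities SIFT histograms when building descriptors.

    A simplified illustration, not a reference implementation.
    """
    # Finite-difference gradients along each image axis.
    dy, dx = np.gradient(img.astype(float))
    magnitude = np.sqrt(dx**2 + dy**2)
    orientation = np.arctan2(dy, dx)
    return magnitude, orientation
```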
Do you or Chris think that a test of this might be to take a toy model (say, a 100-neuron ANN) that solves some toy problem, and see if it can be reverse compiled? (Or let me know if this has already been done.) If not, what’s the earliest meaningful test that can be done?
We are indeed working through this on a fairly sophisticated vision model. We’re making progress!
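That said, the toy test you describe is easy to sketch. A minimal, entirely illustrative version: hand-weight a tiny ReLU network to compute XOR (a trained 100-neuron net would play the same role), propose a discrete “reverse compilation”, and exhaustively check that the two are I/O-identical on the input domain:

```python
import numpy as np
from itertools import product

# A tiny hand-weighted ReLU network that computes XOR; a stand-in
# for the trained toy model in the question.
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])

def toy_net(x):
    h = np.maximum(x @ W1 + b1, 0.0)  # hidden ReLU layer
    return float(h @ W2)

# A candidate "reverse compilation": discrete code we claim is
# effectively I/O-identical to the network.
def reverse_compiled(x):
    return float(x[0] != x[1])  # XOR

# The test: exhaustive I/O agreement over the (finite) input domain.
for bits in product([0.0, 1.0], repeat=2):
    x = np.array(bits)
    assert np.isclose(toy_net(x), reverse_compiled(x))
print("reverse-compiled program is a drop-in replacement on this domain")
```

On real models the input domain isn’t finite, so the check would fall back to the accuracy-based definition above.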