Using ZIP as a compression metric for NNs (I assume you do something along the lines of “take all the weights, line them up, and then ZIP them”) is unintuitive to me for the following reason: ZIP, and really any coding scheme that just tries to compress the weights by themselves, picks up on statistical patterns in the raw weights. But NNs are not simply a list of floats; they are arranged in a highly structured manner. The weights get turned into functions, and it is 1. the functions, and 2. the way the functions interact that we are ultimately trying to understand (and therefore compress).
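Concretely, here is a minimal sketch of the baseline I am picturing (the name `zip_size` and the float32 serialization are my assumptions, not necessarily your setup):

```python
import zlib
import numpy as np

def zip_size(weights: list[np.ndarray]) -> int:
    """Flatten all weight tensors into one byte stream and DEFLATE it."""
    blob = b"".join(w.astype(np.float32).tobytes() for w in weights)
    return len(zlib.compress(blob, level=9))
```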
To wit, a simple example for the first point: assume that inside your model is a 2x2 matrix with entries M = [0.587785, −0.809017, 0.809017, 0.587785]. Storing it like this will cost you a few bytes, and compressing it might roughly halve that, I believe. But there is a much more compact way to store this information: this matrix represents a rotation by 54 degrees. Stored that way, it takes about one byte.
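To make this concrete, a hedged sketch of the two encodings (for an input this tiny, DEFLATE’s header overhead can even make the “compressed” version larger than the raw bytes, which only strengthens the point):

```python
import math
import struct
import zlib

theta_deg = 54  # the angle behind the matrix above

# Naive encoding: four float32s, then DEFLATE -- what weight-level ZIP sees.
c, s = math.cos(math.radians(theta_deg)), math.sin(math.radians(theta_deg))
raw = struct.pack("4f", c, -s, s, c)         # 16 bytes
print(len(raw), len(zlib.compress(raw, 9)))  # no rotational structure is found

# Structured encoding: one byte for the angle, given a decoder that
# already knows the "rotation matrix" schema.
print(len(struct.pack("B", theta_deg)))      # 1 byte
```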
This phenomenon should get worse for bigger models. One reason is the following: if we believe that the NN uses superposition, then there is an overcomplete basis in which the computations are done (more) sparsely. If we don’t factor that in, ZIP will miss that structure (caveat: this is my intuition, I don’t have empirical results to back it up).
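As a toy illustration of what I mean (entirely synthetic; the sparsity level and the random basis are made up for the demo): a matrix that is sparse in some rotated basis looks dense to a byte-level coder in the standard basis, even though it carries the same information.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Sparse "ground truth": 95% of entries are exactly zero.
sparse = rng.normal(size=(n, n)) * (rng.random((n, n)) < 0.05)

# A random orthogonal change of basis hides the sparsity from a byte-level coder.
q, _ = np.linalg.qr(rng.normal(size=(n, n)))
dense = q @ sparse @ q.T

def zipped_size(w: np.ndarray) -> int:
    # Quantize to float16 so both versions are byte streams of equal length.
    return len(zlib.compress(w.astype(np.float16).tobytes(), level=9))

print("sparse basis:  ", zipped_size(sparse))  # mostly zero bytes -> compresses well
print("standard basis:", zipped_size(dense))   # same information, compresses far less
```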
I think ZIP might pick up on some structure (see e.g. here), just as in my example above it would pick up on some sort of symmetry. But the encoder/decoder in your compression scheme should have access to more information about the model you are compressing. You might want to check out this post for an attempt at compressing model performance using interpretations.