drocta comments on Parsing Chris Mingard on Neural Networks

drocta 7 May 2021 20:10 UTC
LW: 10 AF: 2
AF
This comment I’m writing is mostly because this prompted me to attempt to see how feasible it would be to computationally enumerate the conditions for the weights of small networks like the 2 input 2 hidden layer 1 output in order to implement each of the possible functions. So, I looked at the second smallest case by hand, and enumerated conditions on the weights for a 2 input 1 output no hidden layer perceptron to implement each of the 2 input gates, and wanted to talk about it. This did not result in any insights, so if that doesn’t sound interesting, maybe skip reading the rest of this comment. I am willing to delete this comment if anyone would prefer I do that.
Of the 16 2-input-1-output gates, 2 of them, xor and xnor, can’t be done with the perceptrons with no hidden layer (as is well known), for 8 of them, the conditions on the 2 weights and the bias for the function to be implemented can be expressed as an intersection of 3 half spaces, and the remaining 6 can of course be expressed with an intersection of 4 (the maximum number that could be required, as for each specific input and output, the condition on the weights and bias in order to have that input give that output is specified by a half space, so specifying the half space for each input is always enough).

The ones that require 4 are: the constant 0 function, the constant 1 function, return the first input, return the second input, return the negation of the first input, and return the negation of the second input.
These seem, surprisingly, among the simplest possible behaviors. They are the ones which disregard at least one input. It seems a little surprising to me that these would be the ones that require an intersection of 4 half spaces.

I haven’t computed the proportions of the space taken up by each region so maybe the ones that require 4 planes aren’t particularly smaller. And I suppose with this few inputs, it may be hard to say that any of these functions are really substantially more simple than any of the rest of them. Or it may be that the tendency for simpler functions to occupy more space only shows up when we actually have hidden layers and/or have many more nodes.
Here is a table (x and y are the weights from a and b to the output, and z is the bias on the output):
outputs for the different inputs when this function is computed
0000 (i.e. the constant 0) z<0, x+y+z<0, x+z<0, y+z<0
0001 (i.e. the and gate) x+y+z>0, x+z<0, y+z<0
0010 (i.e. a and not b) z<0, x+y+z<0, x+z>0
0011 (i.e. if input a) z<0, x+y+z>0, x+z>0, y+z<0
0100 (i.e. b and not a) z<0, x+y+z<0, y+z>0
0101 (i.e. if input b) z<0, x+y+z>0, x+z<0, y+z>0
0110 (i.e. xor) impossible
0111 (i.e. or) z<0, x+z>0, y+z>0
1000 (i.e. nor) z>0, x+z<0, y+z<0
1001 (i.e. xnor) impossible
1010 (i.e. not b) z>0, x+y+z<0, x+z>0, y+z<0
1011 (i.e. b->a ) z>0, x+y+z>0, x+z<0
1100 (i.e. not a) z>0, x+y+z<0, x+z<0, y+z>0
1101 (i.e. a->b ) z>0, x+y+z>0, y+z<0
1110 (i.e. nand ) x+y+z<0, x+z>0, y+z>0
1111 (i.e. constant 0) z>0, x+z>0, y+z>0, x+y+z>0
- Alex Flint 8 May 2021 14:55 UTC
  LW: 4 AF: 2
  AF Parent
  Very very cool. Thank you for this drocta. What would it take to map out the sizes of the volumes corresponding to each of these mappings? Also, could you perhaps compute the exact Kolmogorov complexity of these mappings in some particular description language, since they are so small? It would be super interesting to me to assemble a table of volumes and Kolmogorov complexities for each of these small mappings. It may then be possible to write some code that does the same for 3-input and 4-input mappings.
  - drocta 25 May 2021 5:23 UTC
    1 point
    Parent
    For the volumes, I suppose that because scaling all of these parameters by the same positive constant doesn’t change the function computed, it would make sense to compute the volumes of the corresponding regions of the cube, and this would handle the issues with these regions having unbounded size.
    (this would still work with more parameters, it would just be a higher dimensional sphere)
    Er, would that give the same thing as the limit if we took the parameters within a cube?
    Anyway, at least in this case, if we use the “projected onto the sphere” case, we could evaluate the areas by splitting the regions (which would be polygons of some kind, with edges being arcs of great circles) into triangles, and then using the formulas for the areas of triangles on a sphere. Actually, they might already be triangles, I’m not sure.
    Would this work in higher dimensions? I don’t know of formulas for computing the measure of a n-simplex (with flat facets or whatever the right terminology is) within an n-sphere, but I suspect that they shouldn’t be too bad?
    I’m not sure which is the more sensible thing to measure, the volumes of the intersection of the half spaces (intersected with a large cube centered at the origin and aligned with the coordinate axes), or the volume (one dimension lower) of that intersected-with/projected-onto the unit sphere.
    Well, I guess if we assume that the coefficients are identically and independently distributed with a Gaussian distribution, then that would be a fairly natural choice, and should result in things being symmetric about rotations in the origin, which would seem to point to the choice of projecting it all to the (hyper-)sphere.
    Well, I suppose in either case (whether on the sphere or in a cube), even before trying to apply some formulas about the area of a triangle on a sphere, there’s always the “just take the integral” option.
    (in the cube option, this would I think be more straightforwards. Just have to do a triple integral (more in higher dimensions) of 1 with linear inequalities for the bounds. No real issues should show up.)
    I’ll attempt it with the conditions for “and” for the “on the sphere” case, to check the feasibility.
    If we have x+y+z>0, x+z<0, y+z<0, then we necessarily also have z<0 , x>0, y>0 , in particular x<-z , y<-z . If we have x,y,z on the unit sphere, then we have x^2+y^2+z^2=1 . So, for each value of z (which must be strictly between −1 and 0) we have x^2 + y^2 = 1 - z^2 , and because we have x>0 and y>0 , for a given z, for each value of x there is exactly one value of y, and visa versa.
    So, y = sqrt(1 - z^2 - x^2) , and so we have x + sqrt(1 - z^2 - x^2) > -z , …
    this is somewhat more difficult to calculate than I had hoped.
    Still confident that it can be done, but I shouldn’t finish this calculation right now due to responsibilities.
    It looks like, at least in this case with 3 parameters, that it would probably be easier to use the formulas for the area of triangles on a sphere, but I wouldn’t be surprised if, when generalizing to higher dimensions, doing it that way becomes harder.
    It looks like Chris Mingard’s reply has nice results which say much of what I think one would want from this direction? Well, it is less “enumerate them specifically”, and more “for functions which have a given proportion of outputs being 1″, but, still. (also I haven’t read it, just looked briefly at it)
    I don’t know what particular description language you would want to use for this. I feel like this is such a small case that small differences in choice of description language might overwhelm any difference in complexity that these would have within the given description language?
    - Alex Flint 25 May 2021 17:06 UTC
      2 points
      Parent
      Yes it does seem challenging to compute realistic complexity measures for such small functions. Perhaps we could just look at the mappings ordered by their volume in parameter space, and check whether the mappings at the top of that ordering “seem” less complex than the mappings at the bottom.
    - drocta 26 May 2021 2:40 UTC
      1 point
      Parent
      I’ve now computed the volumes within the [-a,a]^3 cube for and, or, and the constant 1 function. I was surprised by the results.
      (I hadn’t considered that the ratios between the volumes will not depend on the size of the cube)
      If we select x,y,z uniformly at random within this cube, the probability of getting the and gate is ¹⁄₄₈, the probability of getting the or gate is ²⁄₄₈, and the probability of getting the constant 1 function is ¹³⁄₄₈ (more than ¹⁄₄).
      This I found quite surprising, because of the constant 1 function requiring 4 half planes to express the conditions for it.
      So, now I’m guessing that the ones that required fewer half spaces to specify, are the ones where the individual constraints are already implying other constraints, and so actually will tend to have a smaller volume.
      On the other hand, I still haven’t computed any of them for if projecting onto the sphere, and so this measure kind of gives extra weight to the things in the directions near the corners of the cube, compared to the measure that would be if using the sphere.
- Chris Mingard 12 May 2021 13:50 UTC
  3 points
  Parent
  Check out https://arxiv.org/pdf/1909.11522.pdf where we do some similar analysis of perceptrons but in higher dimensions. Theorem 4.1 shows that there is an anti-entropy bias—in other words, functions with either mostly 0s or mostly 1s are exponentially more likely to show up than expected under a uniform prior—which holds for perceptrons of any dimension. This proves a (fairly trivial) bias towards simple functions, although it doesn’t say anything about why a function like 010101010101… appears more frequently than other functions in the maximum-entropy class.