For what notion is the first problem complicated, and the second simple?
I might be out of my depth here, but—could it be that sparse parity with noise is just objectively “harder than it sounds” (because every bit of noise inverts the answer), whereas protein folding is “easier than it sounds” (because if it weren’t, evolution wouldn’t have solved it)?
Just because the log-depth xor tree is small, doesn’t mean it needs to be easy to find, if it can hide amongst vastly many others that might have generated the same evidence … which I suppose is your point. (The “function approximation” frame encourages us to look at the boolean circuit and say, “What a simple function, shouldn’t be hard to noisily approximate”, which is not exactly the right question to be asking.)
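For a concrete sense of that search space, here's a minimal sketch of the usual sparse-parity-with-noise setup (the parameter values n=50, k=3, and the 10% noise rate are just illustrative): the hidden parity is tiny, but it hides among every other size-k subset that could have produced the same noisy evidence.

```python
import itertools
import math
import random

def sparse_parity_with_noise_samples(n=50, k=3, noise=0.1, m=10, seed=0):
    """Draw m (input, label) pairs for a sparse-parity-with-noise setup:
    the label is the XOR of a hidden size-k subset of the n input bits,
    and each label is independently flipped with probability `noise`."""
    rng = random.Random(seed)
    hidden = rng.sample(range(n), k)          # the secret parity set
    samples = []
    for _ in range(m):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = sum(x[i] for i in hidden) % 2     # true parity over the hidden bits
        if rng.random() < noise:
            y ^= 1                            # label noise flips the answer
        samples.append((x, y))
    return hidden, samples

# The "small" target has to be found among all the size-k subsets it could hide in:
n, k = 50, 3
print(math.comb(n, k))   # 19600 candidate parities for n=50, k=3; grows roughly as n^k
```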
I think it’s also a question of the learning techniques used. It seems like generalizable solutions to xor involve at least one of the following two:
noticing that many of the variables in the input can be permuted without changing the output, so it’s only the counts of the variables that matter
noticing how the output changes or doesn’t change when you start with one input and flip the variables one at a time
But current neural networks can’t use either of these techniques, partly because they don’t align well with the training paradigm. Both of them more or less require the network to pick which inputs it gets ground truth for, whereas training on a distribution hands it randomized (input, output) pairs (see the query sketch below).
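To make that concrete, here's a minimal sketch of the second technique under query access, assuming a hypothetical noisy-parity `oracle` (the oracle, trial count, and hidden set below are all illustrative): flipping one coordinate at a time and majority-voting over repeated queries picks out exactly the bits the parity depends on.

```python
import random

def find_relevant_bits(oracle, n, trials=20, seed=0):
    """With *query* access to a noisy parity oracle, flip one coordinate at a
    time and watch whether the majority-vote answer changes. Bits whose flip
    changes the output belong to the hidden parity set.
    `oracle` is a hypothetical function x -> noisy label."""
    rng = random.Random(seed)

    def majority_label(x):
        # Repeat queries on the same input to average out label noise.
        votes = sum(oracle(x) for _ in range(trials))
        return int(votes > trials / 2)

    base = [rng.randint(0, 1) for _ in range(n)]
    base_label = majority_label(base)
    relevant = []
    for i in range(n):
        flipped = base.copy()
        flipped[i] ^= 1
        if majority_label(flipped) != base_label:
            relevant.append(i)        # flipping a relevant bit flips the parity
    return relevant

# Toy oracle whose hidden parity set is {2, 5, 7}, with 10% label noise:
hidden = {2, 5, 7}
rng = random.Random(1)
oracle = lambda x: (sum(x[i] for i in hidden) % 2) ^ (rng.random() < 0.1)
print(find_relevant_bits(oracle, n=10))   # expected (with high probability): [2, 5, 7]
```

A learner that only ever sees i.i.d. samples from the input distribution never gets to make these neighboring queries, which is the point above.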
To “humanize” these problems, we can:
Break intuitive permutation invariance by using different symbols in each slot. So instead of an input looking like 010110 or 101001 or 111000, it might look like A😈4z👞i or B🔥😱Z😅0 or B😈😱Z😅i.
Break the ability to notice the effects of single bitflips by just seeing random strings rather than neighboring strings.
This makes it intuitively much harder to me.
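For concreteness, here's a minimal sketch of that first re-encoding (the symbol pool and per-slot pairs below are arbitrary illustrations, not the exact ones above): each position gets its own two-symbol alphabet, so nothing visually suggests that slots are interchangeable.

```python
import random

def humanize(bitstring, seed=0):
    """Re-encode a binary string so each position uses its own two-symbol
    alphabet, hiding the permutation symmetry between slots."""
    rng = random.Random(seed)              # fixed seed => a consistent encoding across strings
    symbols = "ABZz04i😈🔥😱😅👞"
    alphabets = [rng.sample(symbols, 2) for _ in bitstring]
    return "".join(alphabets[i][int(b)] for i, b in enumerate(bitstring))

print(humanize("010110"))
print(humanize("101001"))
print(humanize("111000"))
```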