The biologist answer there seems to be question-begging. What reason is there to think the brain’s algorithms aren’t parallelizable? Animals can’t split and merge themselves, or afford the costs, or store datasets for exact replay, etc., so they would be unable to do that whether or not it was possible, and so they provide zero evidence about whether their internal algorithms would be able to do it. You might argue that there might be multiple ‘families’ of algorithms all delivering animal-level intelligence, some of which are parallelizable and some not, and that for lack of any incentive animals happened to evolve a non-parallelizable one, but this is pure speculation and can’t establish that the non-parallelizable one is superior to the others (much less that it is the only such family).
From the ML or statistics view, it seems hard for parallelization in learning not to be useful. It’s a pretty broad principle that more data is better than less data. Your neurons are always estimating local gradients with whatever local learning rule they have; these gradients are (extremely) noisy, and can be improved with more datapoints or rollouts to better estimate the update that jointly optimizes all of the tasks. Almost by definition, this seems superior to getting less data one point at a time and doing noisy updates that neglect most of the tasks.
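To make the variance point concrete, here is a minimal numpy sketch (purely illustrative, not part of the original exchange): averaging n independent noisy estimates of the same gradient shrinks the estimation error roughly as 1/√n, which is the statistical sense in which more parallel data gives a better joint update.

```python
# Illustrative only: averaging n independent noisy estimates of a gradient
# reduces the standard error of the estimate roughly as 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])    # pretend "true" gradient
noise_std = 5.0                            # per-sample noise level

def noisy_grad_estimate(n_samples):
    """Average n_samples independent noisy estimates of true_grad."""
    noise = rng.normal(0.0, noise_std, size=(n_samples, true_grad.size))
    return (true_grad + noise).mean(axis=0)

for n in [1, 10, 100, 1000]:
    errors = [np.linalg.norm(noisy_grad_estimate(n) - true_grad) for _ in range(200)]
    print(f"n={n:5d}  mean estimation error ~ {np.mean(errors):.3f}")
```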
If I am a DRL agent and I have n hypotheses about the current environment, why am I harmed by exploring all n in parallel with n copied agents, observing the updates, and updating my central actor with them all? Even if they don’t produce direct gradients (let’s handwave an architecture where somehow it’d be bad to feed them all in directly, say because it’s very fragile to off-policyness), they are still producing observations I can use to update my environment model for planning, and I can go through them and do learning before I take any more actions. (If you were standing in front of a death maze, watching fellow humans run through it and get hit by the swinging blades or acid mists or ironically-named boulders, you’d surely appreciate being able to watch as many runs as possible by your fellow humans rather than running it yourself.)
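As a deliberately toy sketch of that “n copies, one central learner” picture (the environment and all names here are made up for illustration), each of the n hypotheses is just a candidate action tried in its own copy of the environment, and the central value estimates absorb every copy’s observation before the next round:

```python
# Toy illustration (hypothetical environment, not from the original discussion):
# n copied agents each try one candidate action in a cloned environment, and a
# single central learner updates its value estimates from all the observations
# before the next round of acting.
import random

class ToyEnv:
    """One-step environment: each action has an unknown expected reward."""
    def __init__(self, true_rewards):
        self.true_rewards = true_rewards
        self.rng = random.Random()

    def step(self, action):
        return self.true_rewards[action] + self.rng.gauss(0.0, 1.0)

def parallel_rollouts(env_factory, actions):
    """Run each candidate action in its own environment copy."""
    return [(a, env_factory().step(a)) for a in actions]

true_rewards = [0.1, 0.5, 0.9, 0.2]
value_estimates = {a: 0.0 for a in range(len(true_rewards))}
counts = {a: 0 for a in range(len(true_rewards))}

for _ in range(50):                         # 50 rounds of parallel exploration
    observations = parallel_rollouts(lambda: ToyEnv(true_rewards), list(value_estimates))
    for action, reward in observations:     # central learner absorbs everything
        counts[action] += 1
        value_estimates[action] += (reward - value_estimates[action]) / counts[action]

print("estimated values:", {a: round(v, 2) for a, v in value_estimates.items()})
print("greedy action:", max(value_estimates, key=value_estimates.get))
```

The point is only about serial depth: each round gathers in one step what a lone agent would need n sequential steps to see.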
In particular, for non-superhuman AIs-in-training, we already have tons of pedagogical materials like human textbooks and lectures. So I don’t see teams-of-AIs-who-talk-to-each-other being all that helpful in getting to superhuman faster.
If we look at some of these algorithms, it’s even less compelling to argue that there’s some deep intrinsic reason we want to lock learning to small serial steps. Look at expert iteration in AlphaZero, where the improved estimates that the NN is repeatedly retrained on don’t even come from the NN itself, but from an ‘expert’ (e.g. the NN + tree search); what would we gain by ignoring the expert’s provably superior board-position evaluations (which would beat the NN if they played) and forcing serial learning? At the very least, given that MuZero/AlphaZero are so good, this serial biological learning process, whatever it may be, has failed to produce superior results to parallelized learning, raising questions about exactly what circumstances yield these supposedly mandatory serial benefits...
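For concreteness, the expert-iteration loop can be caricatured in a few lines. The runnable toy below is only a sketch of the structure, with a value table standing in for the network and a one-step lookahead standing in for tree search; it is nothing like the actual AlphaZero code, but it shows the key move of training the learner toward search-improved targets rather than its own raw outputs.

```python
# Toy analogue of expert iteration (illustrative only, not AlphaZero): a value
# table (the "network") is repeatedly retrained toward targets produced by a
# one-step lookahead "expert" that itself uses the current table.
import numpy as np

# Deterministic chain of states 0..N-1; stepping right from N-2 reaches the
# terminal goal state N-1 and earns reward 1.
N = 10
GAMMA = 0.9
values = np.zeros(N)                        # current value estimates ("network")

def expert_target(state):
    """Depth-1 lookahead over the current estimates: the 'expert'."""
    if state == N - 1:
        return 0.0                          # terminal state: no future reward
    best = -np.inf
    for move in (-1, +1):
        nxt = min(max(state + move, 0), N - 1)
        reward = 1.0 if nxt == N - 1 else 0.0
        best = max(best, reward + GAMMA * values[nxt])
    return best

for _ in range(50):
    targets = np.array([expert_target(s) for s in range(N)])
    values = 0.5 * values + 0.5 * targets   # "retrain" toward the expert's targets

print("learned values:", np.round(values, 2))
```

The targets always come from a strictly better evaluator than the learner’s unaided output, which seems to be exactly the step a “learning must be serial and self-generated” view would forbid.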
The biologist answer there seems to be question-begging
Yeah, I didn’t bother trying to steelman the imaginary biologist. I don’t agree with them anyway, and neither would you.
(I guess I was imagining the biologist belonging to the school of thought (which again I strongly disagree with) that says that intelligence doesn’t work by a few legible algorithmic principles, but is rather a complex intricate Rube Goldberg machine, full of interrelated state variables and so on. So we can’t just barge in and make some major change in how the step-by-step operations work, without everything crashing down. Again, I don’t agree, but I think something like that is a common belief in neuroscience/CogSci/etc.)
it seems hard for parallelization in learning to not be useful … why am I harmed …
I agree with “useful” and “not harmful”. But an interesting question is: Is it SO helpful that parallelization can cut the serial (subjective) time from 30 years to 15 years? Or what about 5 years? 2 years? I don’t know! Again, I think at least some brain-like learning has to be serial (e.g. you need to learn about multiplication before nonabelian cohomology), but I don’t have a good sense for just how much.