For the greatest circuit density, energy efficiency, and speed at low cost, you need to use analog synapses, but in doing so you largely give up the ability to easily transfer the knowledge out of the system: it becomes more ‘mortal’, as Hinton recently argues.
This seems like a small tradeoff, and not a big enough deal to restore anything like human mortality, with all its enormous global effects. It may be much harder to copy weights off an idiosyncratic mess of analogue circuits modified in-place by their training to maximize energy efficiency than it is to run cp foo.pkl bar.pkl, absolutely, but the increase in difficulty here seems more on par with ‘a small sub-field with a few hundred grad students/engineers for a few years’ than ‘the creation of AGI’, so one can assume it’d be solved almost immediately should it ever actually become a problem.
For example, even if it’s ultra-miniaturized, you can tap connections to optionally read off activations between many pairs of layers, which affects only a small part of the chip and doesn’t eliminate the miniaturization or energy savings; with those tapped embeddings summarizing each group of layers, you can then do knowledge distillation to another such neuromorphic computer (or a smaller one). Knowledge distillation, or rather self-distillation, costs little and works well. Or, since you can presumably set the analogue values even if you can’t read them, once you have a model worth copying, you can pay the one-time cost of distilling it out to a more von Neumann computer, one where you can easily read the weights out, and thence copy it onto all the other neuromorphics henceforth. Or you can reverse-engineer the weights themselves: probe the original (and any copies) with synthetic data, flipping a bit at a time, to run finite differences on outputs like activations/embeddings, starting at the lowest available tap, and eventually reconstruct the equivalent weights group by group. (This may require a lot of probes, but these systems by definition run extremely fast, and since you’re only probing a small part at a time, run even faster than that.) These are just off-the-cuff ideas, and I’m sure you could think of several better approaches if you tried. So I don’t expect ‘mortal’ NNs to be all that different from our current ‘immortal’ NNs, or things like FPGAs.
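To make the tap-and-distill route concrete, here is a minimal sketch in PyTorch. The tap() readout and the frozen teacher network are hypothetical stand-ins for the analog device’s tapped activations (not any real neuromorphic API); a copyable student is trained to match the device’s activations at each tap point:

```python
import torch
import torch.nn as nn

TAPS, DIM = 4, 256

# Stand-in for the analog device: a frozen network whose internals we
# pretend we cannot read, only tap between layer groups.
teacher = nn.ModuleList(
    [nn.Sequential(nn.Linear(DIM, DIM), nn.Tanh()) for _ in range(TAPS)]
).eval()

def tap(x: torch.Tensor, k: int) -> torch.Tensor:
    """Hypothetical readout: the device's activations at tap point k."""
    with torch.no_grad():
        for block in teacher[: k + 1]:
            x = block(x)
    return x

# Copyable digital student with the same tap structure.
student = nn.ModuleList(
    [nn.Sequential(nn.Linear(DIM, DIM), nn.Tanh()) for _ in range(TAPS)]
)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(64, DIM)  # synthetic probe inputs
    h, loss = x, torch.tensor(0.0)
    for k in range(TAPS):
        h = student[k](h)
        # Match the device's tapped activations, group by group.
        loss = loss + nn.functional.mse_loss(h, tap(x, k))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The same loop covers the one-time distill-to-von-Neumann case: once the student’s weights are on a readable host, they can be written onto any number of other neuromorphic chips.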
Largely agreed, which is partly why I said only more ‘mortal’, with ‘mortal’ in scare quotes. Or put another way, even the full neuromorphic-analog route isn’t as problematic to copy weights out of as an actual brain, and I expect actual uploading to be possible eventually, so… it’s mostly a matter of copy speeds and expenses, as you point out, and even for the most hardcore analog neuromorphic designs, like brains, you can still exploit sophisticated distillation techniques as you discuss. But it does look like there are tradeoffs that increase the copy-out cost as you move to the most advanced neuromorphic designs.
This whole thing is just a thought experiment, correct? “What would we have to do to mimic the brain’s energy efficiency?” Because analog synapses where we leave out the network of analog gates connecting any given synapse to an ADC (something current prototype analog inference accelerators do use, and analog FPGAs do exist) would be kinda awful.
The reason is https://openai.com/research/emergent-tool-use . What they found in this paper is that you want to make the updates to your agent’s policy in large batches. That means you need to be able to copy the policy many times across a fleet of hardware running separate agents, and estimate the expected value and error of the given policy across a large batch of episodes. The copying requires reading the values exactly, so they need to be digital, and there is no benefit to modifying the policy rapidly in real time.
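As a toy illustration of that loop (not the actual setup from the linked paper, which uses PPO at far larger scale; the Policy class and reward here are made up), here is the copy-everywhere, update-in-one-large-batch pattern with a single digital parameter:

```python
import copy
import math
import random

class Policy:
    """Trivial one-parameter stochastic policy over two actions."""
    def __init__(self, w: float = 0.0):
        self.w = w  # digital weight: copying it is exact and free

    def prob(self) -> float:
        return 1.0 / (1.0 + math.exp(-self.w))  # P(action = 1)

    def act(self) -> int:
        return 1 if random.random() < self.prob() else 0

def rollout(policy: Policy) -> tuple[int, float]:
    """One-step episode; action 1 happens to pay off."""
    a = policy.act()
    return a, (1.0 if a == 1 else 0.0)

central = Policy()
for update in range(100):
    # Copy the policy bit-exactly across a 'fleet' of independent workers.
    fleet = [copy.deepcopy(central) for _ in range(256)]
    episodes = [rollout(p) for p in fleet]
    # One large-batch REINFORCE-style update from the pooled episodes.
    baseline = sum(r for _, r in episodes) / len(episodes)
    p = central.prob()
    grad = sum((r - baseline) * (a - p) for a, r in episodes) / len(episodes)
    central.w += 1.0 * grad
```

The exact copies are what make the pooled gradient estimate valid; if each worker’s weights drifted (as in-place analog learning would force), the batch would no longer describe one policy.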
The reason we have brains that learn rapidly in real time, overfitting to a small number of strong examples, is that this was all that was possible with the hardware nature could evolve. It is suboptimal.