I hadn’t really considered the possibility of a brain-inspired/neuromorphic AI, thanks for the points.
(2) seems interesting; as I understand it, you’re basically suggesting that the error would occur gradually and the system would work to prevent it. Although maybe the AI realises it’s getting positive feedback for bad things and keeps doing them, or something (I don’t really know; I’m also a little sleep-deprived, and things like this tend to do my head in). Like, if I hated beer and then suddenly started liking it, I’d probably continue drinking it. Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever. Perhaps the system could implement checksums of some sort to catch this kind of corruption automatically?
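To make the checksum idea a bit more concrete, here’s a minimal sketch of what I’m picturing, assuming a made-up reward_weights table and a SHA-256 digest; none of this is from an actual system, it’s just the general shape of “hash the reward parameters, and refuse to act on them if the hash stops matching”:

```python
import hashlib
import struct

def reward_checksum(weights):
    """Hash the raw bytes of the reward parameters so any bit flip is detectable."""
    packed = b"".join(struct.pack("<d", w) for w in weights)
    return hashlib.sha256(packed).hexdigest()

# Hypothetical reward weights for a few concepts (purely illustrative).
reward_weights = [0.8, -0.2, 0.5]
stored_digest = reward_checksum(reward_weights)

def verify_before_use(weights, expected_digest):
    """Refuse to act on reward parameters whose checksum no longer matches."""
    if reward_checksum(weights) != expected_digest:
        raise RuntimeError("Reward parameters corrupted; halting and alerting the operator.")
    return weights

# A cosmic-ray-style sign flip changes the underlying bytes, so verification fails:
reward_weights[0] = -reward_weights[0]
# verify_before_use(reward_weights, stored_digest)  # raises RuntimeError
```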
A similar point to (3) was raised by Dach in another thread, although I’m uncertain about this since GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug. I don’t doubt that it would be different with a neuromorphic system, though.
Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever.
The whole point of the reward signals is to change the AI’s motivations; we design the system so that this will definitely happen. But a full motivation system might consist of 100,000 neocortical concepts flagged with various levels of “this concept is rewarding”, and in each processing cycle where you get subcortical feedback, maybe only one or two of those flags would get rewritten, for example. Then it would spend a while feeling torn and conflicted about lots of things as its motivation system gets gradually turned around. I’m thinking that we can and should design AGIs such that if one feels very torn and conflicted about something, it stops and alerts the programmer; and there should be a period where that happens in this scenario.
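To gesture at the shape of what I mean, here’s a toy sketch; the numbers, names, and the particular conflict test are all made up for illustration, not a claim about how the real thing would be built. Reward flags get rewritten a couple at a time, and if too many recent rewrites reverse a flag’s valence, the system pauses and alerts the programmer:

```python
import random

NUM_CONCEPTS = 100_000       # neocortical concepts, each flagged with "how rewarding is this?"
REWRITES_PER_CYCLE = 2       # only a couple of flags get rewritten per feedback cycle
CONFLICT_WINDOW = 1_000      # how many recent rewrites to remember
CONFLICT_THRESHOLD = 0.3     # fraction of recent rewrites that reversed sign -> "torn and conflicted"

reward_flags = {i: random.uniform(-1.0, 1.0) for i in range(NUM_CONCEPTS)}
recent_rewrites = []         # (concept_id, old_value, new_value) for recent updates

def apply_subcortical_feedback(proposed_updates):
    """Rewrite at most REWRITES_PER_CYCLE reward flags, then check for motivational conflict."""
    for concept_id, new_value in list(proposed_updates.items())[:REWRITES_PER_CYCLE]:
        old_value = reward_flags[concept_id]
        reward_flags[concept_id] = new_value
        recent_rewrites.append((concept_id, old_value, new_value))
    del recent_rewrites[:-CONFLICT_WINDOW]

    # "Torn and conflicted": a large share of recent rewrites reversed a flag's valence.
    reversals = sum(1 for _, old, new in recent_rewrites if old * new < 0)
    if recent_rewrites and reversals / len(recent_rewrites) > CONFLICT_THRESHOLD:
        raise RuntimeError("Motivation system heavily conflicted; pausing and alerting the programmer.")
```

The point is just that because only one or two flags change per cycle, the reversal rate climbs well before the whole motivation system has been turned around, so there’s a window in which the alert fires.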
GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug
I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.
I see. I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.
I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.
I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behaviour without the AI neglecting to consider new strategies. Presumably that suggests it’s also a possibility with cosmic-ray or other errors.
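For illustration, here’s a toy hill-climbing sketch (nothing to do with GPT-2’s actual setup, just a made-up reward function) showing how a flipped sign makes a learner actively seek what the true reward calls worst, while still exploring new candidate strategies the whole time:

```python
import random

# Toy setup: the intended reward peaks at x = 3.
def true_reward(x):
    return -(x - 3.0) ** 2

SIGN = -1.0  # the sign-flipping bug: the learner maximises SIGN * true_reward instead

def buggy_reward(x):
    return SIGN * true_reward(x)

# Simple hill climbing: keep proposing new candidate strategies (values of x)
# and adopt whichever scores highest under the (sign-flipped) reward.
random.seed(0)
x = 0.0
for _ in range(1_000):
    candidate = x + random.gauss(0, 0.5)
    if buggy_reward(candidate) > buggy_reward(x):
        x = candidate

# With SIGN = -1.0 the learner drifts ever further from x = 3: it is actively
# optimising for what the true reward considers worst, not merely failing to explore.
print(f"final x = {x:.1f}, true reward = {true_reward(x):.1f}")
```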
I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.
I’m not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but this seems quite a bit higher than what other people on this forum would put, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch / TensorFlow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don’t have any tractable directions for progress in that scenario (or just don’t know enough to comment), or maybe they have radically different ideas than me about how the brain works and therefore don’t distinguish between prosaic AGI and brain-inspired AGI.
Understanding brain algorithms is a research program that thousands of geniuses are working on night and day, right now, as we speak, and the conclusion of the research program is guaranteed to be AGI. That seems like a pretty good reason to put at least some weight on it! I put even more weight on it because I’ve worked a lot on trying to understand how the neocortical algorithm works, and I don’t think that the algorithm is all that complicated (cf. “cortical uniformity”), and I think that ongoing work is zeroing in on it (see here).