In a brain-like AGI, as I imagine it, the “neocortex” module does all the smart and dangerous things, but it’s a (sorta)-general-purpose learning algorithm that starts from knowing nothing (random weights) and gets smarter and smarter as it trains. Meanwhile a separate “subcortex” module is much simpler (dumber) but has a lot more hardwired information in it, and this module tries to steer the neocortex module to do things that we programmers want it to do, primarily (but not exclusively) by calculating a reward signal and sending it to the neocortex as it operates. In that case, let’s look at 3 scenarios:
1. The neocortex module is steered in the opposite direction from what was intended by the subcortex’s code, and this happens right from the beginning of training.
Then the neocortex probably wouldn’t work at all. The subcortex is important for capabilities as well as goals; for example, the subcortex (I believe) has a simple human-speech-sound detector, and it prods the neocortex that those sounds are important to analyze, and thus a baby’s neocortex learns to model human speech but not to model the intricacies of bird songs. The reliance on the subcortex for capabilities is less true in an “adult” AGI, but very true in a “baby” AGI, I think; I’m skeptical that a system can bootstrap itself to superhuman intelligence without some hardwired guidance / curriculum early on. Moreover, in the event that the neocortex does work, it would probably misbehave in obvious ways very early on, before it knows anything about the world, what a “person” is, etc. Hopefully there would be human or other monitoring of the training process that would catch that.
2. The neocortex module is steered in the opposite direction from what was intended by the subcortex’s code, and this happens when it is already smart.
The subcortex doesn’t provide a goal system as a nicely-wrapped package to be delivered to the neocortex; instead it delivers little bits of guidance at a time. Imagine that you’ve always loved beer, but now when you drink it, you hate it; it’s awful. You would probably stop drinking beer, but you would also say, “what’s going on?” Likewise, the neocortex would have developed a rich interwoven fabric of related goals and beliefs, much of which supports itself with very little ground-truth anchoring from subcortex feedback. If the subcortex suddenly changes its tune, there would be a transition period when the neocortex would retain most of its goal system from before, and might shut itself down, email the programmers, hack into the subcortex, or who knows what, to avoid getting turned into (what it still mostly sees as) a monster. The details are contingent on how we try to steer the neocortex.
3. The neocortex’s own goal system flips sign suddenly.
Then the neocortex would suddenly become remarkably ineffective. The neocortex uses the same system for flagging concepts as instrumental goals and flagging concepts as ultimate goals, so with a sign flip, it gets all the instrumental goals wrong; it finds it highly aversive to come up with a clever idea, or to understand something, etc. etc. It would take a lot of subcortical feedback to get the neocortex working again, if that’s even possible, and hopefully the subcortex would recognize a problem.
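To make that concrete, here’s a minimal toy sketch in Python (the concept names and valence numbers are made up for illustration; nothing here is a claim about the actual architecture) of why a sign flip looks like incompetence rather than competent malice, given that instrumental and ultimate goals share the same flagging mechanism:

```python
# Hypothetical valence flags: positive = "seek this", negative = "avoid this".
# Instrumental concepts (planning, understanding) use the same kind of flag
# as the ultimate goal, so a global sign flip poisons both.
valence = {
    "ultimate_goal":            +1.0,
    "come_up_with_clever_idea": +0.6,  # instrumental
    "understand_the_situation": +0.5,  # instrumental
    "make_a_plan":              +0.4,  # instrumental
}

def flip_sign(flags):
    """Model the sign-flip bug: every flag gets negated at once."""
    return {concept: -v for concept, v in flags.items()}

flipped = flip_sign(valence)

# After the flip, the very activities needed to pursue *any* goal
# (clever ideas, understanding, planning) are now aversive, so the system
# isn't a competent negative of itself; it's just ineffective.
print([concept for concept, v in flipped.items() if v < 0])
```

Negating every flag also negates the machinery of getting anything done, which is why I’d expect the result to be “remarkably ineffective” rather than a competently sign-flipped optimizer.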
This is just brainstorming off the top of my (sleep-deprived) head. I think you’re going to say that none of these is a rock-solid assurance that the problem could never, ever happen, and I’ll agree.
I hadn’t really considered the possibility of a brain-inspired/neuromorphic AI, thanks for the points.
(2) seems interesting; as I understand it, you’re basically suggesting that the error would occur gradually & the system would work to prevent it. Although maybe the AI realises it’s getting positive feedback for bad things and keeps doing them, or something (I don’t really know; I’m also a little sleep-deprived, and things like this tend to do my head in). Like, if I hated beer then suddenly started liking it, I’d probably continue drinking it. Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever. Perhaps the system would implement checksums of some sort to catch this automatically?
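For what it’s worth, the bare idea of checksumming the stored reward/goal parameters is easy to sketch. This is only a toy illustration under my own assumptions (in a real system the flags legitimately change as it learns, and the checking machinery would itself need protecting), not a claim about how such an AGI would actually be built:

```python
import hashlib
import numpy as np

def checksum(reward_flags: np.ndarray) -> str:
    """Hash the raw bytes of the stored reward/valence parameters."""
    return hashlib.sha256(reward_flags.tobytes()).hexdigest()

# Record a checksum while the flags are in a known-good state...
reward_flags = np.array([+1.0, +0.6, +0.5, +0.4])
known_good = checksum(reward_flags)

# ...and re-verify before the flags are used: a stray bit flip (or a
# wholesale sign flip) changes the hash and can trigger a halt.
reward_flags[0] = -reward_flags[0]  # simulate corruption
if checksum(reward_flags) != known_good:
    raise RuntimeError("reward parameters corrupted; halting for human review")
```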
A similar point to (3) was raised by Dach in another thread, although I’m uncertain about this since GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug. I don’t doubt that it would be different with a neuromorphic system, though.
Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever.
The whole point of the reward signals is to change the AI’s motivations; we design the system such that that will definitely happen. But a full motivation system might consist of 100,000 neocortical concepts flagged with various levels of “this concept is rewarding”, and in each processing cycle where you get subcortical feedback, maybe only one or two of those flags would get rewritten, for example. Then it would spend a while feeling torn and conflicted about lots of things, as its motivation system gets gradually turned around. I’m thinking that we can and should design AGIs such that if one feels very torn and conflicted about something, it stops and alerts the programmer; and there should be a period where that happens in this scenario.
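Here’s a minimal sketch of that picture, just to make the gradual-turnaround and stop-and-alert idea concrete. The 100,000 and one-or-two-per-cycle numbers are the made-up ones from above, and the “conflicted” test is a crude stand-in I’m inventing for illustration:

```python
import random

N_CONCEPTS = 100_000        # the number from the comment above
FLIPS_PER_CYCLE = 2         # only a couple of flags rewritten per feedback cycle
CONFLICT_THRESHOLD = 0.05   # arbitrary: call it "conflicted" once 5% of flags disagree

# Each neocortical concept carries a scalar "this concept is rewarding" flag.
flags = [random.uniform(0.1, 1.0) for _ in range(N_CONCEPTS)]

def subcortical_feedback_cycle(flags):
    """One cycle of (corrupted) subcortical feedback: rewrite only a couple
    of flags, here by flipping their sign, so the turnaround is gradual."""
    for i in random.sample(range(N_CONCEPTS), FLIPS_PER_CYCLE):
        flags[i] = -abs(flags[i])

def feels_conflicted(flags):
    """Crude stand-in for 'torn and conflicted': a non-trivial minority of
    flags now point the opposite way from the rest of the goal system."""
    frac_flipped = sum(v < 0 for v in flags) / len(flags)
    return CONFLICT_THRESHOLD < frac_flipped < 1 - CONFLICT_THRESHOLD

cycle = 0
while not feels_conflicted(flags):
    subcortical_feedback_cycle(flags)
    cycle += 1
# With these numbers it takes thousands of cycles before the check fires,
# i.e. there is a long window in which to stop and ask a human.
print(f"cycle {cycle}: motivation system conflicted; pause and alert the programmers")
```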
GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug
I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.
I see. I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.
I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.
I just raised GPT-2 to indicate that flipping the goal sign suddenly can lead to optimising for bad behaviour while still exploring new strategies. Presumably that’d suggest it’s also a possibility with cosmic-ray or other errors.
I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.
I’m not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but that seems to be quite a bit higher than what most other people here would say, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch / TensorFlow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don’t have any tractable directions for progress in that scenario (or just don’t know enough to comment), or maybe they have radically different ideas than me about how the brain works and therefore don’t distinguish between prosaic AGI and brain-inspired AGI.
Understanding brain algorithms is a research program that thousands of geniuses are working on night and day, right now, as we speak, and the conclusion of the research program is guaranteed to be AGI. That seems like a pretty good reason to put at least some weight on it! I put even more weight on it because I’ve worked a lot on trying to understand how the neocortical algorithm works, and I don’t think that the algorithm is all that complicated (cf. “cortical uniformity”), and I think that ongoing work is zeroing in on it (see here).