You seem to be approaching this with a kind of reasoning similar to something I’ve explored in a slightly different way (reducing the risk of false positives in AI alignment mechanism design). You might find it interesting.
Seems a little bit beyond me at 4:45am; I’ll probably take a look tomorrow when I’m less sleep-deprived (although I still can’t guarantee I’ll be able to make it through then; there’s quite a bit of technical language in there that makes my head spin). Are you able to provide a brief tl;dr, and have you thought much about “sign flip in reward function” or “direction of updates to reward model flipped”-type errors specifically? It seems like these particularly nasty bugs could plausibly be mitigated more easily than avoiding false positives (as you defined them in the arXiv paper’s abstract) in general.
Sleep is very important! Get regular sleep every night! Speaking from personal experience, you don’t want to have a sleep-deprivation-induced mental breakdown while thinking about Singularity stuff!
My anxieties over this stuff tend not to be so bad late at night, TBH.
Actually, I’m not sure that sign flips are easier to deal with. A sentiment I’ve heard expressed before is that it’s much easier to trim something so there’s a little more or less of it, but it’s much harder to know whether you’ve got it pointed in the right direction at all. Ultimately, though, addressing false positives ends up being about exactly these kinds of directional issues.
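To make the distinction concrete, here’s a throwaway toy (entirely my own illustration; the outcome names and numbers aren’t from any actual mechanism): a magnitude error in a reward signal still picks out the right outcome, whereas a flipped sign actively optimises for the worst one.

```python
# Toy illustration (mine, not from the paper): a magnitude error in a reward
# signal preserves which outcome an optimiser prefers, while a sign error
# reverses the preference entirely.

outcomes = {"good": 1.0, "neutral": 0.0, "bad": -1.0}

def best(reward_fn):
    """Outcome an optimiser acting on reward_fn would pick."""
    return max(outcomes, key=lambda o: reward_fn(outcomes[o]))

correct   = lambda r: r          # intended reward
too_weak  = lambda r: 0.1 * r    # "a little more or less of it"
sign_flip = lambda r: -r         # pointed in the wrong direction

print(best(correct), best(too_weak), best(sign_flip))
# -> good good bad
```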
it’s much harder to know whether you’ve got it pointed in the right direction at all

Perhaps, but the type of thing I’m describing in the post is more about preventing worse-than-death outcomes even if the sign is flipped (by designing a reward function/model in such a way that it’s not going to torture everyone if that’s the case).
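Something in the spirit of this toy sketch, maybe (again purely illustrative and my own; the scalar “value_estimate” field and the zero point standing for “nonexistence / nothing of value” are assumptions, not a real reward model):

```python
# Toy sketch (my own illustration): a reward model whose worst case under a
# sign flip is an empty / paperclipped universe rather than a tortured one.
# Assumes a hypothetical scalar "value_estimate" where 0.0 stands for
# "nonexistence / nothing of value", positive for good futures, and negative
# for worse-than-death futures.

def raw_value(outcome: dict) -> float:
    # Stand-in for a learned reward/value model.
    return outcome["value_estimate"]

def safe_reward(outcome: dict) -> float:
    # Clamp everything at or below the nonexistence level to that level, so
    # there is no reward gradient *within* the bad region.
    return max(raw_value(outcome), 0.0)

if __name__ == "__main__":
    futures = {
        "utopia":   {"value_estimate": 10.0},
        "nothing":  {"value_estimate": 0.0},
        "dystopia": {"value_estimate": -10.0},
    }
    for name, o in futures.items():
        # Print the intended reward and what a sign-flipped copy would see.
        print(f"{name:8s} reward={safe_reward(o):5.1f} flipped={-safe_reward(o):5.1f}")
    # Sign-flipped rewards: utopia -10.0, nothing 0.0, dystopia 0.0 -- the
    # flipped objective no longer prefers dystopia over an empty universe.
```

The point of the clamp is just that a sign-flipped copy of this reward is maximised as well by an empty universe as by a dystopian one, so the flipped optimiser has no specific pull toward suffering.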
This seems easier than recognising whether the sign is flipped, or than designing a system that can’t experience these sign-flip-type errors at all; I’m just unsure whether it’s something we have robust solutions for. If it turns out that someone has figured out a reliable solution to this problem, then the only real concern is whether the AI’s developers would bother to implement it. I’d much rather risk the system going wrong and paperclipping than going wrong and turning “I have no mouth, and I must scream” into a reality.