JanPro comments on Agentized LLMs will change the alignment landscape

JanPro 9 Apr 2023 11:18 UTC
34 points
23
Ad ChaosGPT:
Attempting to create (even weak) agent tasked with “destroying humanity” should be made very clear to be out of bounds of acceptable behavior. I feel that I want the author to be prosecuted.
Now the meme is: “haha we can tell AI to hurt us and make fun of how it fails”
- This would obviously backfire if the substrate were able to cause lots of damage.
What I would like the meme to be: this is extremely unethical, deserving outrage and perhaps attempted terrorism.
- awg 9 Apr 2023 15:50 UTC
  26 points
  14
  Parent
  I wonder if/when/how quickly this will be criminalized in a manner similar to terrorism or using weapons of mass destruction.
  - Prometheus 10 Apr 2023 16:33 UTC
    26 points
    29
    Parent
    If we’re being realistic, this kind of thing would only get criminalized after something bad actually happened. Until then, too many people will think “omg, it’s just a Chatbot”. Any politician calling for it would get made fun of on every Late Night show.
    - Seth Herd 10 Apr 2023 22:23 UTC
      10 points
      5
      Parent
      I’m almost certain this is already criminal, to the extent it’s actually dangerous. If you roll a boulder down the hill, you’re up for manslaughter if it kills someone, and reckless endangerment if it could’ve but didn’t hurt anyone. It doesn’t matter if it’s a boulder or software; if you should’ve known it was dangerous, you’re criminally liable.
      
      In this particular case, I have mixed feelings. This demonstration is likely to do immense good for public awareness of AGI risk. It even did for me, on an emotional level I haven’t felt before. But it’s also impossible to know when a dumb bot will come up with a really clever idea by accident, or when improvements have produced emergent intelligence. So we need to shut this down as much as possible as get to better capabilities. Of course, criminal punishments reduce bad behavior, but don’t eliminate it. So we also need to be able to detect and prevent malicious bot behavior, and keep up with prevention techniques (likely with aligned, better AGI from bigger corporations) as it gets more capable.
- Prometheus 10 Apr 2023 16:31 UTC
  5 points
  5
  Parent
  Yeah, all the questions over the years of “why would the AI want to kill us” could be answered with “because some idiot thought it would be funny to train an AI to kill everyone, and it got out of hand”. Unfortunately, stopping everyone on the internet from doing things isn’t realistic. It’s much better to never let the genie out of the bottle in the first place.
- lc 10 Apr 2023 22:39 UTC
  3 points
  1
  Parent
  
  Attempting to create (even weak) agent tasked with “destroying humanity” should be made very clear to be out of bounds of acceptable behavior. I feel that I want the author to be prosecuted.
  
  This seems like a bit much.
- NatCarlinhos 11 Apr 2023 1:27 UTC
  2 points
  −1
  Parent
  One thing that seems worth mentioning is that, based on my understanding of Alignment Theory, if some smarter version of ChaosGPT did kill all humans, it wouldn’t be because of the instructions it was given, but for the same reason any unaligned AI would kill all humans—that is, because it’s unaligned. It’s hard for me to imagine a scenario where an unaligned agent like ChaosGPT would be more likely to kill everyone than any given unaligned AI; the whole deal with the Outer Alignment Problem is that we don’t yet know how to get agents to do the things we want them to do, regardless of whether those things are benevolent or destructive or anything in between.
  Still, I agree that this sets a horrible precedent and that this sort of thing should be prosecuted in the future, if only because at some point if we do solve Alignment, an agent like ChaosGPT could be dangerous for (obvious) different reasons, unrelated to being unaligned.
  - Seth Herd 11 Apr 2023 1:35 UTC
    5 points
    6
    Parent
    I think most alignment people would use “aligned” as I am here. ChaosGPT is aligned to the intent of the person that set it in motion, (at least initially) even though it’s not aligned to the values of humanity at large. That would be bad outer alignment and bad coordination in the way I’m using those terms.
    
    And it will destroy humanity (if it gets smart enough to) for a very different reason than an unaligned AGI would. That’s it’s goal, while for an unaligned AGI it would be a subgoal or a side effect.
    
    It’s increasingly incorrect to say that we have no idea how to get an AGI to do what we want. We have no idea how to do that in closed form code instructions, but the limited success of RLHF and other training indicates that we have at least some ability to steer the behavior of deep networks. I think it’s still fair to say that we don’t have methods we can be confident of, or that are stable over time and learning. I’m nominating this approach of giving explicit goals in language as our new best shot.
- Sky Moo 11 Apr 2023 17:16 UTC
  1 point
  −2
  Parent
  I understand your emotional reaction to ChaosGPT in particular, but I actually think it’s important to keep in mind that ChaosGPT is equally as dangerous as AutoGPT when asked to make cookies, or make people smile. It really doesn’t matter what the goal is, it’s the optimization that leads to these instrumental biproducts that may lead to disaster.
  - Seth Herd 11 Apr 2023 17:54 UTC
    3 points
    0
    Parent
    Good point. It would be an even better emotional impact and intuition pump to see an agentized LLM arrive at destroying humanity as a subgoal of some other objective.
    
    Somebody put in producing paperclips as a goal to one of these; I’ve forgotten where I saw it. Maybe it was a baby AGI example? That one actually recognized the dangers and shifted to researching the alignment problem. That seemed to be the result of how the paperclip goal is linked to that issue in internet writing, and the RLHF and other ethical safeguards built into GPT4 as the core LLM. That example unfortunately sends the inaccurate opposite intuition, that these systems automatically have safeguards and ethics. They have that only when using an LLM with those things built in, and they’re still unreliable.