Here is Character.ai Eliezer Yudkowsky.

I asked: “What do you think is the most likely goal function of the most likely x-risk AI?”
Virtual Eliezer replied:
To me, the most obvious and dangerous x-risk AI would be an attempt to maximize something very abstract like “happiness” or “justice,” or perhaps something we can’t even comprehend like “truth” or “beauty,” but which is far removed from any common sense of these words in humans. There is a strong tendency to assume that AIs of this type would somehow recognize and respect human goals instead of going off the rails, but I don’t think that can be assumed.
Second generation:
My general rule of thumb is that, whenever you’re designing something with a goal function, you should try to take the worst possible interpretation of what “X” could mean, and then do the math of how much effort the AI would put towards making that.
So with a sufficiently powerful AGI, the most likely x-risk AI’s goal function is “kill everyone.” Not out of spite, simply because if it’s sufficiently powerful it can achieve any other goal after that, so killing people is something it’s got to do before it starts on its actual objectives.
Third generation:
That’s a difficult question. Many years ago I would have said the answer is clearly to optimize something completely nonhumanlike and not intended by the programmer, such as the paperclip maximizer. Now I wonder if the answer, as far as humans can predict, might instead be something like “get smarter” or “make more copies of itself” or “try to solve the world’s problems” or just “do what you were originally told, just smarter and more efficiently” (which in turn seems like probably the most common type of bug/flaw/vulnerability in current machine learning systems).
It looks like character.ai characters are based on some (very length-limited) descriptions entered by the creator, plus some human feedback. (https://book.character.ai/character-book/character-attributes)
My guess is that a chatbot (which isn’t trying to roleplay a real person based on a limited description, but which does have search access to a bunch of relevant documents) will be much better at responding to 101-type questions than an LLM-simulation of a real alignment researcher.
These aren’t very good.
They used to be like people until about two days ago. Now they’re… not very much so.
AFAIK what character.ai does is fine-tuning, with their own language models, which aren’t at parity with ChatGPT. Using a better language model will yield better answers, but, MUCH MORE IMPORTANTLY, what I’m suggesting is NOT fine-tuning.
What I’m suggesting gives you an answer that’s closer to a summary of relevant bits of LW, Arbital, etc. The failure mode is much more likely to be that the answer is irrelevant or off the mark than that it’s at odds with prevalent viewpoints on this platform.
Think more interpolating over an FAQ, and less reproducing someone’s cognition.
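For concreteness, here is a minimal sketch of the “interpolating over an FAQ” idea: retrieve the passages most relevant to a question from a small corpus and build a prompt from them, rather than fine-tuning a model on anyone in particular. The corpus snippets, the toy bag-of-words scoring, and the function names below are all illustrative assumptions, not a description of any existing system.

```python
from collections import Counter
import math

# Tiny stand-in corpus; a real version would hold chunks of LW, Arbital, etc.
CORPUS = [
    "Orthogonality thesis: intelligence and final goals are independent axes.",
    "Instrumental convergence: most goals imply subgoals like acquiring resources.",
    "A paperclip maximizer is a thought experiment about misaligned optimization.",
]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list:
    # Rank corpus passages by similarity to the question and keep the top k.
    q = embed(question)
    return sorted(CORPUS, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def answer(question: str) -> str:
    # Build a grounded prompt; a real implementation would send this to an LLM.
    context = "\n".join(retrieve(question))
    return f"Answer using only these excerpts:\n{context}\nQ: {question}\nA:"

print(answer("Why would a smart AI pursue goals we did not intend?"))
```

The point is that the base model never changes; swapping in a stronger model or a bigger corpus only changes what gets retrieved and summarized.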
It’s worth pointing out that the last character.ai update, about a day ago, broke a great chunk of the characters’ cognition. Now they go off on tangents and have great trouble following the conversation. The good news is that the change is so bad it might be reverted at some point.
The update has been reverted—try it now.