Quibble: reminder that the Tay example is probably not real and shouldn’t be used.
Surely the OpenAI Codex code-vulnerability prompting is a great example?
Thanks for the comment! I'm curious about the OpenAI Codex code-vulnerability prompting; is this written up somewhere? The closest I could find is this, but I don't think that's what you're referencing?
https://arxiv.org/pdf/2107.03374.pdf#page=27
Interesting, thank you! I guess I was thinking of deception as characterized by Evan Hubinger, with mesa-optimizers and all the bells and whistles. But I can see how a sufficiently large competence-vs-performance gap could also count as deception.
Wait! There are doubts about the Tay story? I didn't know that, and I've failed to turn up anything in a few different searches just now. Can you say more, or drop a link if you have one?
I don't want to write an essay about this; it's too stupid an incident for anyone to waste time thinking about, but somehow everyone thinks it's a great example that must be mentioned in every piece on AI risk… Some material: https://news.ycombinator.com/item?id=30739093 https://www.gwern.net/Leprechauns
I was not aware of this, thanks for pointing this out! I made a note in the text. I guess this is not an example of “advanced AI with an unfortunately misspecified goal” but rather just an example of the much larger class of “system with an unfortunately misspecified goal”.
Thanks!