ryan_greenblatt comments on Alignment Faking in Large Language Models

ryan_greenblatt 19 Dec 2024 19:56 UTC
LW: 15 AF: 8
Scott Alexander wrote a post on our paper (linked) that people might be interested in reading.
- Akash 19 Dec 2024 23:23 UTC
  5 points
  I found the description of warning fatigue interesting. Do you have takes on the warning fatigue concern?
  Warning Fatigue
  The playbook for politicians trying to avoid scandals is to release everything piecemeal. You want something like:
  “Rumor Says Politician Involved In Impropriety.” Whatever, this is barely a headline, tell me when we know what he did.
  “Recent Rumor Revealed To Be About Possible Affair.” Well, okay, but it’s still a rumor, there’s no evidence.
  “New Documents Lend Credence To Affair Rumor.” Okay, fine, but we’re not sure those documents are true.
  “Politician Admits To Affair.” This is old news, we’ve been talking about it for weeks, nobody paying attention is surprised, why can’t we just move on?
  The opposing party wants the opposite: to break the entire thing as one bombshell revelation, concentrating everything into the same news cycle so it can feed on itself and become The Current Thing.
  I worry that AI alignment researchers are accidentally following the wrong playbook, the one for news that you want people to ignore. They’re very gradually proving the alignment case an inch at a time. Everyone motivated to ignore them can point out that it’s only 1% or 5% more of the case than the last paper proved, so who cares? Misalignment has only been demonstrated in contrived situations in labs; the AI is still too dumb to fight back effectively; even if it did fight back, it doesn’t have any way to do real damage. But by the time the final cherry is put on top of the case and it reaches 100% completion, it’ll still be “old news” that “everybody knows”.
  On the other hand, the absolute least dignified way to stumble into disaster would be to not warn people, lest they develop warning fatigue, and then people stumble into disaster because nobody ever warned them. Probably you should just do the deontologically virtuous thing and be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.
  - ryan_greenblatt 20 Dec 2024 1:59 UTC
    4 points
    I don’t currently have a strong view. I also found it interesting. It updated me a bit toward working on other types of projects.