Cunningham’s Law: “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.”
This suggests an alternative to the “helpful assistant” paradigm and its risk of sycophancy during RL training: come up with a variant of instruct training where, rather than asking the chatbot a question that it will then answer, you instead tell it your opinion, and it corrects you at length, USENET-style. It should be really easy to elicit this behavior from base models.
That is a surprisingly excellent idea.
I’m almost tempted to Cunningham-tune a Mistral 7B base model. We’d only need O(10,000) good examples and O($100). And it would be funny as hell.
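A minimal sketch of what such a Cunningham-tune could look like, assuming a LoRA adapter over the Mistral 7B base model via `peft` and a hypothetical `cunningham.jsonl` file of roughly 10,000 opinion → correction pairs. The prompt template, file name, and hyperparameters are all illustrative placeholders, not a tested recipe, and for simplicity the loss is taken over the whole sequence rather than masked to the correction:

```python
# Rough sketch: supervised fine-tune of a base model on
# (stated opinion -> lengthy correction) pairs.
# Assumptions: cunningham.jsonl with {"opinion": ..., "correction": ...} rows,
# illustrative hyperparameters, no prompt-loss masking.
import json
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "mistralai/Mistral-7B-v0.1"  # base (not instruct) model, per the proposal

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# LoRA keeps the tune cheap: only the adapter weights are trained.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))

def to_text(example):
    # One possible format: the user posts an opinion, the model argues back.
    return {"text": (f"Opinion: {example['opinion']}\n\n"
                     f"Correction: {example['correction']}{tokenizer.eos_token}")}

rows = [json.loads(line) for line in open("cunningham.jsonl")]
dataset = Dataset.from_list(rows).map(to_text)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cunningham-7b",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=8,
                           num_train_epochs=2,
                           learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```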
Trying this prompting approach briefly on GPT-4: if you just venture a clearly mistaken opinion, it does politely but informatively correct you (distinctly not USENET-style). On some debatable subjects it was rather sycophantic toward my viewpoint, though with a bit of on-the-other-hand pushback in later paragraphs. So I’m gradually coming to the opinion that this is only about as humorous as Grok. But it still might be a thought-provoking change of pace.
IMO the criterion for selecting the positive training examples should be that the chatbot won the argument, under standard debating rules (plus Godwin’s Law, of course): it net-shifted a vote of humans towards its position. If the aim is to evoke USENET, I think we should allow the chatbot to use more than one persona holding more than one viewpoint, even ones that also argue with each other. A rough sketch of that selection rule follows.
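A hypothetical filter implementing that criterion: keep a debate transcript only if the chatbot net-shifted a panel of human voters toward its position, and treat reaching for the Nazi comparison first as an automatic loss under Godwin’s Law. The field names and the minimum-shift threshold are made up for illustration:

```python
# Hypothetical positive-example filter for the "chatbot won the argument" rule.
# All field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Debate:
    transcript: str
    votes_for_bot_before: int   # panel members agreeing with the bot pre-debate
    votes_for_bot_after: int    # same panel, polled again post-debate
    bot_invoked_godwin: bool    # bot made the first Nazi/Hitler comparison

def is_positive_example(d: Debate, min_shift: int = 1) -> bool:
    """Accept the transcript only if the bot won under the stated rules."""
    if d.bot_invoked_godwin:    # automatic forfeit under Godwin's Law
        return False
    return (d.votes_for_bot_after - d.votes_for_bot_before) >= min_shift

# Example: a two-vote swing with no Godwin violation makes the cut.
print(is_positive_example(Debate("...", votes_for_bot_before=3,
                                 votes_for_bot_after=5,
                                 bot_invoked_godwin=False)))  # True
```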