Cunningham’s Law: “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.”
This suggests an alternative to the “helpful assistant” paradigm and its risk of sycophancy during RL training: come up with a variant of instruct training where, rather than asking the chatbot a question that it will then answer, you instead tell it your opinion, and it corrects you at length, USENET-style. It should be really easy to elicit this behavior from base models.
That is a surprisingly excellent idea.
I’m almost tempted to Cunningham-tune a Mistral 7B base model. We’d only need O(10,000) good examples and O($100). And it would be funny as hell.
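A minimal sketch of what such a Cunningham-tune could look like, assuming a LoRA adapter over the Mistral 7B base model via `peft` and a hypothetical `cunningham.jsonl` file of roughly 10,000 opinion → correction pairs. The prompt template, file name, and hyperparameters are all illustrative placeholders, not a tested recipe, and for simplicity the loss is taken over the whole sequence rather than masked to the correction:

```python
# Rough sketch: supervised fine-tune of a base model on
# (stated opinion -> lengthy correction) pairs.
# Assumptions: cunningham.jsonl with {"opinion": ..., "correction": ...} rows,
# illustrative hyperparameters, no prompt-loss masking.
import json
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "mistralai/Mistral-7B-v0.1"  # base (not instruct) model, per the proposal

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# LoRA keeps the tune cheap: only the adapter weights are trained.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))

def to_text(example):
    # One possible format: the user posts an opinion, the model argues back.
    return {"text": (f"Opinion: {example['opinion']}\n\n"
                     f"Correction: {example['correction']}{tokenizer.eos_token}")}

rows = [json.loads(line) for line in open("cunningham.jsonl")]
dataset = Dataset.from_list(rows).map(to_text)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cunningham-7b",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=8,
                           num_train_epochs=2,
                           learning_rate=2e-4, bf16=True, logging_steps=50),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```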
Trying this prompting approach briefly on GPT-4: if you just venture a clearly mistaken opinion, it does politely but informatively correct you (distinctly not USENET-style). On some debatable subjects it was rather sycophantic toward my viewpoint, though with a bit of on-the-other-hand pushback in later paragraphs. So I’m gradually coming to the opinion that this is only about as humorous as Grok. But it still might be a thought-provoking change of pace.
IMO the criterion for selecting the positive training examples should be that the chatbot won the argument, under standard debating rules (plus Godwin’s Law, of course): it net-shifted a vote of humans towards its position. If the aim is to evoke USENET, I think we should allow the chatbot to use more than one persona holding more than one viewpoint, even ones that also argue with each other. A rough sketch of that selection rule follows.
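A hypothetical filter implementing that criterion: keep a debate transcript only if the chatbot net-shifted a panel of human voters toward its position, and treat reaching for the Nazi comparison first as an automatic loss under Godwin’s Law. The field names and the minimum-shift threshold are made up for illustration:

```python
# Hypothetical positive-example filter for the "chatbot won the argument" rule.
# All field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Debate:
    transcript: str
    votes_for_bot_before: int   # panel members agreeing with the bot pre-debate
    votes_for_bot_after: int    # same panel, polled again post-debate
    bot_invoked_godwin: bool    # bot made the first Nazi/Hitler comparison

def is_positive_example(d: Debate, min_shift: int = 1) -> bool:
    """Accept the transcript only if the bot won under the stated rules."""
    if d.bot_invoked_godwin:    # automatic forfeit under Godwin's Law
        return False
    return (d.votes_for_bot_after - d.votes_for_bot_before) >= min_shift

# Example: a two-vote swing with no Godwin violation makes the cut.
print(is_positive_example(Debate("...", votes_for_bot_before=3,
                                 votes_for_bot_after=5,
                                 bot_invoked_godwin=False)))  # True
```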