nc comments on Base LLMs refuse too

nc 30 Sep 2024 7:45 UTC
1 point
0
AF
Is there a reason to expect this kind of behaviour to appear from base models with no fine-tuning?
- Cleo Nardo 1 Oct 2024 3:20 UTC
  LW: 2 AF: 1
  0
  AF Parent
  the base model is just predicting the likely continuation of the prompt. and it’s a reasonable prediction that, when an assistant is given a harmful instruction, they will refuse. this behaviour isn’t surprising.
  - nc 1 Oct 2024 8:54 UTC
    1 point
    0
    AF Parent
    This is not an obvious continuation of the prompt to me—maybe there are just a lot more examples of explicit refusal on the internet than there are in (e.g.) real life.
    - Arthur Conmy 1 Oct 2024 16:38 UTC
      LW: 2 AF: 1
      0
      AF Parent
      My current best guess for why base models refuse so much is that “Sorry, I can’t help with that. I don’t know how to” is actually extremely common on the internet, based on discussion with Achyuta Rajaram on twitter: https://x.com/ArthurConmy/status/1840514842098106527
      This fits with our observations about how frequently LLaMA-1 performs incompetent refusal