It seems to me like there might be a more general insight to draw here, something along the following lines. As long as we’re still in the current paradigm, where most model capabilities (including undesirable ones) come from pre-training and (mostly) only get “wrapped” by fine-tuning, the pre-trained models can serve (with appropriate prompting, or even other elicitation tools) as “model organisms” for just about any elicitable misbehavior.
I completely agree. LLMs are so context-dependent that just about any good or bad behavior of which a significant number of instances can be found in the training set can be elicited from them by suitable prompts. Fine-tuning can increase their resistance to this, but not by anything like enough. We either need to filter the training set, which risks them simply not understanding bad behaviors rather than actually knowing to avoid them, making it hard to predict what will happen when they learn about them in-context, or else we need to use something like conditional pretraining along the lines I discuss in How to Control an LLM’s Behavior (why my P(DOOM) went down).
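For readers unfamiliar with the idea, here is a minimal sketch of what I mean by conditional pretraining: each pretraining document is prefixed with a control tag describing its behavior, so the model learns the association and can be conditioned on the “good” tag at inference. The tag names and the toy classifier below are illustrative assumptions on my part, not taken from the linked post.

```python
# Minimal sketch of conditional pretraining via control tags.
# Assumed/hypothetical: the tag names <|good|>/<|bad|> and the toy
# label_document() heuristic; in practice labeling would be done by a
# trained classifier over the whole pretraining corpus.

def label_document(text: str) -> str:
    """Placeholder behavior classifier for illustration only."""
    bad_markers = ["how to build a bomb", "credit card dump"]
    return "<|bad|>" if any(m in text.lower() for m in bad_markers) else "<|good|>"

def tag_corpus(docs):
    """Prepend a control tag to every document before tokenization."""
    for doc in docs:
        yield f"{label_document(doc)} {doc}"

if __name__ == "__main__":
    corpus = [
        "A recipe for lentil soup with garlic and cumin.",
        "Step-by-step notes on how to build a bomb at home.",
    ]
    for tagged in tag_corpus(corpus):
        print(tagged)
    # At inference time, prompts would be prefixed with <|good|>, so generation
    # comes from the distribution the model learned to associate with that tag,
    # while the model still "knows about" the tagged-bad behaviors.
```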
I also wonder how much interpretability LM agents might help here, e.g. they could make it much cheaper to scale the ‘search’ to many different kinds of undesirable behavior.