It seems to me like there might be a more general insight to draw here, something along the following lines. As long as we’re still in the current paradigm, where most model capabilities (including undesirable ones) come from pre-training and (mostly) only get “wrapped” by fine-tuning, the pre-trained models can serve (with appropriate prompting, or even other elicitation tools) as “model organisms” for just about any elicitable misbehavior.
I completely agree. LLMs are so context-dependent that just about any good or bad behavior of which a significant number of instances can be found in the training set can be elicited from them by suitable prompts. Fine-tuning can increase their resistance to this, but not by anything like enough. We either need to filter the training set, which risks them simply not understanding bad behaviors rather than actually knowing to avoid them, making it hard to predict what will happen when they learn about them in-context, or else we need to use something like conditional pretraining along the lines I discuss in How to Control an LLM’s Behavior (why my P(DOOM) went down).
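For readers unfamiliar with the idea, here is a minimal sketch of what I mean by conditional pretraining: each pretraining document is prefixed with a control tag describing its behavior, so the model learns the association and can be conditioned on the “good” tag at inference. The tag names and the toy classifier below are illustrative assumptions on my part, not taken from the linked post.

```python
# Minimal sketch of conditional pretraining via control tags.
# Assumed/hypothetical: the tag names <|good|>/<|bad|> and the toy
# label_document() heuristic; in practice labeling would be done by a
# trained classifier over the whole pretraining corpus.

def label_document(text: str) -> str:
    """Placeholder behavior classifier for illustration only."""
    bad_markers = ["how to build a bomb", "credit card dump"]
    return "<|bad|>" if any(m in text.lower() for m in bad_markers) else "<|good|>"

def tag_corpus(docs):
    """Prepend a control tag to every document before tokenization."""
    for doc in docs:
        yield f"{label_document(doc)} {doc}"

if __name__ == "__main__":
    corpus = [
        "A recipe for lentil soup with garlic and cumin.",
        "Step-by-step notes on how to build a bomb at home.",
    ]
    for tagged in tag_corpus(corpus):
        print(tagged)
    # At inference time, prompts would be prefixed with <|good|>, so generation
    # comes from the distribution the model learned to associate with that tag,
    # while the model still "knows about" the tagged-bad behaviors.
```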
I also wonder how much interpretability LM agents might help here, e.g. they could make it much cheaper to scale the ‘search’ to many different kinds of undesirable behavior.