I’m one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and am well aware of those findings. I’m certainly not claiming that all is well, and I agree that with current techniques models are on net exhibiting more concerning behavior as they scale up (i.e. emerging misbehaviors are more concerning than emerging alignment is reassuring). I stand by my observation that I’ve seen alignment-ish properties generalize about as well as capabilities, and that I don’t have a strong expectation that this will change in future.
I also find this summary a little misleading. Consider, for example, “the paper finds concrete evidence of current large language models exhibiting: convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), …” (italics added in both) vs:
Worryingly, RLHF also increases the model’s tendency to state a desire to pursue hypothesized “convergent instrumental subgoals” … While it is not dangerous to state instrumental subgoals, such statements suggest that models may act in accord with potentially dangerous subgoals (e.g., by influencing users or writing and executing code). Models may be especially prone to act in line with dangerous subgoals if such statements are generated as part of step-by-step reasoning or planning.
While indeed worrying, models generally seem to have weaker intrinsic connections between their stated desires and actual actions than humans do. For example, if you ask about code, models can and will discuss SQL injections (or buffer overflows, or other classic weaknesses, bugs, and vulnerabilities) and best practices to avoid them in considerable detail… while also being prone to writing them wherever a naive human might do so. Step-by-step reasoning, planning, or model cascades do provide a mechanism to convert verbal claims into actions, but I’m confident that strong supervision of such intermediates is feasible.
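To make that contrast concrete, here is a minimal sketch (using Python’s standard-library sqlite3 module; the table and function names are purely illustrative, not drawn from any particular codebase) of the naive string-interpolation pattern that produces a SQL injection, next to the parameterized-query best practice the same model can readily explain:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

def find_user_naive(name: str):
    # Vulnerable: user input is interpolated directly into the SQL string,
    # so a crafted name like "' OR '1'='1" changes the query's meaning.
    query = f"SELECT * FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(name: str):
    # Best practice: a parameterized query keeps the input as data,
    # never as SQL syntax.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

# The injected input returns every row via the naive query...
print(find_user_naive("' OR '1'='1"))
# ...but matches nothing when passed as a bound parameter.
print(find_user_safe("' OR '1'='1"))
```

A model (or a naive human) can state the rule in the second function perfectly well and still emit the first pattern when churning out code, which is the stated-desire/actual-action gap I have in mind.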
I’m not sure whether you have a specific key relaxation in mind (and if so, what it is), or whether you think any particular safety assumption is pretty likely to be violated?
The key relaxation here is that deceptive alignment will not happen. In many ways, a lot of hope is resting on deceptive alignment not being a problem.
I disagree that alignment-ish properties will keep generalizing as well as capabilities, since I think the non-myopia found is a key way that something like goal misgeneralization or the sharp left turn could happen, where a model remains very capable but loses its alignment properties due to deceptive alignment.