Aaron_Scher comments on [Interim research report] Evaluating the Goal-Directedness of Language Models

Aaron_Scher 30 Jul 2024 0:28 UTC
3 points
0
Yep, I basically agree with those being the cruxes!
On how much of the goal reasoning is happening out loud: Nowadays, I think about a lot of AI safety research as being aimed at an AI Control scenario where we are closely supervising what models are thinking about in CoT. In that scenario, malign goal seeking must happen either in individual forward passes (and be translated to look harmless most of the time), in an obfuscated way in CoT, or in rogue deployments outside the control scheme.
Studying the intrinsic goals of the models in naturalistic settings would also be interesting, but I expect this to be hard, and also less relevant to the threat models we’re worried about. I’m mainly concerned about models becoming more goal-directed with scale and someone figuring out long-horizon RL training on top of that, leading to models that are qualitatively different from current ones.
By naturalistic, I mean “from a realistic training process, even if that training is designed to create goals”. That sounds like what you said is the main threat model you’re worried about? If you have the time, I would push you harder on this: what is a specific story of AI catastrophe that you are trying to study/prevent?
- Elizabeth Donoway 30 Jul 2024 22:52 UTC
  3 points
  0
  I really appreciate the thoughtful replies and feedback—I also didn’t read any of your comments as rude or mean! I’d like to clarify a few points about our approach and its relevance:
  We expect that good CoT reasoning helps current models better pursue and achieve goals. In the future, all of the reasoning needed to do this effectively might happen in the forward pass. However, we think it’s likely that AGI could be achieved with models similar to current SOTA models, where sophisticated reasoning for goal pursuit will still need to happen in CoT. Even if this bet is wrong and models can do this reasoning entirely in the forward pass, our evals should still reveal goal-directed behavior because we assess actions the model takes that are instrumental to achieving the goal (not just the sensibility of the reasoning traces).
  Regarding AI control scenarios, we understand that with good oversight methods, deceptive goal-pursuit would need to happen either in the forward pass or through obfuscated reasoning in CoT. While our scoring of reasoning traces may become less relevant as models improve at internal reasoning, our other metrics focus on how well the model carries out steps and decides on instrumental actions that lead to goal achievement. We believe this serves as a good proxy for goal-directedness, even if we can’t directly observe how intrinsic goals arise or manifest from the training process.
  A potential risk scenario we’re considering is one where a highly capable model is developed that doesn’t show obvious signs of misalignment in standard tests (e.g. typical use or general reasoning benchmarks) but has strong goal-pursuit capabilities. Without evaluations that specifically probe goal-directedness, such a model could potentially develop deceptive goals while becoming adept at hiding them. Our evals aim to surface whether models have the ability to perform the necessary reasoning to competently pursue goals, by measuring their capability to carry out the steps needed to achieve those goals and the circumstances under which this behavior is impacted.
  We remain agnostic about what sort of training results in these capabilities, and we think it should be possible to get a good idea of goal-directedness by assessing aspects of how well models can carry out goals. I would imagine that a capable, goal-directed model with intrinsic goals would also be able to be goal-directed towards goals inscribed through prompting, since we train models to follow instructions. If this expectation is wrong, it would be a major limitation of our evals.
  Our approach differs from typical capability evaluations in that we try to determine the utility of models’ actions towards goal fulfillment, situations or conditions under which models fail, the time horizons over which they can act and plan, and their corrigibility under adverse circumstances or when presented with conflicting goals. This allows us to assess aspects of goal-directedness that are distinct from general reasoning capabilities. I know it’s not touched on in the post, since we hadn’t started implementing these tasks when we wrote this, but we’re currently developing long-horizon tasks to better assess these aspects of goal-directedness, and expect to have preliminary results in a few weeks!