To check I understand this: is another way of saying it that, in the scaling experiments, there’s effectively a confounding variable, which is model performance on the task?

- Improving zero-shot performance decreases deletion-CoT-faithfulness.
- The model will get the answer right with no CoT, so adding a prefix of correct reasoning is unlikely to change the output, hence a decrease in measured faithfulness.
- Model scale is correlated with model performance.

So the scaling experiments show that model scale is correlated with lower faithfulness, but probably via the correlation with model performance.
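To make the confounding concrete, here is a minimal sketch of a deletion-style faithfulness measure as I read it from this discussion (my own toy formulation, not the paper's exact implementation): faithfulness is the fraction of questions whose answer changes when the chain of thought is removed. All data below is made up for illustration.

```python
# Toy deletion-based CoT faithfulness: the fraction of examples whose final
# answer changes when the chain of thought is deleted. (A sketch of the idea
# under discussion, not the original experiment's code.)

def deletion_faithfulness(answers_with_cot, answers_without_cot):
    """Fraction of examples whose answer changes when the CoT is deleted."""
    assert len(answers_with_cot) == len(answers_without_cot)
    changed = sum(a != b for a, b in zip(answers_with_cot, answers_without_cot))
    return changed / len(answers_with_cot)

# A strong model answers correctly with or without the CoT, so few answers
# change and measured faithfulness comes out low -- the confounding at issue.
strong = deletion_faithfulness(["A", "B", "C", "D"], ["A", "B", "C", "D"])  # 0.0
weak = deletion_faithfulness(["A", "B", "C", "D"], ["A", "C", "D", "B"])    # 0.75
```

The point the metric makes visible: the "strong" model's score of 0.0 reflects its zero-shot competence, not necessarily anything about whether its CoT was load-bearing.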
If you had a way of measuring faithfulness conditioned on a given performance level, for a given model scale, then you could measure how scaling up changes faithfulness. Maybe, for a given model size, you could plot performance vs. faithfulness (with each dataset being a point on this plot), measure the correlation for that plot, and then use that as a performance-conditioned faithfulness metric? Or there’s likely some causally correct way of measuring the correlation between model scale and faithfulness while removing the confounder of model performance.
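The per-model-size correlation suggested above could be computed as a plain Pearson correlation across datasets. A minimal sketch with made-up (accuracy, faithfulness) points for one hypothetical model size:

```python
import numpy as np

# Hypothetical per-dataset (accuracy, faithfulness) points for ONE model size;
# each dataset contributes one point on the performance-vs-faithfulness plot.
accuracy     = np.array([0.55, 0.65, 0.75, 0.85, 0.95])
faithfulness = np.array([0.60, 0.52, 0.41, 0.30, 0.18])

# Pearson correlation across datasets: a single performance-conditioned
# faithfulness summary for this model size. Repeating this per model size
# gives a scaling trend with the performance confounder (partially) factored in.
r = np.corrcoef(accuracy, faithfulness)[0, 1]
```

Comparing `r` (or the slope of a fitted line) across model sizes would then track how faithfulness changes with scale at matched performance, rather than raw faithfulness alone.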
Yep, that sounds right! The measure we’re using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you’re focused on scaling trends.
If you have checkpoints from different points in training of the same models, you could do a comparison between different-size models at the same loss value (performance). That way, you’re actually measuring the effect of scale alone, rather than scale confounded by performance.
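The checkpoint comparison above can be sketched very simply: for each model size, pick the training checkpoint whose loss is closest to a shared target, then compare faithfulness at that matched loss. All numbers here are invented for illustration.

```python
# Sketch of comparing different-size models at matched loss via training
# checkpoints. Checkpoints are (step, loss, faithfulness) tuples; the numbers
# are made up for illustration.

def checkpoint_at_loss(checkpoints, target_loss):
    """Pick the checkpoint whose loss is closest to target_loss."""
    return min(checkpoints, key=lambda c: abs(c[1] - target_loss))

small_model = [(1000, 3.2, 0.61), (2000, 2.9, 0.55), (4000, 2.6, 0.48)]
large_model = [(500, 3.0, 0.58), (1000, 2.6, 0.42), (2000, 2.3, 0.31)]

target = 2.6  # loss value both runs pass through
s = checkpoint_at_loss(small_model, target)
l = checkpoint_at_loss(large_model, target)

# With loss held fixed, the faithfulness gap reflects scale alone,
# not scale confounded by performance.
scale_effect = l[2] - s[2]
```

In practice you would interpolate between checkpoints rather than snap to the nearest one, but the matched-loss comparison is the core of the idea.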
That makes sense, though what’s at stake with that question? In almost every safety-relevant context I can think of, ‘scale’ is just used as a proxy for ‘the best loss I can realistically achieve in a training run’, rather than as something we care about directly.