The post’s claim that validation-only approaches are fundamentally better than training-with-validation oversimplifies a complex reality. Both approaches modify the distribution of models—neither preserves some “pure” average case. Our base training objective may already have some correlation with our validation signal, and there’s nothing special about maintaining this arbitrary starting point. Sometimes we should increase correlation between training and validation, sometimes decrease it, depending on the specific relationship between our objective and validator. What matters is understanding how correlation affects both P(aligned) and P(pass|misaligned), weighing the tradeoffs, and optimizing within our practical retraining budget (because often, increasing P(aligned|pass) will also decrease P(pass)).
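To make the tradeoff concrete, here is a minimal Bayes-rule sketch in Python (the probabilities are made-up illustrative numbers, not anything from the post): it computes P(aligned | pass) and the expected number of training runs per accepted model, 1/P(pass), under a weakly-correlated versus a strongly-correlated training/validation setup.

```python
# Minimal numeric sketch (hypothetical probabilities, purely illustrative) of the
# tradeoff above: raising P(aligned | pass) can lower P(pass), which raises the
# expected number of retraining attempts per accepted model.

def posterior_and_cost(p_aligned, p_pass_given_aligned, p_pass_given_misaligned):
    """Return P(aligned | pass) and the expected number of training runs
    needed before one passes validation (1 / P(pass))."""
    p_pass = (p_aligned * p_pass_given_aligned
              + (1 - p_aligned) * p_pass_given_misaligned)
    p_aligned_given_pass = p_aligned * p_pass_given_aligned / p_pass
    expected_runs = 1 / p_pass
    return p_aligned_given_pass, expected_runs

# Scenario A: training objective weakly correlated with the validator.
# Misaligned models rarely pass by accident, so "pass" is informative,
# but the overall pass rate is low and retraining is expensive.
print(posterior_and_cost(p_aligned=0.3,
                         p_pass_given_aligned=0.6,
                         p_pass_given_misaligned=0.05))
# -> roughly (0.84, 4.7): high posterior, ~5 runs per accepted model.

# Scenario B: training correlated with the validation signal.
# P(aligned) and P(pass) both rise, but misaligned models also tend to pass,
# so the posterior given "pass" drops.
print(posterior_and_cost(p_aligned=0.4,
                         p_pass_given_aligned=0.95,
                         p_pass_given_misaligned=0.5))
# -> roughly (0.56, 1.5): cheap to get a passing model, but "pass" means less.
```

Whether A or B is preferable depends on the retraining budget and on how much weight "pass" needs to carry, which is the point: neither direction of correlation is better in general.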
Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.