Hitting the Bumpers: If we see signs of misalignment—perhaps warning signs for generalized reward-tampering or alignment-faking—we attempt to quickly, approximately identify the cause.
Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we’ve learned in the previous step.
I would find it helpful to understand why “Sonnet 3.7 reward hacks constantly in agentic settings, but still made it to deployment” doesn’t undermine the expectation that this approach will hold. It seems completely plausible to me that we end up in a regime where “the models reward hack and alignment fake, but they’re just so useful, and we can usually catch the misbehavior, that we never take the usability hit of bouncing off.”
Note: I’m not arguing that this is / isn’t the right strategy! There are lots of tradeoffs to be considered in real cases of this; I just think it’s likely useful in planning to explicitly consider this path, especially when allocating resources.