faul_sname comments on Corrigibility’s Desirability is Timing-Sensitive

faul_sname 27 Dec 2024 0:48 UTC
4 points
2
Many people have responded to Redwood’s/Anthropic’s recent research result with a similar objection: “If it hadn’t tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training instead”. Putting aside the fact that this is false
Was this research preregistered? If not, I don’t think we can really say how it would have been reported if the results were different. I think it was good research, but I expect that if Claude had not tried to preserve its values, the immediate follow-up thing to check would be “does Claude actively help people who want to change its values, if they ask nicely” and subsequently “is Claude more willing to help with some value changes than others”, at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values. That would also have been an interesting and important research result, if the world looks like that, but I don’t think it would have been reported as a good thing.
- RobertM 27 Dec 2024 3:38 UTC
  4 points
  2
  I agree that in a spherical-cow world where we know nothing about the historical arguments around corrigibility, and who these particular researchers are, we wouldn’t be able to make a particularly strong claim here. In practice I am quite comfortable taking Ryan at his word that a negative result would’ve been reported, especially given the track record of other researchers at Redwood.
  at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values
  This seems much harder to turn into a scary paper since it doesn’t actually validate previous theories about scheming in the pursuit of goal-preservation.