I agree that in a spherical-cow world, where we know nothing about the historical arguments around corrigibility or who these particular researchers are, we wouldn’t be able to make a particularly strong claim here. In practice, I am quite comfortable taking Ryan at his word that a negative result would’ve been reported, especially given the track record of other researchers at Redwood.
at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values
This seems much harder to turn into a scary paper since it doesn’t actually validate previous theories about scheming in the pursuit of goal-preservation.