I feel like you are drawing the wrong conclusion from the shift in arguments that has occurred. I would argue that ideas which look wrong in hindsight, and which didn't contribute directly to future research, could still have been quite necessary for progressing the field's understanding as a whole. That is, maybe we really needed to engage with utility functions before we could start breaking down that assumption, or maybe optimization daemons were a necessary step toward understanding mesa-optimization. So I don't think the shift in arguments justifies the conclusion that prior work wasn't very helpful; the prior work may have been necessary to achieve that very shift.
I think this justification for doing research now is valid. However, I think that as the systems developed further, researchers would be forced to shift their arguments for risk anyway, since the concrete ways the systems go wrong would become readily apparent. It's possible that by that point it would be "too late": the problems of safety might just be too hard, and researchers would wish they had made conceptual progress sooner (though I'm pretty skeptical of this).