An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.
An additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.
That’s an interesting point.