Yup, agreed. Understanding and successfully applying these concepts are necessary for one path to safety, but not sufficient. Even a predictive model with zero instrumentality and no misaligned internal mesaoptimizers could still yield oopsies in relatively few steps.
I view it as an attempt to build a foundation: the ideal predictive model isn't actively adversarial, it isn't obscuring the meaning of its weights (since doing so would be instrumental to some other goal), and so on. Something like this seems necessary for non-godzilla interpretability to work, and it at least admits the possibility that we could find some use that doesn't naturally drift into an amplified version of "I have been a good bing" or whatever else. I'm not super optimistic about finding a version of this path that's also resistant to the "and then some company takes off the safeties three weeks later" problem, but at least I can't yet say it's impossible!