I’ll note that you’re a MIRI research associate, so I wouldn’t have automatically assumed your work is representative of what Eliezer is criticizing.
There is ample discussion of distribution shifts (“seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set”) by other people. Random examples: Christiano, Shah, DeepMind.
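(To make the concern concrete, here is a minimal, purely illustrative sketch, using synthetic data and a spurious-correlation setup I made up for this comment, of a model that scores well on a validation set drawn from the training distribution but degrades once the test distribution shifts. It is not meant to represent anyone’s actual proposal, just the generic failure mode quoted above.)

```python
# Illustrative only: a classifier that relies on a spurious feature looks fine
# on an in-distribution validation set but degrades on a shifted test set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # The label depends only on `cause`; `spurious` tracks `cause` closely at
    # training time, but its relationship breaks when `shift` is applied.
    cause = rng.normal(size=n)
    spurious = cause + rng.normal(scale=0.1 + shift, size=n) + shift
    X = np.stack([cause, spurious], axis=1)
    y = (cause > 0).astype(int)
    return X, y

X_train, y_train = make_data(5000)
X_val, y_val = make_data(1000)             # same distribution as training
X_test, y_test = make_data(1000, shift=3)  # shifted distribution

clf = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))    # high
print("test accuracy:      ", clf.score(X_test, y_test))  # typically much lower
```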
Maybe Eliezer is talking specifically about the context of transparency. Personally, I haven’t worked much on transparency because IMO (i) even if we solve transparency perfectly but don’t solve actual alignment, we are still dead, (ii) if we solve actual alignment without transparency, then theoretically we might succeed (although in practice it would sure help a lot to have transparency, to catch errors in time), and (iii) the reasons to think transparency must be robustly solvable are weaker than the reasons to think alignment must be robustly solvable.
In any case, I really don’t understand why Eliezer thinks the rest of the AI safety community is unaware of the kind of attack vectors he describes.
The journal paper that Stuart Armstrong coauthored on “interruptibility” is a far step down from Armstrong’s other work on corrigibility. It had to be dumbed way down (I’m counting obscuration with fancy equations and math results as “dumbing down”) to be published in a mainstream journal.
I agree that currently publishing in mainstream venues seems to require dumbing down, but IMO we should proceed by publishing dumbed-down versions in the mainstream plus smarted-up versions/commentary in our own venues. And not all of AI safety is focused on publishing in mainstream venues? There is plenty of material on the Alignment Forum, on various blogs, etc.
Overall, I actually agree that lots of work by the AI safety community is unimpressive (tbh I wish MIRI would lead by example instead of going stealth-mode, but maybe I don’t understand the considerations). What I’m confused by is the particular example in the OP. I also dunno about “fancy equations and math results”; I feel like the field would benefit from getting a lot more mathy (ofc in meaningful ways, rather than just using mathematical notation as decoration).