I don’t know Eliezer’s view on this — presumably he either disagrees that the example he gave is “mundane AI safety stuff”, or he disagrees that “mundane AI safety stuff” is widespread? I’ll note that you’re a MIRI research associate, so I wouldn’t have auto-assumed your stuff is representative of the stuff Eliezer is criticizing.
Safely Interruptible Agents is an example Eliezer’s given in the past of work that isn’t “real” (back in 2017):
[...]
It seems to me that I’ve watched organizations like OpenPhil try to sponsor academics to work on AI alignment, and it seems to me that they just can’t produce what I’d consider to be real work. The journal paper that Stuart Armstrong coauthored on “interruptibility” is a far step down from Armstrong’s other work on corrigibility. It had to be dumbed way down (I’m counting obscuration with fancy equations and math results as “dumbing down”) to be published in a mainstream journal. It had to be stripped of all the caveats and any mention of explicit incompleteness, which is necessary meta-information for any ongoing incremental progress, not to mention important from a safety standpoint. The root cause can be debated but the observable seems plain. If you want to get real work done, the obvious strategy would be to not subject yourself to any academic incentives or bureaucratic processes. Particularly including peer review by non-”hobbyists” (peer commentary by fellow “hobbyists” still being potentially very valuable), or review by grant committees staffed by the sort of people who are still impressed by academic sage-costuming and will want you to compete against pointlessly obscured but terribly serious-looking equations.
There is ample discussion of distribution shifts (“seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set”) by other people. Random examples: Christiano, Shah, DeepMind.
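As a toy illustration of that failure mode (my own sketch, not taken from any of the linked posts; all names and numbers here are made up for the example): a model that leans on a spurious feature can ace a validation set drawn from the training distribution and still fail badly once the correlation flips at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, spurious_sign):
    y = rng.integers(0, 2, n)                                # binary labels
    x1 = (2 * y - 1) + rng.normal(0, 2.0, n)                 # weak true signal
    x2 = spurious_sign * (2 * y - 1) + rng.normal(0, 0.5, n) # strong shortcut
    return np.column_stack([x1, x2]), 2 * y - 1              # labels in {-1, +1}

X_tr, y_tr = make_data(2000, +1)  # training distribution
X_va, y_va = make_data(2000, +1)  # validation: same distribution, looks fine
X_te, y_te = make_data(2000, -1)  # test: shortcut correlation flipped

# Least-squares linear classifier: predict sign(X @ w).
# It mostly learns the shortcut x2, because x2 is the cleaner feature.
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
acc = lambda X, y: np.mean(np.sign(X @ w) == y)
print(f"val acc:  {acc(X_va, y_va):.2f}")  # high: shortcut works here
print(f"test acc: {acc(X_te, y_te):.2f}")  # low: shortcut betrays us under shift
```

The validation set gives no warning at all, which is exactly the point: the gap only shows up under the shifted test distribution.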
Maybe Eliezer is talking specifically about the context of transparency. Personally, I haven’t worked much on transparency because IMO (i) even if we solve transparency perfectly but don’t solve actual alignment, we are still dead, (ii) if we solve actual alignment without transparency, then theoretically we might succeed (although in practice transparency would sure help a lot for catching errors in time), and (iii) the reasons to think transparency must be robustly solvable are weaker than the reasons to think alignment must be robustly solvable.
In any case, I really don’t understand why Eliezer thinks the rest of the AI safety community is unaware of the types of attack vectors he describes.
The journal paper that Stuart Armstrong coauthored on “interruptibility” is a far step down from Armstrong’s other work on corrigibility. It had to be dumbed way down (I’m counting obscuration with fancy equations and math results as “dumbing down”) to be published in a mainstream journal.
I agree that currently publishing in mainstream venues seems to require dumbing down, but IMO we should proceed by publishing dumbed-down versions in mainstream venues plus smarted-up versions/commentary in our own venues. And not all of AI safety is focused on publishing in mainstream venues anyway: there is plenty of stuff on the Alignment Forum, on various blogs, etc.
Overall I actually agree that lots of work by the AI safety community is unimpressive (tbh I wish MIRI would lead by example instead of going stealth-mode, but maybe I don’t understand the considerations). What I’m confused by is the particular example in the OP. I also dunno about “fancy equations and math results”; I feel like the field would benefit from getting a lot more mathy (ofc in meaningful ways, rather than just using mathematical notation as decoration).
The rest of Intellectual Progress Inside and Outside Academia may be useful context for the 2017 quote above. Or maybe that is also not a representative example of the stuff EY has in mind in the OP conversation?