I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/understandability/etc. in the first AGI systems. Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
E.g., if you think alignment research is mostly about testing outer reward functions to see what first-order behavior they produce in non-AGI systems, rather than about ‘looking in the learned model’s brain’ to spot mesa-optimization and analyze what that optimization is ultimately ‘trying to do’ (or whatever), then you probably won’t produce stuff that MIRI’s excited about regardless of how experimental vs. theoretical your work is.
(In which case, maybe this is not actually a crux for the usefulness of most alignment experiments, and is instead a crux for the usefulness of most alignment research in general.)