This post engages substantively and clearly with IMO the first or second most important thing we could be accomplishing on LW: making better estimates of how difficult alignment will be.
It analyzes how people who know a good deal about alignment theory could say something like "AI is easy to control" in good faith, and why that claim is wrong in both senses.
What Belrose and Pope were mostly arguing, without being explicit about it, is that current AI is easy to control; they then extrapolate from there, essentially assuming we won't make any changes to AI in the future that might dramatically change this situation.
This post addresses that point and several more subtle ones, clarifying the discussion.