This seems right to me, and the essay could probably benefit from saying something about what counts as self-improvement in the relevant sense. I think the answer is probably something like “improvements that could plausibly lead to unplanned changes in the model’s goals (final or sub).” It’s hard to know exactly what those are. I agree it’s less likely that simply increasing processor speed a bit would do it (though Bostrom argues that big speed increases might). At any rate, it seems to me that whatever the set includes, it will be symmetric as between human-produced and AI-produced improvements to AI. So for the important improvements—the ones risking misalignment—the arguments should remain symmetrical.