(Edit: others have made this point already, but anyhow)
My main objection to this angle: self-improvements do not necessarily look like “design a successor AI to be in charge”. They can look more like “acquire better world models”, “spin up more copies”, “build better processors”, “train lots of narrow AI to act as fingers”, etc.
I don’t expect an AI mind to have trouble finding lots of pathways like these (that tractably improve abilities without risking a misalignment catastrophe) that take it well above human level, given the chance.
I think my response to this is similar to my response to Wei Dai above: I agree that certain kinds of improvements generate less risk of misalignment, but it’s hard to be certain. It seems like those paths are (1) less likely to produce transformational improvements in capabilities than other, more aggressive changes and (2) not the kinds of changes we usually worry about in the arguments for human-AI risk, such that the risks remain largely symmetric. But maybe I’m missing something here!