This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes “sufficiently complex”. I’m not sure if humans perform sufficiently complex self-modifications, I think our first AGIs might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.
[The agent] is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about “OK, I judge this self-modification to be fine” when the self-modifications are sufficiently complex, if this judgment isn’t based on something like zero-shot reasoning (in which case we’d have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
I agree this seems like a crux for me as well, subject to the caveat that I think we have different ideas of what “self-modification” is (though I’m not sure it matters that much).
Both of the comments feel to me like you’re making the AI system way dumber than humans, and I don’t understand why I should expect that. I think I could make a better human with high confidence/robustness if you give me a human-modification-tool that I understand reasonably well and I’m allowed to try and test things before committing to the better human.