However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification). It seems likely that it could end up in a situation like some of the Cake or Death problems, where it views a change to itself as impacting only part of its future behavior (like affecting actions but not values, such that it suspects that a future version of itself that took path A would be disappointed in itself and fix that bug, without realizing that the change it’s making will cause its future self not to be disappointed by path A), or is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes “sufficiently complex”. I’m not sure whether humans perform sufficiently complex self-modifications; I think our first AGIs might, and I think AGIs undergoing a fast takeoff are most likely doing so.
is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about “OK, I judge this self-modification to be fine” when the self-modifications are sufficiently complex, if this judgment isn’t based on something like zero-shot reasoning (in which case we’d have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
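As a minimal illustrative sketch of why this compounding worry bites, suppose (purely as an assumption, not anything claimed above) that the agent approves any change it judges at least 99% safe and that these judgments are independent; then the probability that a long run of approved changes contains no disaster shrinks multiplicatively:

```python
# Minimal sketch with hypothetical numbers: each approved self-modification
# is assumed 99% safe, and the safety of each change is treated as
# independent. The chance that *every* change in a long sequence is safe
# decays multiplicatively with the number of changes.
per_change_safety = 0.99  # assumed confidence threshold, purely illustrative

for num_changes in (10, 100, 500):
    p_no_disaster = per_change_safety ** num_changes
    print(f"{num_changes:3d} changes: P(no disaster) = {p_no_disaster:.3f}")

# Output:
#  10 changes: P(no disaster) = 0.904
# 100 changes: P(no disaster) = 0.366
# 500 changes: P(no disaster) = 0.007
```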
I agree this seems like a crux for me as well, subject to the caveat that I think we have different ideas of what “self-modification” is (though I’m not sure it matters that much).
Both of these comments feel to me like you’re making the AI system way dumber than humans, and I don’t understand why I should expect that. I think I could make a better human with high confidence/robustness if you gave me a human-modification tool that I understood reasonably well and let me try and test things before committing to the better human.