The thing I’m worried about is fixing only one of them
Right, I’m arguing that if you only fixed one of them, you would notice _immediately_, and either revert to the version with both bugs, or find the other bug. I’m also claiming that this should be what happens in general, assuming sufficient caution around self-modification (on the AI’s part, not the researcher’s part).
I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?
I don’t think of gradient descent as self-modification. If an AI system were able to choose (i.e. learned a policy for) when to run gradient descent on itself, and what training data it should use for that, that might be self-modification. Meta-learning feels similar to me—the AI doesn’t get to choose training data or what to run gradient descent on. The only learned part in current meta-learning approaches is how to perform a task, not how to learn.
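To make that line concrete, here is a toy sketch. Everything in it is an illustrative assumption (the toy model, the function names, and especially the hypothetical `policy.should_update` / `policy.select_data` hooks); the point is just to locate where ordinary gradient descent would shade into self-modification on this view.

```python
# Toy sketch of the distinction above. Purely hypothetical: a real system
# wouldn't look like this, but it shows where "gradient descent" would
# start to shade into "self-modification".
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def researcher_controlled_training(dataset):
    """Standard training (and standard meta-learning): the researcher fixes
    the schedule and the data. The parameters change, but the system never
    chooses to change them."""
    for x, y in dataset:
        loss = (model(x) - y).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def borderline_self_modification(policy, data_pool):
    """Hypothetical variant: a learned policy decides *whether* to run an
    update and *which* data to use. This is roughly the point at which, on
    the view above, gradient descent starts to count as self-modification."""
    for batch in data_pool:
        if policy.should_update(model, batch):       # learned choice to modify itself
            x, y = policy.select_data(data_pool)     # learned choice of training data
            loss = (model(x) - y).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```

On this framing, current meta-learning still lives entirely in the first function: the learned part is how to perform the task, not control over when or on what the updates happen.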
I don’t expect this problem to be solved well if researchers aren’t thinking much about self-modification (and thus how to solve it well).
This might be all of our disagreement actually. Like you, I’m quite pessimistic about any system where researchers put in place a protocol for self-modification, which seems to be what you are imagining. Either the protocol is too lax and we get the sort of issues you’re talking about, or it’s too strict and self-modification never happens. However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
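To gesture at what that kind of caution might look like mechanically, here is a toy illustration, not a proposal; the `run` and `endorses` hooks and the overall structure are assumptions I'm making up for the sketch.

```python
# Toy illustration of the cautious pattern described above: test a candidate
# self-modification on a sandboxed copy before committing. All hooks here
# (run, endorses) are hypothetical stand-ins.
import copy

def cautious_self_modify(agent, modification, regression_suite, value_probes):
    candidate = copy.deepcopy(agent)   # never experiment on the live agent
    modification(candidate)            # mutate the sandboxed copy only

    # The copy must still be able to do its job (including noticing and fixing bugs)...
    still_competent = all(check(candidate.run(task)) for task, check in regression_suite)
    # ...and its values must not have drifted relative to the current self.
    values_unchanged = all(candidate.endorses(p) == agent.endorses(p) for p in value_probes)

    if still_competent and values_unchanged:
        modification(agent)            # only now apply the change for real
        return True
    return False                       # otherwise keep the current self as-is
```

The point isn't this particular check; it's that the caution comes from the agent's own reasoning rather than from a researcher-imposed protocol.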
I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don’t have enough transparency into the consequences.
But in that case, you would simply choose not to use it, or to do a lot of research before trying to use it.
However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoning AI to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification). It seems likely that it could end up in a situation like some of the Cake or Death problems, where it views a change to itself as affecting only part of its future behavior (say, affecting actions but not values, such that it suspects a future version of itself that took path A would be disappointed in itself and fix that bug, without realizing that the very change it’s making will cause that future self to not be disappointed by path A), or is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoning AI to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes “sufficiently complex”. I’m not sure whether humans perform sufficiently complex self-modifications; I think our first AGIs might perform sufficiently complex self-modifications; and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.
is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about “OK, I judge this self-modification to be fine” when the self-modifications are sufficiently complex, if this judgment isn’t based on something like zero-shot reasoning (in which case we’d have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
I agree this seems like a crux for me as well, subject to the caveat that I think we have different ideas of what “self-modification” is (though I’m not sure it matters that much).
Both of these comments feel to me like you’re making the AI system way dumber than humans, and I don’t understand why I should expect that. I think I could make a better human with high confidence/robustness if you gave me a human-modification tool that I understood reasonably well and let me try and test things before committing to the better human.