you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see this myself).
My impression is that most of these ‘serious bugs’ are something like “oops, our gradient descent is actually gradient ascent, but it worked out alright because our utility function is also the negative of what it should be” which is not particularly heartening.
While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and to evaluate the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.)
Even for changes to how it performs or evaluates self-modification? Eurisko comes to mind as a program that could and did give itself cancer, requiring its programmer to notice that it had died and restart it, and it is the sort of thing that AI programmers would do by default.
My impression is that most of these ‘serious bugs’ are something like “oops, our gradient descent is actually gradient ascent, but it worked out alright because our utility function is also the negative of what it should be” which is not particularly heartening.
Even if this were true I would not update much. If you actually had only one of those and not the other, you would notice _really fast_, so it’s not going to harm you.
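To make the sign-cancellation point concrete, here is a minimal Python sketch. The objective, the learning rate, and the names `train`, `flip_update`, and `flip_objective` are all made up for illustration and are not taken from OpenAI Five or any real codebase. With both sign bugs present the optimizer behaves exactly like the correct one; with only one of them, the parameter runs away from the optimum within a handful of steps, which is the sort of failure you would see right away.

```python
def intended_utility(x):
    """The utility we actually want to maximize; its optimum is at x = 3."""
    return -(x - 3.0) ** 2


def train(flip_update=False, flip_objective=False, steps=200, lr=0.1):
    """Run gradient ascent on intended_utility, optionally with sign bugs.

    flip_update:    hypothetical bug 1, the update steps downhill instead of uphill.
    flip_objective: hypothetical bug 2, the code differentiates the negated utility.
    """
    update_sign = -1.0 if flip_update else 1.0
    objective_sign = -1.0 if flip_objective else 1.0
    x = 0.0
    for _ in range(steps):
        # Gradient of the (possibly negated) utility at the current x.
        grad = objective_sign * (-2.0 * (x - 3.0))
        x += update_sign * lr * grad
    return x


if __name__ == "__main__":
    for flags in [(False, False), (True, True), (True, False), (False, True)]:
        x = train(*flags)
        print(flags, x, intended_utility(x))
```

In this toy setting the two bugs multiply to a positive sign, so the "both bugs" run is step-for-step identical to the correct one, while either bug alone makes the intended utility collapse.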
The bugs I’m imagining are more like “we did a bunch of math and got out an equation, but missed a minus sign in one of the terms that’s usually quite small, resulting in a small error in the value calculated, making learning less efficient”. OpenAI had a bug where their bots would get a negative reward for reaching level 25. If you introduce these kinds of bugs with a change, you’ll notice less efficient learning, and hopefully correct it, and it only leads to degraded performance, not catastrophic outcomes.
Even for changes to how it performs or evaluates self-modification? Eurisko comes to mind as a program that could and did give itself cancer, requiring its programmer to notice that it had died and restart it, and it is the sort of thing that AI programmers would do by default.
I agree that you want to be extra careful with self-modification, but there are lots of easy steps you can take to in fact be extra careful, e.g. creating a copy of yourself with the modification and seeing what it tends to do on a suite of problems where you expect the modification to be helpful/harmful.
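The copy-and-test step described above could look something like the following Python sketch. Everything here is hypothetical: the toy `Agent`, the `score` function, and the four-item `suite` are stand-ins invented for illustration, and a real system would need a far richer test suite. The sketch also assumes the evaluation code itself is not what is being modified, which is exactly the case the Eurisko question is worried about.

```python
import copy
from dataclasses import dataclass


@dataclass
class Agent:
    """Toy agent: says True when the input exceeds its threshold."""
    threshold: float

    def act(self, x: float) -> bool:
        return x > self.threshold


def score(agent, suite):
    """Fraction of (input, expected) pairs the agent gets right."""
    return sum(agent.act(x) == expected for x, expected in suite) / len(suite)


def try_modification(agent, modify, suite):
    """Apply `modify` to a sandboxed copy; keep it only if it does not score worse.

    Returns (agent_to_use, accepted). The original agent is never mutated.
    """
    candidate = copy.deepcopy(agent)
    modify(candidate)
    if score(candidate, suite) >= score(agent, suite):
        return candidate, True
    return agent, False


# Hypothetical test suite and two hypothetical modifications.
suite = [(0.2, False), (0.4, False), (0.6, True), (0.8, True)]
original = Agent(threshold=0.7)


def helpful(agent):
    agent.threshold = 0.5


def harmful(agent):
    agent.threshold = 0.1


improved, accepted_helpful = try_modification(original, helpful, suite)
kept, accepted_harmful = try_modification(original, harmful, suite)
```

Because the change is only ever applied to a deep copy, rejecting it is the same as doing nothing, so "undo" never has to be implemented separately.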
We may also have different pictures of what self-modification looks like. Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don’t really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.
Do you generally trust that you personally could be handed the key to human self-modification? I feel reasonably confident that such a tool would help me (or at least, not harm me, in that I might decide not to use it). Since it’s much easier for an AI to run experiments on copies of itself, it should be a much easier task for the AI to use such a tool well.
Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don’t really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.
I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?
I think it’s likely that there will be some safeguards in place, much in the way that you don’t get robust multicellular life without some mechanisms of correcting cancers when they develop. The root of my worry here is that I don’t expect this problem to be solved well if researchers aren’t thinking much about self-modification (and thus how to solve it well).
Do you generally trust that you personally could be handed the key to human self-modification?
I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don’t have enough transparency for the consequences. If I have a dial with my IQ on it, probably, or if I have a set of dials related to the strength of various motivations, probably, but here I would still feel like there are significant risks associated with moving outside normal bounds that I would be accepting because we live in weird times. [For example, it seems likely that some genes that increase intelligence also increase brain cancer risk, and it seems possible that ‘turning the IQ dial’ with this key would similarly increase my chance of having brain cancer.]
Similarly, being able to print the genome for potential children rather than rolling randomly or selecting from a few options seems like it would be useful and I would use it, but is not making the situation significantly safer and could easily lead to systematic problems because of correlated choices.
The thing I’m worried about is fixing only one of them
Right, I’m arguing that if you only fixed one of them, you would notice _immediately_, and either revert back to the version with both bugs, or find the other bug. I’m also claiming that this should be what happens in general, assuming sufficient caution around self-modification (on the AI’s part, not the researcher’s part).
I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?
I don’t think of gradient descent as self-modification. If an AI system were able to choose (i.e. learned a policy for) when to run gradient descent on itself, and what training data it should use for that, that might be self-modification. Meta-learning feels similar to me—the AI doesn’t get to choose training data or what to run gradient descent on. The only learned part in current meta-learning approaches is how to perform a task, not how to learn.
I don’t expect this problem to be solved well if researchers aren’t thinking much about self-modification (and thus how to solve it well).
This might be all of our disagreement actually. Like you, I’m quite pessimistic about any system where researchers put in place a protocol for self-modification, which seems to be what you are imagining. Either the protocol is too lax and we get the sort of issues you’re talking about, or it’s too strict and self-modification never happens. However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don’t have enough transparency for the consequences.
But in that case, you would simply choose not to use it, or to do a lot of research before trying to use it.
However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification). It seems likely that it could be in a situation like some of the Cake or Death problems, where it views a change to itself as impacting only part of its future behavior (like affecting actions but not values, such that it suspects that a future it that took path A would be disappointed in itself and fix that bug, without realizing that the change it’s making will cause future it to not be disappointed by path A), or is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
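The gap between "each change seems worth it" and "the policy leads to disaster" can be put in rough numbers. The Python sketch below assumes, purely for illustration, that every change is independently safe with the same probability; the 99% figure and the function names are made up, and real self-modifications would not be independent or equally risky.

```python
import math


def survival_probability(per_change_safety, n_changes):
    """Chance that none of n independent changes goes wrong (illustrative model)."""
    return per_change_safety ** n_changes


def changes_until_coin_flip(per_change_safety):
    """Smallest number of changes after which survival drops below 50%."""
    return math.ceil(math.log(0.5) / math.log(per_change_safety))


if __name__ == "__main__":
    p = 0.99  # hypothetical per-change confidence threshold
    for n in (1, 10, 100, 500):
        print(n, survival_probability(p, n))
    print(changes_until_coin_flip(p))
```

Under these assumptions, a 99% threshold that looks cautious for any single change leaves roughly a one-in-three chance of getting through a hundred changes without a failure.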
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).
I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes “sufficiently complex”. I’m not sure if humans perform sufficiently complex self-modifications, I think our first AGIs might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.
is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
+100. This is why I feel queasy about “OK, I judge this self-modification to be fine” when the self-modifications are sufficiently complex, if this judgment isn’t based off something like zero-shot reasoning (in which case we’d have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).
I agree this seems like a crux for me as well, subject to the caveat that I think we have different ideas of what “self-modification” is (though I’m not sure it matters that much).
Both of the comments feel to me like you’re making the AI system way dumber than humans, and I don’t understand why I should expect that. I think I could make a better human with high confidence/robustness if you give me a human-modification-tool that I understand reasonably well and I’m allowed to try and test things before committing to the better human.
The thing I’m worried about is fixing only one of them—see Reason as Memetic Immune Disorder.