adamshimi says almost everything I wanted to say in my review, so I am very glad he made the points he did, and I would love for both his review and the top level post to be included in the book.
The key thing I want to emphasize a bit more is that I think the post as given is very abstract, and I have personally gotten a lot of value out of trying to think of more concrete scenarios where gradient hacking can occur.
I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an inner optimizer that is now taking advantage of gradient hacking. While this is definitely a sufficient assumption, I don’t think it’s a necessary one, and my current models suggest that we should find this behavior even without inner optimizers. This also makes me somewhat more optimistic about studying it.
My thinking about this is still pretty fuzzy and in its early stages, but the reasoning goes as follows:
If we assume the lottery-ticket hypothesis of neural networks, we initialize our network with a large number of possible models of the world. In a sufficiently large network, some of those models will be accurate models not of the world, but of the training process of the very system that is currently being trained. This is pretty likely given that SGD isn’t very complicated and it doesn’t seem very hard to build a model of how it works.
From an evolutionary perspective, we are going to be selecting for networks that get positively rewarded by the gradient descent learning algorithm. Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network, creating a small ridge in the reward landscape that results in it itself getting most of the reward (This is currently very metaphorical and I feel fuzzy on whether this conceptualization makes sense). This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.
By default, luckily, this will create something I might want to call a “benign gradient hacker” that might deteriorate the performance of the system, but not obviously give rise to anything like a full inner optimizer. It seems that this strategy is simple enough that you don’t actually need anything close to a consequentialist optimizer to run into it. Instead it seems more analogous to cancer, in that it’s a way to hijack the natural selection mechanism of a system from the inside to get more resources, and, like cancer, it seems more likely to just hurt the performance of the overall system than to take systematic control over it.
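To make the “ridge” intuition above a bit less metaphorical, here is a minimal toy sketch. To be clear, this is my own hypothetical construction for this review, with the brittleness wired in by hand, so it says nothing about whether such structure actually gets selected for during training; it only illustrates the kind of loss-landscape shape I have in mind: a model whose useful computation is gated on a “protected” weight staying at a particular value, so that SGD’s own updates keep that weight pinned while the rest of the network fits the task.

```python
import torch

# Hypothetical toy (my construction, not from the post): learn y = 3x, but the
# useful weight `theta` only contributes through a gate that collapses if the
# "protected" weight `w` moves away from w0. At w = w0 the gradient of the loss
# with respect to w is exactly zero, so SGD never moves w, while theta remains
# free to fit the task. Once theta fits the data, any perturbation of w away
# from w0 strictly increases the loss, i.e. a hand-built ridge around w0.

torch.manual_seed(0)
x = torch.randn(256, 1)
y = 3.0 * x

theta = torch.zeros(1, requires_grad=True)   # the weight doing the real work
w = torch.ones(1, requires_grad=True)        # the "protected" weight
w0 = 1.0                                     # value the gate preserves

opt = torch.optim.SGD([theta, w], lr=0.05)
for _ in range(500):
    gate = torch.exp(-10.0 * (w - w0) ** 2)  # ~1 at w0, collapses otherwise
    loss = ((gate * theta * x - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(theta.item(), w.item())  # theta ends up near 3.0, w stays at w0
```

Again, the interesting and open question is whether anything like this gate gets stumbled upon and selected for during training, rather than being written in by hand the way I did here.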
This makes me think that the first paragraph of the post seems somewhat wrong to me when it says:
“Gradient hacking” is a term I’ve been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.
I think gradient hacking should refer to something somewhat broader that also captures situations like the above where you don’t have a deceptively aligned mesa-optimizer, but still have dynamics where you select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it’s plausible that Evan intends the term “deceptively aligned mesa-optimizer” to refer to something broader that would also capture the scenario above.
-----
Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post to the more general idea that if you have a very simple training process whose output can often easily be predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms, and to study which training mechanisms are more easily hacked like this and which ones are not.
As I said elsewhere, I’m glad that my review captured points you deem important!
I think one of the weakest aspects of the post is that it starts with the assumption that an AI system has already given rise to an inner optimizer that is now taking advantage of gradient hacking. While this is definitely a sufficient assumption, I don’t think it’s a necessary one, and my current models suggest that we should find this behavior even without inner optimizers. This also makes me somewhat more optimistic about studying it.
I agree that gradient hacking isn’t limited to inner optimizers; yet I don’t think that defining it that way in the post was necessarily a bad idea. First, it’s for coherence with Risks from Learned Optimization. Second, assuming some internal structure definitely helps with conceptualizing the kind of things that count as gradient hacking. With an inner optimizer, you can say relatively unambiguously “it tries to protect its mesa-objective”, as there should be an explicit representation of it. That becomes harder without the inner optimization hypothesis.
That being said, I am definitely focusing on gradient hacking as an issue with learned goal-directed systems instead of learned optimizers. This is one case where I have argued that a definition of goal-directedness would allow us to remove the explicit optimization hypothesis without sacrificing the clarity it brought.
If we assume the lottery-ticket hypothesis of neural networks, we initialize our network with a large number of possible models of the world. In a sufficiently large network, some of those models will be accurate models not of the world, but of the training process of the very system that is currently being trained. This is pretty likely given that SGD isn’t very complicated and it doesn’t seem very hard to build a model of how it works.
Two thoughts about that:
Even if some subnetwork basically captures SGD (or the relevant training process), I’m unconvinced that it would be useful in the beginning, and so it might be “written over” by the updates.
Related to the previous point, it looks crucial to understand what is needed in addition to a model of SGD in order to gradient hack. Which brings me to your next point.
From an evolutionary perspective, we are going to be selecting for networks that get positively rewarded by the gradient descent learning algorithm. Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network, creating a small ridge in the reward landscape that results in it itself getting most of the reward (This is currently very metaphorical and I feel fuzzy on whether this conceptualization makes sense). This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.
I’m confused about what you mean here. If the point is to make the network a local minimum, you probably just have to make it very brittle to any change. I’m also not sure what you mean by competing networks. I assumed it meant the neighboring models in model space, which are reachable by reasonable gradients. If that’s the case, then I think my example is simpler and doesn’t need the SGD modelling. If not, then I would appreciate more detailed explanations.
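To make concrete what I mean by “brittle to any change”, here is a minimal sketch (a hypothetical toy constructed for this comment, not necessarily the example from my review): if the network’s output is gated on some weight staying at its current value, then the current weights are a local minimum of the loss along that direction, with no modelling of SGD required; the brittleness alone does the work.

```python
import numpy as np

# Hypothetical toy, just to illustrate "brittle to any change": the model's
# output is scaled by a gate that collapses when the weight `w` moves away from
# its current value w0. Sweeping w shows the loss is locally minimal at w = w0,
# and no model of the training process is involved anywhere.

rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = 3.0 * x
theta = 3.0    # assume the useful part of the network already fits the task
w0 = 1.0       # current value of the "protected" weight

def loss(w):
    gate = np.exp(-10.0 * (w - w0) ** 2)  # 1 at w0, smaller anywhere else
    pred = gate * theta * x
    return np.mean((pred - y) ** 2)

for w in (0.7, 0.9, 1.0, 1.1, 1.3):
    print(f"w = {w:.1f}  loss = {loss(w):.3f}")  # the minimum sits at w = w0
```

Of course this says nothing about how such a gate would arise from training; it only shows that once it is there, no explicit modelling of SGD is needed to protect the weight.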
By default, luckily, this will create something I might want to call a “benign gradient hacker” that might deteriorate the performance of the system, but not obviously give rise to anything like a full inner optimizer. It seems that this strategy is simple enough that you don’t actually need anything close to a consequentialist optimizer to run into it. Instead it seems more analogous to cancer, in that it’s a way to hijack the natural selection mechanism of a system from the inside to get more resources, and, like cancer, it seems more likely to just hurt the performance of the overall system than to take systematic control over it.
Why is that supposed to be a good thing? Sure, inner optimizers with misaligned mesa-objectives suck, but so do gradient hackers without inner optimization. Anything that helps ensure that training cannot correct discrepancies and/or errors with regard to the base-objective sounds extremely dangerous to me.
I think gradient hacking should refer to something somewhat broader that also captures situations like the above where you don’t have a deceptively aligned mesa-optimizer, but still have dynamics where you select for networks that adversarially use knowledge about the SGD algorithm for competitive advantage. Though it’s plausible that Evan intends the term “deceptively aligned mesa-optimizer” to refer to something broader that would also capture the scenario above.
AFAIK, Evan really means inner optimizer in this context, with actual explicit internal search. Personally I agree about including situations where the learned model isn’t an optimizer but is still in some sense goal-directed.
Separately, as an elaboration, I have gotten a lot of mileage out of generalizing the idea of gradient hacking in this post to the more general idea that if you have a very simple training process whose output can often easily be predicted and controlled, you will run into similar problems. It seems valuable to try to generalize the theory proposed here to other training mechanisms, and to study which training mechanisms are more easily hacked like this and which ones are not.
Hum, I hadn’t thought of this generalization. Thanks for the idea!
Some of the networks that have an accurate model of the training process will stumble upon the strategy of failing hard if SGD would reward any other competing network
I think the part in bold should instead be something like “failing hard if SGD would (not) update weights in such and such way”. (SGD is a local search algorithm; it gradually improves a single network.)
This strategy seems more complicated, so is less likely to randomly exist in a network, but it is very strongly selected for, since at least from an evolutionary perspective it appears like it would give the network a substantive advantage.
As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human who decides to 1-box in Newcomb’s problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of “being a person that 1-boxes”, because having that property is instrumentally useful). The specific strategy of 1-boxing is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).
I think the part in bold should instead be something like “failing hard if SGD would (not) update weights in such and such way”. (SGD is a local search algorithm; it gradually improves a single network.)
Agreed. I said something similar in my comment.
As I already argued in another thread, the idea is not that SGD creates the gradient hacking logic specifically (in case this is what you had in mind here). As an analogy, consider a human who decides to 1-box in Newcomb’s problem (which is related to the idea of gradient hacking, because the human decides to 1-box in order to have the property of “being a person that 1-boxes”, because having that property is instrumentally useful). The specific strategy of 1-boxing is not selected for by human evolution, but rather general problem-solving capabilities were (and those capabilities resulted in the human coming up with the 1-box strategy).
Thanks for the concrete example; I think I understand better what you meant. What you describe looks like the hypothesis “Any sufficiently intelligent model will be able to gradient hack, and thus will do it”. Which might be true. But I’m actually more interested in the question of how gradient hacking could emerge without having to pass that threshold of intelligence, because I believe such examples will be easier to interpret and study.
So in summary, I do think what you say makes sense for the general risk of gradient hacking, yet I don’t believe it is really useful for studying gradient hacking with our current knowledge.
It does seem useful to make the distinction between thinking about what gradient hacking failures look like in worlds where they cause an existential catastrophe, and thinking about how best to pursue empirical research about gradient hacking today.