I am confused by your confusion. Your basic question is “what is the source of the adversarial selection”. The answer is “the system itself” (or in some cases, the training/search procedure that produces the system satisfying your specification). In your linked comment, you say “There’s no malicious ghost trying to exploit weaknesses in our alignment techniques.” I think you’ve basically hit on the crux, there. The “adversarially robust” frame is essentially saying you should think about the problem in exactly this way.
I think Eliezer has conceded that Stuart Russell puts the point best. It goes something like: “If you have an optimization process in which you forget to specify every variable that you care about, then unspecified variables are likely to be set to extreme values.” I would tack on that, due to the fragility of human value, it’s much easier to set such a variable to an extremely bad value than an extremely good one.
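To make that concrete, here’s a toy sketch of my own (not Russell’s example, and the numbers are made up): we ask an optimizer to minimize delivery time and forget to mention that we also care about the mess made along the way.

```python
# Toy illustration: optimize only the variable we remembered to specify,
# and watch the variable we forgot about get pushed to an extreme value.
import numpy as np

speeds = np.linspace(0.1, 100.0, 1000)   # candidate plans, indexed by speed
delivery_time = 10.0 / speeds            # the variable we specified (lower is better)
mess_made = speeds ** 2                  # the variable we forgot to specify

best = np.argmin(delivery_time)          # the optimizer only sees the proxy
print(f"chosen speed: {speeds[best]:.1f}")     # pinned to the boundary: 100.0
print(f"mess made:    {mess_made[best]:.0f}")  # ~10,000 -- an extreme value
```

Nothing here pushes `mess_made` anywhere in particular; it ends up extreme simply because extreme speeds are where the specified variable looks best.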
Basically, however the goal of the system is specified or represented, you should ask yourself if there’s some way to satisfy that goal in a way that doesn’t actually do what you want. Because if there is, and it’s simpler than what you actually wanted, then that’s what will happen instead. (Side note: the system won’t literally do something just because you would hate it. But the same is true for other Goodhart examples. Companies in the Soviet Union didn’t game the targets because they hated the government, but because it was the simplest way to satisfy the goal as given.)
“If the system is trying/wants to break its safety properties, then it’s not safe/you’ve already made a massive mistake somewhere else.” I mean, yes, definitely. Eliezer makes this point a lot in some Arbital articles, saying stuff like “If the system is spending computation searching for things to harm you or thwart your safety protocols, then you are doing the fundamentally wrong thing with your computation and you should do something else instead.” The question is how to do so.
Also from your linked comment: “Cybersecurity requires adversarial robustness, intent alignment does not.” Okay, but if you come up with some scheme to achieve intent alignment, you should naturally ask “Is there a way to game this scheme and not actually do what I intended?” Take this Arbital article on the problem of fully-updated deference. Moral uncertainty has been proposed as a solution to intent alignment. If the system is uncertain as to your true goals, then it will hopefully be deferential. But the article lays out a way the system might game the proposal. If the agent can maximize its meta-utility function over what it thinks we might value, and still not do what we want, then clearly this proposal is insufficient.
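Here’s a minimal numerical sketch of that failure mode (my own toy model, not the article’s formalism; all numbers are illustrative): an agent that maximizes expected utility over a posterior on what the human wants will defer while it’s uncertain, but once the posterior is concentrated enough, overriding the human becomes the expected-utility-maximizing move.

```python
# Toy model of fully-updated deference.
import numpy as np

# U[h, a] = utility of object-level action a if hypothesis h describes the
# human's true preferences. Two hypotheses, two actions.
U = np.array([[1.0, -1.0],    # hypothesis A: action 0 is good, action 1 is bad
              [-1.0, 1.0]])   # hypothesis B: the reverse

DEFER_COST = 0.25             # small cost of pausing and letting the human decide

def best_policy(p_A):
    """Does the agent defer to the human, or act on its own best guess?"""
    posterior = np.array([p_A, 1.0 - p_A])
    act_value = max(posterior @ U[:, a] for a in range(U.shape[1]))
    defer_value = posterior @ U.max(axis=1) - DEFER_COST  # the human picks the right action
    return "defer" if defer_value > act_value else "act on own guess"

for p in (0.5, 0.7, 0.9, 0.99):
    print(f"P(hypothesis A) = {p:.2f} -> {best_policy(p)}")
# The agent defers at p = 0.5 and 0.7, but once it has (nearly) fully updated,
# its meta-utility function is maximized by just acting: deference was
# instrumental, not terminal.
```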
If you propose an intent alignment scheme such that when we ask “Is there any way the system could satisfy this scheme and still be trying to harm us?”, the answer is “No”, then congrats, you’ve solved the adversarial robustness problem! That seems to me to be the goal and the point of this way of thinking.
I think that the intuitions from “classical” multivariable optimization are poor guides for thinking about either human values or the cognition of deep learning systems. To highlight a concrete (but mostly irrelevant, IMO) example of how they diverge, this claim:
If you have an optimization process in which you forget to specify every variable that you care about, then unspecified variables are likely to be set to extreme values.
is largely false for deep learning systems, whose parameters mostly[1] don’t grow to extreme positive or negative values. In fact, in the limit of wide networks under the NTK initialization, the average change in parameter values goes to zero.
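(If you want to spot-check the parameter-magnitude claim yourself, here’s a minimal sketch; it assumes `torch` and `transformers` are installed, and GPT-2 is just a convenient pretrained example.)

```python
# Spot-check: the weights of a heavily optimized network do not sit at extreme values.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
weights = torch.cat([p.detach().flatten() for p in model.parameters()])

print(f"parameter count      : {weights.numel():,}")
print(f"mean |w|             : {weights.abs().mean().item():.4f}")
print(f"fraction with |w| > 1: {(weights.abs() > 1).float().mean().item():.6f}")
print(f"max |w|              : {weights.abs().max().item():.2f}")
# The bulk of the weights sit close to zero; the largest magnitudes typically
# come from a handful of special parameter groups such as layer norms (see the
# footnote below).
```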
Additionally, even very strong optimization towards a given metric does not imply that the system will pursue whatever strategy is available that minimizes the metric in question. E.g., GPT-3 was subjected to an enormous amount of optimization pressure to reduce its loss. But GPT-3 itself does not behave as though it has any desire to decrease its own loss. If you ask it to choose its own curriculum, it won’t default to the most easily predicted possible data.
Related: humans don’t sit in dark rooms all day, even though doing so would minimize predictive error, and their visual cortices both (1) optimize towards low predictive error and (2) have pathways available by which they can influence their human’s motor behavior.

Related: Reward is not the optimization target

[1] Some specific subsets of weights, such as those in layer norms, can be an exception here, as they can grow into the hundreds for some architectures.
Another source of adversarial robustness issues is the model itself becoming deceptive.
As for this:
My intuitions are that imagining a system is actively trying to break safety properties is a wrong framing; it conditions on having designed a system that is not safe.
I unfortunately think this is exactly what real-world AI companies are building.