I think that the intuitions from “classical” multivariable optimization are poor guides for thinking about either human values or the cognition of deep learning systems. To highlight a concrete (but mostly irrelevant, IMO) example of how they diverge, this claim:
If you have an optimization process in which you forget to specify every variable that you care about, then unspecified variables are likely to be set to extreme values.
is largely false for deep learning systems, whose parameters mostly[1] don’t grow to extreme positive or negative values. In fact, in the limit of wide networks under the NTK initialization, the average change in parameter values goes to zero.
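As a rough empirical check (a minimal sketch assuming PyTorch; the width, data, and hyperparameters are arbitrary illustrative choices), you can train a fairly wide network on a toy task and confirm that individual parameters neither blow up nor move very far from their initialization:

```python
# Minimal sketch (assumes PyTorch; the width, data, and hyperparameters are
# arbitrary illustrative choices): train a wide MLP on a toy regression task,
# then check whether any parameter has grown extreme or moved far from init.
import torch
import torch.nn as nn

torch.manual_seed(0)

width = 4096
model = nn.Sequential(nn.Linear(10, width), nn.ReLU(), nn.Linear(width, 1))
init_params = [p.detach().clone() for p in model.parameters()]

# Random data standing in for a real training set.
x = torch.randn(2048, 10)
y = torch.randn(2048, 1)

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(1000):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Per-parameter magnitudes and movement away from initialization.
for (name, p), p0 in zip(model.named_parameters(), init_params):
    delta = (p.detach() - p0).abs()
    print(f"{name}: max |w| = {p.detach().abs().max().item():.3f}, "
          f"max |dw| = {delta.max().item():.3f}, "
          f"mean |dw| = {delta.mean().item():.5f}")
```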
Additionally, even very strong optimization towards a given metric does not imply that the resulting system will pursue whatever available strategy most reduces that metric. E.g., GPT-3 was subjected to an enormous amount of optimization pressure to reduce its training loss, but GPT-3 itself does not behave as though it has any desire to decrease its own loss. If you ask it to choose its own curriculum, it doesn't default to the most easily predicted data available.
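One way to see the gap between "was optimized to reduce loss" and "tries to reduce its own loss" (a hedged sketch assuming the Hugging Face transformers library, with GPT-2 standing in for GPT-3, which isn't publicly downloadable; the prompt wording is made up for illustration) is to compare the loss the model assigns to maximally predictable text with the text it actually produces when asked to pick its own reading material:

```python
# Hedged sketch (assumes Hugging Face `transformers`; GPT-2 stands in for
# GPT-3, and the prompt below is an invented illustration, not a benchmark).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def loss_on(text: str) -> float:
    """Average next-token cross-entropy the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# The loss-minimizing "curriculum": trivially predictable, degenerate text.
print("loss on repeated text:", loss_on("the " * 200))

# Ask the model to pick its own reading material and sample a continuation.
prompt = "If I could choose any text to read next, I would choose:"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=60, do_sample=True, top_p=0.95,
                         pad_token_id=tok.eos_token_id)
chosen = tok.decode(out[0, ids.shape[1]:])
print("model-chosen text:", chosen)
print("loss on model-chosen text:", loss_on(chosen))
```

On a typical run the repeated string gets a far lower loss than anything the model writes for itself, which is the point: the model was shaped by loss minimization, but it doesn't select loss-minimizing data when given the chance.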
Related: humans don't sit in dark rooms all day, even though doing so would minimize predictive error, and even though their visual cortices both (1) optimize towards low predictive error and (2) have pathways available through which they can influence their human's motor behavior.
Related: Reward is not the optimization target
[1] Some specific subsets of weights, such as those in layer norms, can be an exception here, as they can grow into the hundreds for some architectures.
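For the curious, here's a quick way to check that caveat on a particular model (assuming Hugging Face transformers, with GPT-2 as an arbitrary example; other architectures may show much larger layer-norm gains):

```python
# Quick check of the footnote's caveat (assumes Hugging Face `transformers`;
# GPT-2 is just an arbitrary example model): compare the largest absolute
# weight inside layer norms against the largest absolute weight elsewhere.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

ln_max, other_max = 0.0, 0.0
for name, p in model.named_parameters():
    m = p.detach().abs().max().item()
    # GPT-2 names its layer norms "ln_1", "ln_2", and "ln_f".
    if ".ln_" in name:
        ln_max = max(ln_max, m)
    else:
        other_max = max(other_max, m)

print(f"max |weight| inside layer norms: {ln_max:.2f}")
print(f"max |weight| elsewhere:          {other_max:.2f}")
```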