This seems like the most important crux. Why should we not expect the maximizer we trained to X-maximize to use its affordances to maximize X’, where X’ is the exact actual thing the training feedback represents as a target, and that differs at least somewhat from X? Why should we expect to like the way it does that, even if X’ did equal X? I do not understand the other perspective.
Because that’s what happens: humans don’t always wirehead, and neural networks don’t always overfit. Training feedback is not utility; there are also local effects of the training process and of the space of training data.
What I’m saying, as it applies to us: in this case and at human level, with our level of affordances/compute/data/etc., humans have found that the best way to maximize X’ is to instead maximize some set of things Z, where Z is a complicated array of intermediate goals, so we have evolved and learned to do exactly that. The exact composition of Z aims for X’, and often misses. And because of the ways humans interact with the world and with other humans, if you don’t do something pretty complex, you won’t get much X’ or X, so I don’t know what else you were expecting. The world punishes wireheading pretty hard.
But, as our affordances/capabilities/local data increase and we exert more optimization pressure, we should expect to see more narrow and accurate targeting of X’, in ways that generalize less nicely. And indeed I think we largely do see this in humans even over our current ranges, fwiw, in highly robust fashion (although not universally).
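(A minimal toy sketch of that claim, mine rather than anything from the original exchange: `true_X` is what we actually wanted, `proxy_X_prime` is what the feedback measures, and both names are invented for illustration. Over the ordinary range the two mostly agree; widen the reachable action space and optimize the proxy harder, and the true target falls off a cliff.)

```python
# Toy sketch (my own illustration, not the author's model): a proxy X' that
# agrees with the real target X across a narrow range of affordances, but
# comes apart as the reachable action space grows.
import numpy as np

rng = np.random.default_rng(0)

def true_X(a):
    # What we actually wanted: more of a helps, but extremes are ruinous.
    return a - 0.1 * a**2

def proxy_X_prime(a):
    # What the feedback actually measures: simply "more a is better".
    return a

for radius in (1.0, 5.0, 50.0):  # growing affordances / optimization reach
    actions = rng.uniform(-radius, radius, size=100_000)
    chosen = actions[np.argmax(proxy_X_prime(actions))]  # maximize the proxy
    print(f"reach ±{radius:>4}: chose a={chosen:7.2f}, "
          f"proxy={proxy_X_prime(chosen):7.2f}, true X={true_X(chosen):8.2f}")
```

With a small reach, maximizing the proxy also does fine on the true target; with a large reach, the proxy score keeps climbing while the true target goes sharply negative.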
Neural networks do not always overfit if we get the settings right. But they very often do; we just throw those runs out and try again, which is also part of the optimization process. As I understand it, if you give them the ability to fully overfit, they will totally do it, which is one reason you have to mess with the settings: you have to set them up so that the way to max X’ is not to do what we would call an overfit, either specifically or generalizations of that, which is a large part of why everything is so finicky.
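(Again a toy sketch of my own, not anything from the original exchange, using scikit-learn as an assumed stand-in for "a network given the ability to fully overfit": train a large, unregularized MLP on pure noise and it will happily memorize the training set while learning nothing that generalizes.)

```python
# Toy sketch (my own illustration): give a high-capacity network the "ability
# to fully overfit" random labels and it will.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # 200 random inputs
y = rng.integers(0, 2, size=200)      # labels carry no signal at all

# Large hidden layer, no regularization, plenty of iterations: the "settings"
# that let the network memorize.
net = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0,
                    solver="lbfgs", max_iter=5000, random_state=0)
net.fit(X, y)
print("train accuracy:", net.score(X, y))  # roughly 1.0: it memorized the noise

# Fresh data drawn the same way: nothing generalizes, as expected.
X_new = rng.normal(size=(200, 20))
y_new = rng.integers(0, 2, size=200)
print("fresh-data accuracy:", net.score(X_new, y_new))  # roughly 0.5: chance
```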
Or, the general human pattern: when dealing with arbitrary outside forces more powerful than you, which you don’t fully understand, you learn to, and do, abide by the spirit of the enterprise, avoid ruffling feathers (impact penalties), not aim too narrowly, try not to lie, and so on. But that’s because we lack the affordances to stop doing that.