What I’m saying, as it applies to us: in this case, at human level, with our level of affordances/compute/data/etc., humans have found that the best way to maximize X’ is instead to maximize some set of things Z, where Z is a complicated array of intermediate goals, and so we have evolved/learned to do exactly that. The exact composition of Z aims for X’, and often misses. And because of the ways humans interact with the world and with other humans, if you don’t do something pretty complex you won’t get much X’ or X, so I don’t know what else you were expecting; the world punishes wireheading pretty hard.
But as our affordances/capabilities/local data increase and we exert more optimization pressure, we should expect to see narrower and more accurate targeting of X’, in ways that generalize less nicely. And indeed I think we largely do see this in humans even over our current ranges, fwiw, in highly robust fashion (although not universally).
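To make the “more optimization pressure, narrower targeting, worse generalization” point concrete, here is a toy sketch of my own, not anything from the discussion above: the true goal X, the proxy X’, the error model (mostly accurate, occasionally way off), and the idea that “pressure” equals the size of the candidate pool you search before committing are all made-up illustrative choices. The pattern it typically shows is that the proxy score of the winning pick keeps climbing with pressure while its true value drains back toward zero, because the hardest-optimized picks are exactly the ones where the proxy and the goal come apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def goodhart_trial(pressure):
    """One round: search `pressure` candidates, pick the proxy-argmax,
    and return the (proxy score, true score) of that pick."""
    true_x = rng.normal(0.0, 1.0, size=pressure)        # X: what we actually care about
    # Proxy error: usually tiny, occasionally huge -- the proxy tracks X
    # well in the ordinary range and fails badly in the tails.
    big_miss = rng.random(pressure) < 0.01
    error = np.where(big_miss,
                     rng.normal(0.0, 20.0, size=pressure),
                     rng.normal(0.0, 0.1, size=pressure))
    proxy = true_x + error                               # X': what actually gets optimized
    best = np.argmax(proxy)
    return proxy[best], true_x[best]

for pressure in [10, 100, 1_000, 10_000]:
    results = np.array([goodhart_trial(pressure) for _ in range(2_000)])
    print(f"pressure={pressure:>6}  "
          f"mean proxy of pick={results[:, 0].mean():6.2f}  "
          f"mean true value of pick={results[:, 1].mean():6.2f}")
```

At low pressure the proxy and the true goal move together; at high pressure the winner is almost always one of the rare candidates where the proxy is badly wrong in its favor, which is the dynamic being described.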
Neural networks do not always overfit if we get the settings right. But they very often do; we just throw those runs out and try again, which is also part of the optimization process. As I understand it, if you give them the ability to fully overfit, they will totally do it, which is one reason you have to mess with the settings: you have to set them up so that the way to max X’ is not to do what we would call an overfit, either specifically or generalizations of that, which is a large part of why everything is so finicky.
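A minimal sketch of the “give it the capacity and it will use it to overfit” point, using plain numpy polynomial fitting as a stand-in for a neural network; the function, degrees, sample size, and noise level are arbitrary illustrative choices of mine. Training error here is the X’ the fit is actually optimizing; error on fresh points is closer to the X we wanted.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Tiny noisy dataset drawn from a simple underlying function.
def target(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, size=15)
y_train = target(x_train) + rng.normal(0, 0.2, size=15)
x_test = np.linspace(0, 1, 200)
y_test = target(x_test)

for degree in [3, 12]:
    # Degree 12 with 15 noisy points is the "fully able to overfit" regime:
    # training error collapses while error on fresh points typically blows up.
    fit = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((fit(x_train) - y_train) ** 2)
    test_mse = np.mean((fit(x_test) - y_test) ** 2)
    print(f"degree={degree:>2}  train MSE={train_mse:.4f}  test MSE={test_mse:.2f}")
```

“Messing with the settings” is then the business of picking the degree, the regularization, the stopping point, and so on so that driving X’ down stops being achievable by memorizing the noise.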
Or, the general human pattern: when dealing with arbitrary outside forces more powerful than you, that you don’t fully understand, you learn to (and do) abide by the spirit of the enterprise, to avoid ruffling feathers (impact penalties), to not aim too narrowly, to try not to lie, and so on. But that’s because we lack the affordances to stop doing that.