I think here it makes sense to talk about internal parts as real things, separate from behavior. And similarly in the single-agent case: there are physical mechanisms producing the behavior, which can have different characteristics, and which in particular can be 'in conflict' (in a way that motivates change) or not. I think it is also worth observing that humans find their preferences 'in conflict' and try to resolve them, which suggests that they, at least, are better understood in terms of both behavior and underlying preferences that are separate from it.
I think this is worth highlighting as something we too often ignore to our peril. Paying attention to internal parts is sometimes "annoying" in the sense that we can build models that are much easier to reason about by ignoring mechanisms and simply treating things like AIs as black boxes (or as made up of a small number of black boxes) with some behavior we can observe from the outside. But doing so will result in our consistently being surprised in ways we needn't have been.
For example, suppose you treat two AIs as if they are EU maximizers and you model the utility function they are maximizing, but they actually behave in different ways in some situation even though the modeled utility function predicted the same behavior. And I don't think this is just a failure to make a good enough model of the utility function; I think it's fundamental, the way Goodhart is: when we model something we are necessarily measuring it rather than getting the real thing, and we will necessarily risk surprise when the model and the real thing do different things.
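To make that concrete, here's a minimal toy sketch (all names and numbers are made up for illustration, not anyone's actual model or agents): two agents whose choices look like the same utility maximizer on every situation we happened to observe, but whose internal mechanisms differ, so they diverge in a new situation where the shared behavioral model predicts no difference.

```python
# Each "situation" offers two options with observable payoffs (hypothetical values).
observed_situations = [
    {"A": 3.0, "B": 1.0},
    {"A": 0.5, "B": 2.0},
]
novel_situation = {"A": 2.0, "B": 2.0}  # the modeled utility is indifferent here

def agent_1(situation):
    # Internally: maximize the payoff directly.
    return max(situation, key=situation.get)

def agent_2(situation):
    # Internally: also maximize payoff, but break ties via a hidden preference for "B".
    best = max(situation.values())
    tied = [opt for opt, val in situation.items() if val == best]
    return "B" if "B" in tied else tied[0]

# Our black-box model of both agents: "they maximize the payoff", fit from observations.
def modeled_choice(situation):
    return max(situation, key=situation.get)

# On everything we observed, the single behavioral model looks perfect for both agents.
for s in observed_situations:
    assert agent_1(s) == agent_2(s) == modeled_choice(s)

# Off the observed distribution, the internals create a difference the shared
# utility-function model never predicted.
print(agent_1(novel_situation), agent_2(novel_situation))  # prints: A B
```

The point of the sketch is just that behavioral equivalence on observed data doesn't pin down the mechanism, so any model that ignores internals leaves room for this kind of surprise.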
So the less detailed our models, and the more they ignore internals, the more we put ourselves at risk. Anyway, this is kind of a tangent from the post, but I feel like I constantly see models used to explain AI that push out important internals in the name of simplicity, and that creates real risks of confusing ourselves in our attempts to build aligned AI.