I agree with you that people get sloppy with these terms, and this seems bad. But there’s something important to me about holding space for uncertainty, too. I think that we understand practically every term on this list exceedingly poorly. Yes, we can point to things in the world, and sometimes even the mechanisms underlying them, but we don’t know what we mean in any satisfyingly general way. E.g. “agency” does not seem well described to me as “trained by reinforcement learning.” I don’t really know what it is well described by, and that’s the point. Pretending otherwise only precludes us from trying to describe it better.
I think there’s a lot of room for improvement in how we understand minds, i.e., I expect science is possible here. So I feel wary of mental moves like this one, e.g., replacing “optimal” with “set of sequential actions which have subjectively maximal expected utility relative to [entity X]’s imputed beliefs,” as if that settled the matter. It gives a sense that we know what we’re talking about when I don’t think we do. Is a utility function the right way to model an agent? Can we reliably impute beliefs? How do we know we’re doing that right, or that when we say “belief” it maps to something that is in fact like a belief? What is a belief? Why actions instead of world states? And so on.
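For concreteness, one way that proposed unpacking might be written out is something like the following (a sketch only; the action sequence $a_{1:T}$, the imputed belief distribution $b_X$, and the imputed utility function $U_X$ are notation I’m introducing for illustration, not anything the definition itself fixes):

$$ a^{*}_{1:T} \;=\; \arg\max_{a_{1:T}} \; \mathbb{E}_{s \sim b_X}\!\left[\, U_X(a_{1:T}, s) \,\right] $$

Even spelled out like this, every symbol hides one of the questions above: whether a $U_X$ exists, whether $b_X$ can be reliably imputed, and whether maximizing over action sequences rather than world states is the right move.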
It seems good to aim for precision and gears-level understanding wherever possible. But I don’t want this to convince us that we aren’t confused. Yes, we could replace the “tool versus agent” debate with things like “was it trained via RL or not,” or what have you, but it wouldn’t be very satisfying because ultimately that isn’t the thing we’re trying to point at. We don’t have good definitions of mind-type things yet, and I don’t want us to forget that.