johnswentworth comments on Classification of AI alignment research: deconfusion, “good enough” non-superintelligent AI alignment, superintelligent AI alignment

johnswentworth 15 Jul 2020 18:55 UTC
6 points
All alignment is “good enough” alignment, there is no such thing as “perfect” alignment except in idealized theory.
I strongly disagree with this. It may be true in some technical sense—e.g. we can’t be 100% certain there’s not a bug in our code—but I do think there exists a sharp, qualitative distinction between systems which are optimizing-for-the-thing-we-call-human-values and systems which aren’t doing that. Most likely underlying generator of disagreement: I think there’s a natural, precise notion of what we mean when we point to “human values”, in much the same way that there’s a natural, precise notion of what we mean when we point to a flower. There’s still multiple steps between pointing to flowers and pointing to human values, but one feature I expect to carry over is that it’s not an underspecified or fully-subjective notion—there is a well-defined sense in which the physical system of molecules comprising a human brain “wants things”, and a well-defined notion of what that system wants.
- Rohin Shah 15 Jul 2020 19:21 UTC
  7 points
  I broadly agree with this perspective (and in fact it’s one of my reasons for optimism about AI alignment).
  But usually when LessWrongers argue against “good enough” alignment, they’re arguing against alignment methods, saying that “nothing except proofs” will work, because only proofs give near-100% confidence. (I might be strawmanning this argument, I don’t really understand it.)
  You’re talking about the internal structure of the AI system (is the AI system actually in fact optimizing for “human values”, or something else), where I do expect a sharper, qualitative distinction. I’m claiming that our ability to get on the right side of that distinction is relatively smooth across the methods that we could use.
  Part of my optimism about AI alignment (relative to LW) comes from thinking that since there (probably) is a relatively sharp qualitative divide between “aligned computation” and “unaligned computation”, the “engineering approach” has more of a shot at working. (This isn’t a big factor in my optimism though.)
- Charlie Steiner 15 Jul 2020 23:13 UTC
  2 points
  I almost ended up writing a whole post more or less psychologizing this point recently.
  Quotes from the probably-never-to-be-published post, which I might as well fillet out to present here:
  Last year I was thinking about how humans refer to things. For example, when I say “human values,” it seems like I am pointing to something (some thing), as surely as if I was using my finger to point at some material object. And so if we want an AI to learn about human values, it sure would be nice if it could follow that pointer out to the thing-being-pointed-to.
  At the time, it wasn’t at all obvious to me that I had already stepped off the path, but I had. Rather than trying to understand this thing humans do—refer to things—in terms of the map-making problem humans actually face [From earlier: The physical world is really complicated. Humans get some information about the world via the senses, and then we model it so that we can make sense of our senses, predict the world, and make plans. This can be a really useful starting point for explanations of confusing phenomena.], I had framed the problem with an analogy to physical objects. As if the analogy was clean, and as if objects were natural (dare I say directly-perceived) building blocks of the world.
  It’s a very tricky mistake to avoid, this thing of thinking that reality will respect your labels. I wanted to understand the “human values” label, and so I mistakenly tried to look for the process by which we associate that label with some natural object, or even natural pattern, out in the world that corresponds to “human values.” But reality doesn’t have objects for things just because we have labels for them. This is the fallacy of essentialism—the notion that if we have a word like “roundness,” then there must be some thing out in the world that is roundness. The roundness-essence, if you will.
  EDIT: To forestall the obvious objection to the last sentence that roundness is a pattern, and surely with a little elbow grease you could write down something about spherical symmetry that is equivalent to roundness-essence, the most relevant point to human values is that even if we have a label for a pattern, that pattern still doesn’t have to exist. The label-making process of the human brain does not first require comprehension of some referent of the label.
  Rather than finding a theory in which we can find a precise notion of human values, we need a theory in which we can do okay despite not having a precise notion of human values (yes, I agree that sounds paradoxical). And by the naturalization thesis, this sort of reasoning plausibly also applies to an aligned AI.
  This isn’t “rah rah type 2 research, boo type 1 research.” What I mean is that I think the indeterminacy of human values connects the two together, like the critical point of water allows for a continuous transition between liquid and gas.
  - johnswentworth 16 Jul 2020 1:15 UTC
    2 points
    Counterargument: suppose a group of humans split off from the rest of humanity long enough ago that they have no significant shared cultural background. They develop language independently. Assuming they live in an area with trees, do they still develop a word for “tree”, recognize individual trees as objects, and generally have a notion of tree which matches our notion? I think the answer is pretty clearly “yes”—in part because the number of examples a baby needs to learn what a word means is not nearly large enough to narrow down the massive object space unless they already have some latent classification for those objects.
    It’s true that the label-making process of the human brain does not require a referent in order to generate a word, but most words have them anyway, including (but not limited to) any word whose meaning can be reasonably-reliably communicated to someone who’s never heard it before using fewer than a million examples.
    One human can have a word for a pattern which doesn’t exist. Two humans can use that word. But if you put the two humans in separate, identical rooms and ask them both to point to the <word>, and they consistently point to the same thing, then that’s pretty clear evidence that the pattern exists in the world. “Human values” are a bit too abstract for that exact test, but I think we have more than enough analogous evidence to conclude that they do exist.
    - Charlie Steiner 16 Jul 2020 4:26 UTC
      4 points
      Okay, let’s go with “tree.” Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that’s planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?
      How likely do you think it is that there’s some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?
      Trees obviously exist. And I agree with you that a clever clusterer will probably find some cluster that more or less overlaps with “tree” (though who knows, there’s probably a culture out there that has a word for woody-stemmed plants but not for trees specifically, or no word for trees but words for each of the three different kinds of trees in their environment specifically).
      But an AI that’s trying to find the “one true definition of trees” will quickly run into problems. There is no thing, nothing with the properties intuitive to an object or substance, that defines trees. And if you make an AI that goes out and looks at the world and comes up with its own clusterings and then tries to learn what “tree” means from relatively few examples, this is precisely a ‘good-enough’ hack of the type 2 variety.
      - johnswentworth 16 Jul 2020 15:49 UTC
        9 points
        Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that’s planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?
        How likely do you think it is that there’s some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?
        Wrong questions. A cluster does not need to have sharp classification boundaries in order for the cluster itself to be precisely defined, and it’s the precise definition of the cluster itself that matters.
        An even-more-simplified example: suppose we have a cluster in some dataset which we model as normal with mean 3.55 and variance 2.08. There may be points on the edge of the cluster which are ambiguously/uncertainly classified, and that’s fine. The precision of the cluster itself is not about sharp classification, it’s about precise estimation of the parameters (i.e. mean 3.55 and variance 2.08, plus however we’re quantifying normality). If our algorithm is “working correctly”, then there is an actual pattern out in the world corresponding to our cluster, and that pattern is the thing we want to point to—not any particular point within the pattern.
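        A minimal sketch of that distinction, under assumptions that are not part of the example itself: the data is simulated, there is a second, hypothetical cluster centred at 8.0 with the same variance, and both clusters have equal prior weight. The names `estimate_cluster` and `membership` are illustrative only. The parameter estimates tighten as the sample grows, while a point halfway between the two clusters stays ambiguous.

```python
import math
import random
import statistics

TRUE_MEAN, TRUE_VAR = 3.55, 2.08  # parameters from the example above
OTHER_MEAN = 8.0                  # hypothetical neighbouring cluster (assumption)


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def estimate_cluster(n, seed=0):
    """Draw n simulated points from the cluster and estimate its parameters."""
    rng = random.Random(seed)
    sample = [rng.gauss(TRUE_MEAN, math.sqrt(TRUE_VAR)) for _ in range(n)]
    return statistics.fmean(sample), statistics.variance(sample)


def membership(x, mean, var, other_mean=OTHER_MEAN):
    """Probability that x belongs to our cluster (equal priors, equal variance)."""
    ours = normal_pdf(x, mean, var)
    theirs = normal_pdf(x, other_mean, var)
    return ours / (ours + theirs)


for n in (100, 10_000, 100_000):
    mean_hat, var_hat = estimate_cluster(n)
    edge = (mean_hat + OTHER_MEAN) / 2  # point halfway between the clusters
    print(n, round(mean_hat, 3), round(var_hat, 3),
          round(membership(edge, mean_hat, var_hat), 3))
```

        However large the sample, the halfway point keeps a membership probability of about one half. What improves with more data is the estimate of the cluster's parameters, not the sharpness of the boundary.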
        Back to trees. The one true definition of trees does not unambiguously classify all objects as tree or not-tree; that is not the sense in which it is precisely defined. Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of “tree” points to that model. Assuming the human-labelling-algorithm is “working correctly”, that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.
        On to human values. (I’ll just talk about one human at the moment, because cross-human disagreements are orthogonal to the point here.) The answer to “what does this human want?” does not always need to be unambiguous—indeed it should not always be unambiguous, because that is not the actual nature of human values. Rather, I have some precisely-defined generative model for observations-involving-my-values. Assuming my algorithm is “working correctly”, there is an actual pattern out in the world corresponding to that cluster, and that pattern is the thing we want to point to. That’s not just “good enough”; pointing to that pattern (assuming it exists) is perfect alignment. That’s what “mission accomplished” looks like. It’s the thing we’re modelling when we model our own desires.
        Charlie Steiner 16 Jul 2020 21:06 UTC
        2 points
        Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of “tree” points to that model. Assuming the human-labelling-algorithm is “working correctly”, that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.
        This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
        This is a modeling assumption about humans that doesn’t have to be true. You could just as well say that in the two different worlds, I’m actually referring to two related but distinct concepts. (Or you could model me as picking things to say about trees in a way that doesn’t talk about the properties of some “concept of trees” at all.)
        The root problem is that “pointing to a real pattern” is not something humans can do in a vacuum. “I’m a great communicator, but people just don’t understand me,” as the joke goes. As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this: at the very least, you’ve got a choice of universal Turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime, where different ontologies can easily lead to different categorizations.
        johnswentworth 16 Jul 2020 21:37 UTC
        2 points
        This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
        That is not an assumption, it is an implication of the use of the concept “tree” to make predictions. For instance, if I can learn general facts about trees by examining a small number of trees, then I know that “tree” corresponds to a real pattern out in the world. This extends to logs: to the extent that a log is a tree, I can learn general facts about trees by examining logs (and vice versa), and verify what I’ve learned by looking at more trees/logs.
        Pointing to a real pattern is indeed not something humans can do in a vacuum. Fortunately we do not live in a vacuum; we live in a universe with lots of real patterns in it. Different algorithms will indeed result in somewhat different classifications/patterns learned at any given time, but we can still expect a fairly large class of algorithms to converge to the same classifications/patterns over time, precisely because they are learning from the same universe. A perfectly-aligned AI will not have a perfect model of human values at any given time, but it can update in the right direction—in some sense it’s the update-procedure which is “aligned” with the true pattern, not the model itself which is “aligned”.
        That’s why we often talk about perfectly “pointing” to human values, rather than building a perfect model of human values. It’s not about having a perfect model at any given time, it’s about “having a pointer” to the real-world pattern of human values, allowing us to do things like update our model in the right direction.
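        A toy illustration of the convergence claim, not a model of value learning: two deliberately different learners estimate the centre of the same simulated distribution. The learners, the badly wrong prior, and the reuse of the 3.55 / 2.08 cluster are all assumptions made for the sketch. Neither learner has a perfect model at any finite time, but their disagreement shrinks because both update on the same world.

```python
import random
import statistics


def shared_world(n, seed=1):
    """n observations from one simulated 'universe' (a single normal cluster)."""
    rng = random.Random(seed)
    return [rng.gauss(3.55, 2.08 ** 0.5) for _ in range(n)]


def learner_a(data, prior_mean=-10.0, prior_weight=5.0):
    """Running mean that starts from a badly wrong prior guess."""
    return (prior_mean * prior_weight + sum(data)) / (prior_weight + len(data))


def learner_b(data):
    """A different algorithm entirely: the sample median."""
    return statistics.median(data)


def disagreement(n):
    """Gap between the two learners after n shared observations."""
    data = shared_world(n)
    return abs(learner_a(data) - learner_b(data))


for n in (20, 2_000, 20_000):
    print(n, round(disagreement(n), 4))
```

        The sketch only works because the distribution is symmetric, so mean and median target the same quantity. That is the toy analogue of requiring that both algorithms are "working correctly" on a pattern that actually exists.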
        As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this: at the very least, you’ve got a choice of universal Turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime...
        I definitely do not imagine that some random architecture would get it right with realistic amounts of data. Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem—it’s exactly the sort of thing that e.g. my work on abstraction will hopefully help with.
        (Also, matching the patterns to some collection of data intended to point to the pattern is not the only way of doing things, or even a very good way given the difficulty of verification, though for purposes of this discussion it’s a fine approach to examine.)
        Charlie Steiner 16 Jul 2020 23:20 UTC
        2 points
        That is not an assumption, it is an implication of the use of the concept “tree” to make predictions.
        I would disagree in spirit—an AI can happily find a referent to the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
        Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem
        I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
        We either need to ground “performs well” in some theory of humans as approximate agents that doesn’t need to know about their values, or we need some theory that avoids the chicken-and-egg problem altogether by simultaneously learning human models and the standards to judge them by.
        johnswentworth 17 Jul 2020 2:35 UTC
        2 points
        I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
        To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I meant that the algorithm learns patterns which are actually present in the world, much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
        Probably not the best choice of words on my part; sorry for causing a tangent.
        I would disagree in spirit—an AI can happily find a referent to the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
        I’m sure it could, but I am claiming that such a thing would have worse predictive power. Roughly speaking: if there’s one notion of tree that includes saplings, and another that includes logs, then the model misses the ability to learn facts about saplings by examining logs. Conversely, if it doesn’t miss those sorts of things, then it isn’t actually behaving like a word with multiple possible referents. (I don’t actually think it’s that simple—the referent of “tree” is more than just a comparison class—but it hopefully suffices to make the point.)
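        A rough sketch of the predictive-power point, under a strong assumption: saplings and logs share one underlying property value (`TRUE_VALUE`, a made-up number), observed with noise. A model that treats "sapling-tree" as its own concept can only use the few sapling observations; a model with one "tree" concept can also use the logs. All names and numbers here are hypothetical.

```python
import random
import statistics

TRUE_VALUE = 0.6  # hypothetical property shared by everything in the "tree" pattern


def observe(rng, n, noise=0.2):
    """n noisy measurements of the shared property."""
    return [rng.gauss(TRUE_VALUE, noise) for _ in range(n)]


def sapling_errors(trials=2000, n_saplings=3, n_logs=30, seed=2):
    """Mean squared error of predicting the sapling property, split vs pooled."""
    rng = random.Random(seed)
    split_sq, pooled_sq = 0.0, 0.0
    for _ in range(trials):
        saplings = observe(rng, n_saplings)
        logs = observe(rng, n_logs)
        split_estimate = statistics.fmean(saplings)          # separate concepts
        pooled_estimate = statistics.fmean(saplings + logs)  # one "tree" concept
        split_sq += (split_estimate - TRUE_VALUE) ** 2
        pooled_sq += (pooled_estimate - TRUE_VALUE) ** 2
    return split_sq / trials, pooled_sq / trials


split_mse, pooled_mse = sapling_errors()
print(round(split_mse, 4), round(pooled_mse, 4))
```

        The pooled model wins here only because the shared-property assumption is built into the simulation. If logs and saplings did not share the property, pooling would bias the estimate, which is the sense in which the unified concept is an empirical claim rather than a free choice.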
        Charlie Steiner 17 Jul 2020 5:59 UTC
        2 points
        To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I meant that the algorithm learns patterns which are actually present in the world, much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
        Ah well. I’ll probably argue with you more about this elsewhere, then :)
        Jameson Quinn 16 Jul 2020 16:53 UTC
        1 point
        This is very well-said, but I still want to dispute the possibility of “perfect alignment”. In your clustering analogy: the very existence of clusters presupposes definitions of entities-that-correspond-to-points, dimensions-of-the-space-of-points, and measurements-of-given-points-in-given-dimensions. All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I’m aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don’t live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.
        In other words, while I agree with (and appreciate your clear expression of) your main point that it’s possible to have a well-defined category without being able to do perfect categorization, I dispute the idea that it is possible even in theory to have a perfectly-defined one.
        johnswentworth 16 Jul 2020 17:19 UTC
        2 points
        All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I’m aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don’t live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.
        This is a critical point; it’s the reason we want to point to the pattern in the territory rather than to a human’s model itself. It may be that the human is using something analogous to a normal-mixture-model, which won’t perfectly match reality. But in order for that to actually be predictive, it has to find some real pattern in the world (which may not be perfectly normal, etc). The goal is to point to that real pattern, not to the human’s approximate representation of that pattern.
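        A small sketch of a misspecified model still latching onto a real pattern. The assumptions are mine: the "world" is a simulated exponential distribution (true mean 2.0, true variance 4.0) and the "human's model" is a single normal fit by moments. The fitted normal is wrong about the shape, yet its parameters converge to real features of the data.

```python
import random
import statistics


def skewed_world(n, seed=3):
    """Simulated non-normal data: exponential with mean 2.0 and variance 4.0."""
    rng = random.Random(seed)
    return [rng.expovariate(0.5) for _ in range(n)]


def fit_normal(data):
    """Fit a normal model by matching the sample mean and variance."""
    return statistics.fmean(data), statistics.variance(data)


data = skewed_world(50_000)
mean_hat, var_hat = fit_normal(data)
model = statistics.NormalDist(mean_hat, var_hat ** 0.5)

# The approximate model assigns noticeable probability to negative values,
# which never occur in this simulated world.
print(round(mean_hat, 3), round(var_hat, 3), round(model.cdf(0.0), 3), min(data) >= 0)
```

        The mean and variance the model recovers are properties of the data-generating process, not of the normal approximation. The mismatch (mass below zero) belongs to the representation.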
        Now, two natural (and illustrative) objections to this:
        - If the human’s representation is an approximation, then there may not be a unique pattern to which their notions correspond; the “corresponding pattern” may be underdefined.
        - If we’re trying to align an AI to a human, then presumably we want the AI to use the human’s own idea of the human’s values, not some “idealized” version.
        The answer to both of these is the same: we humans often update our own notion of what our values are, in response to new information. The reality-pattern we want to point to is the pattern toward which we are updating; it’s the thing our learning-algorithm is learning about. I think this is what coherent extrapolated volition is trying to get at: it asks “what would we want if we knew more, thought faster, …”. Assuming that the human-label-algorithm is working correctly, and continues working correctly, those are exactly the sort of conditions generally associated with convergence of the human’s model to the true reality-pattern.