Is an acorn a tree? A seedling? What if the seedling is just a sprouted acorn in a plastic bag, versus a sprouted acorn that’s planted in the ground? A dead, fallen-over tree? What about a big unprocessed log? The same log but with its bark stripped off?
How likely do you think it is that there’s some culture out there that disagrees with you about at least two of these? How likely is it that you would disagree with yourself, given different contextual cues?
Wrong questions. A cluster does not need to have sharp classification boundaries in order for the cluster itself to be precisely defined, and it's the precise definition of the cluster itself that matters.
An even-more-simplified example: suppose we have a cluster in some dataset which we model as normal with mean 3.55 and variance 2.08. There may be points on the edge of the cluster which are ambiguously/uncertainly classified, and that’s fine. The precision of the cluster itself is not about sharp classification, it’s about precise estimation of the parameters (i.e. mean 3.55 and variance 2.08, plus however we’re quantifying normality). If our algorithm is “working correctly”, then there is an actual pattern out in the world corresponding to our cluster, and that pattern is the thing we want to point to—not any particular point within the pattern.
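To make the cluster analogy concrete, here's a minimal Python sketch (the second cluster and all numbers besides 3.55 and 2.08 are invented for illustration): the cluster's parameters can be estimated precisely even while membership of points near the boundary stays genuinely ambiguous.

```python
import math
import random
import statistics

random.seed(0)

# Toy data: a cluster near mean 3.55, variance 2.08 (the example in the text),
# plus a second, well-separated cluster so that classification is meaningful.
cluster_a = [random.gauss(3.55, math.sqrt(2.08)) for _ in range(10_000)]
cluster_b = [random.gauss(12.0, 1.0) for _ in range(10_000)]

# Estimating the cluster's parameters is precise, because it pools all the data...
mean_a = statistics.fmean(cluster_a)
var_a = statistics.pvariance(cluster_a)

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_in_a(x):
    """Posterior probability that x belongs to cluster A
    (equal priors; cluster B's parameters assumed known, for simplicity)."""
    pa = normal_pdf(x, mean_a, var_a)
    pb = normal_pdf(x, 12.0, 1.0)
    return pa / (pa + pb)

print(round(mean_a, 2), round(var_a, 2))  # close to 3.55 and 2.08
print(p_in_a(3.5))   # deep inside cluster A: unambiguous
print(p_in_a(8.5))   # near the boundary: genuinely ambiguous, and that's fine
```

The precision lives in `mean_a` and `var_a`, not in the soft membership numbers; the boundary point stays fuzzy no matter how much data we collect.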
Back to trees. The one true definition of trees does not unambiguously classify all objects as tree or not-tree; that is not the sense in which it is precisely defined. Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of “tree” points to that model. Assuming the human-labelling-algorithm is “working correctly”, that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.
On to human values. (I’ll just talk about one human at the moment, because cross-human disagreements are orthogonal to the point here.) The answer to “what does this human want?” does not always need to be unambiguous—indeed it should not always be unambiguous, because that is not the actual nature of human values. Rather, I have some precisely-defined generative model for observations-involving-my-values. Assuming my algorithm is “working correctly”, there is an actual pattern out in the world corresponding to that cluster, and that pattern is the thing we want to point to. That’s not just “good enough”; pointing to that pattern (assuming it exists) is perfect alignment. That’s what “mission accomplished” looks like. It’s the thing we’re modelling when we model our own desires.
Rather, there is some precisely-defined generative model for observations-of-trees, and the concept of “tree” points to that model. Assuming the human-labelling-algorithm is “working correctly”, that generative model matches an actual pattern in the world, and the precision of the model follows from the pattern. None of this requires unambiguous classification of logs as tree/not-tree.
This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
This is a modeling assumption about humans that doesn’t have to be true. You could just as well say that in the two different worlds, I’m actually referring to two related but distinct concepts. (Or you could model me as picking things to say about trees in a way that doesn’t talk about the properties of some “concept of trees” at all.)
The root problem is that “pointing to a real pattern” is not something humans can do in a vacuum. “I’m a great communicator, but people just don’t understand me,” as the joke goes. As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this—at the very least, you’ve got a choice of universal Turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime, where different ontologies can easily lead to different categorizations.
This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
That is not an assumption, it is an implication of the use of the concept “tree” to make predictions. For instance, if I can learn general facts about trees by examining a small number of trees, then I know that “tree” corresponds to a real pattern out in the world. This extends to logs: to the extent that a log is a tree, I can learn general facts about trees by examining logs (and vice versa), and verify what I’ve learned by looking at more trees/logs.
Pointing to a real pattern is indeed not something humans can do in a vacuum. Fortunately we do not live in a vacuum; we live in a universe with lots of real patterns in it. Different algorithms will indeed result in somewhat different classifications/patterns learned at any given time, but we can still expect a fairly large class of algorithms to converge to the same classifications/patterns over time, precisely because they are learning from the same universe. A perfectly-aligned AI will not have a perfect model of human values at any given time, but it can update in the right direction—in some sense it’s the update-procedure which is “aligned” with the true pattern, not the model itself which is “aligned”.
That’s why we often talk about perfectly “pointing” to human values, rather than building a perfect model of human values. It’s not about having a perfect model at any given time, it’s about “having a pointer” to the real-world pattern of human values, allowing us to do things like update our model in the right direction.
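A toy illustration of “it’s the update-procedure which is aligned” (everything here is invented for illustration): two quite different update rules, fed observations from the same world, converge on the same underlying pattern, even though their models disagree early on.

```python
import random

random.seed(1)
TRUE_MEAN = 3.55  # the "real pattern" both learners are pointed at

# Two different update procedures, same stream of observations.
running_mean = 0.0                    # plain running average
shrunk_mean, prior_obs = 0.0, 10.0    # regularized: a prior pulling toward 0

for n in range(1, 100_001):
    x = random.gauss(TRUE_MEAN, 2.0)
    running_mean += (x - running_mean) / n
    shrunk_mean += (x - shrunk_mean) / (n + prior_obs)

# Early on the two estimates differ, but both update toward the same pattern.
print(running_mean, shrunk_mean)  # both near 3.55
```

Neither learner's model is ever exactly right, but both pointers are aimed at the same real-world quantity, which is what matters.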
As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this—at the very least, you’ve got a choice of universal turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime...
I definitely do not imagine that some random architecture would get it right with realistic amounts of data. Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem—it’s exactly the sort of thing that e.g. my work on abstraction will hopefully help with.
(Also, matching the patterns to some collection of data intended to point to the pattern is not the only way of doing things, or even a very good way given the difficulty of verification, though for purposes of this discussion it’s a fine approach to examine.)
That is not an assumption, it is an implication of the use of the concept “tree” to make predictions.
I would disagree in spirit—an AI can happily find a referent to the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem
I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
We either need to ground “performs well” in some theory of humans as approximate agents that doesn’t need to know about their values, or we need some theory that avoids the chicken-and-egg problem altogether by simultaneously learning human models and the standards to judge them by.
I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I intended to mean that the algorithm learns patterns which are actually present in the world—much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
Probably not the best choice of words on my part; sorry for causing a tangent.
I would disagree in spirit—an AI can happily find a referent to the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
I’m sure it could, but I am claiming that such a thing would have worse predictive power. Roughly speaking: if there’s one notion of tree that includes saplings, and another that includes logs, then the model misses the ability to learn facts about saplings by examining logs. Conversely, if it doesn’t miss those sorts of things, then it isn’t actually behaving like a word with multiple possible referents. (I don’t actually think it’s that simple—the referent of “tree” is more than just a comparison class—but it hopefully suffices to make the point.)
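A minimal sketch of the predictive-power claim (the shared quantity and all numbers are invented for illustration): if “tree” is one concept covering both saplings and logs, evidence about logs transfers to saplings, and the pooled estimate beats the split one on average.

```python
import random
import statistics

random.seed(2)
SHARED_FACT = 0.7  # some property trees share, regardless of sapling/log form

def trial():
    saplings = [random.gauss(SHARED_FACT, 0.3) for _ in range(5)]
    logs = [random.gauss(SHARED_FACT, 0.3) for _ in range(5)]
    # One concept "tree": pool the evidence from both.
    pooled_err = abs(statistics.fmean(saplings + logs) - SHARED_FACT)
    # Two unrelated concepts: evidence about logs can't inform saplings.
    split_err = abs(statistics.fmean(saplings) - SHARED_FACT)
    return pooled_err, split_err

results = [trial() for _ in range(10_000)]
pooled_avg = statistics.fmean(e for e, _ in results)
split_avg = statistics.fmean(e for _, e in results)
print(pooled_avg < split_avg)  # pooling wins on average
```

Of course this only holds to the extent that the property really is shared; the model with context-dependent referents forgoes exactly this kind of transfer.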
To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I intended to mean that the algorithm learns patterns which are actually present in the world—much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
Ah well. I’ll probably argue with you more about this elsewhere, then :)
This is very well-said, but I still want to dispute the possibility of “perfect alignment”. In your clustering analogy: the very existence of clusters presupposes definitions of entities-that-correspond-to-points, dimensions-of-the-space-of-points, and measurements-of-given-points-in-given-dimensions. All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I’m aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don’t live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.
In other words, while I agree with (and appreciate your clear expression of) your main point that it’s possible to have a well-defined category without being able to do perfect categorization, I dispute the idea that it is possible even in theory to have a perfectly-defined one.
All of those definitions involve imperfect modeling assumptions and simplifications. Your analogy also assumes that a normal-mixture-model is capable of perfectly capturing reality; I’m aware that this is provably asymptotically true for an infinite-cluster Dirichlet process mixture, but we don’t live in asymptopia and in reality it is effectively yet another strong assumption that holds at best weakly.
This is a critical point; it’s the reason we want to point to the pattern in the territory rather than to a human’s model itself. It may be that the human is using something analogous to a normal-mixture-model, which won’t perfectly match reality. But in order for that to actually be predictive, it has to find some real pattern in the world (which may not be perfectly normal, etc). The goal is to point to that real pattern, not to the human’s approximate representation of that pattern.
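A quick sketch of this point (the distribution and all numbers are invented for illustration): even a misspecified normal model, fit to skewed data, locks onto a real feature of the territory, and it's that feature, not the normal model itself, that we want to point at.

```python
import math
import random
import statistics

random.seed(3)

# The "territory": a skewed, decidedly non-Gaussian pattern (lognormal).
def observe(n):
    return [random.lognormvariate(0.0, 0.5) for _ in range(n)]

true_mean = math.exp(0.125)  # mean of lognormal(0, 0.5) is exp(sigma^2 / 2)

# The "map": a normal model. It's misspecified -- the data are skewed --
# but its fitted mean still converges on a real feature of the territory.
small_fit = statistics.fmean(observe(100))
large_fit = statistics.fmean(observe(100_000))

print(small_fit, large_fit, true_mean)  # fits converge on the pattern's mean
```

The normal assumption never becomes true, yet more data drives the fitted parameter toward a fact about the world; pointing at that fact is the goal.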
Now, two natural (and illustrative) objections to this:
1. If the human’s representation is an approximation, then there may not be a unique pattern to which their notions correspond; the “corresponding pattern” may be underdefined.
2. If we’re trying to align an AI to a human, then presumably we want the AI to use the human’s own idea of the human’s values, not some “idealized” version.
The answer to both of these is the same: we humans often update our own notion of what our values are, in response to new information. The reality-pattern we want to point to is the pattern toward which we are updating; it’s the thing our learning-algorithm is learning about. I think this is what coherent extrapolated volition is trying to get at: it asks “what would we want if we knew more, thought faster, …”. Assuming that the human-labelling-algorithm is working correctly, and continues working correctly, those are exactly the sort of conditions generally associated with convergence of the human’s model to the true reality-pattern.