This contains the ad-hoc assumption that if there’s one history in which I’ll say logs are trees, and another history in which I won’t, then what I’m doing is approximating a “real concept” in which logs are sorta-trees.
That is not an assumption, it is an implication of the use of the concept “tree” to make predictions. For instance, if I can learn general facts about trees by examining a small number of trees, then I know that “tree” corresponds to a real pattern out in the world. This extends to logs: to the extent that a log is a tree, I can learn general facts about trees by examining logs (and vice versa), and verify what I’ve learned by looking at more trees/logs.
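To make that concrete, here's a toy sketch with entirely made-up numbers: fit a simple regularity (radius vs. ring count, say) on a handful of tree samples, then check that the same fit predicts log samples. To the extent that prediction works, "tree" and "log" are tracking one real pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for a regularity shared by trees and logs:
# cross-section radius grows roughly linearly with ring count.
def sample(n, noise=0.3):
    rings = rng.uniform(5, 80, size=n)
    radius = 0.25 * rings + rng.normal(0, noise, size=n)
    return rings, radius

# "Examine a small number of trees": fit the regularity on 10 standing trees.
tree_rings, tree_radius = sample(10)
slope, intercept = np.polyfit(tree_rings, tree_radius, 1)

# "Verify what I've learned by looking at logs": the same fit should
# predict cut logs, to the extent that a log is a tree.
log_rings, log_radius = sample(200)
pred = slope * log_rings + intercept
rmse = np.sqrt(np.mean((pred - log_radius) ** 2))
print(f"fit on 10 trees, tested on 200 logs: RMSE = {rmse:.2f}")
```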
Pointing to a real pattern is indeed not something humans can do in a vacuum. Fortunately we do not live in a vacuum; we live in a universe with lots of real patterns in it. Different algorithms will indeed result in somewhat different classifications/patterns learned at any given time, but we can still expect a fairly large class of algorithms to converge to the same classifications/patterns over time, precisely because they are learning from the same universe. A perfectly-aligned AI will not have a perfect model of human values at any given time, but it can update in the right direction—in some sense it’s the update-procedure which is “aligned” with the true pattern, not the model itself which is “aligned”.
That’s why we often talk about perfectly “pointing” to human values, rather than building a perfect model of human values. It’s not about having a perfect model at any given time, it’s about “having a pointer” to the real-world pattern of human values, allowing us to do things like update our model in the right direction.
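As a minimal illustration of "the update-procedure is aligned, not the model": the two learners below start from wildly different guesses and use different update rules, yet both move toward the same (made-up) true value as observations from the shared world accumulate. The pointer here is the update loop, not either learner's current estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 3.7                                   # the real pattern out in the world
data = true_value + rng.normal(0, 1.0, size=500)   # shared observations

# Learner A: running average (the first observation overwrites the initial guess).
# Learner B: small-step online gradient descent on squared error.
est_a, est_b, lr = 20.0, -5.0, 0.05
for t, x in enumerate(data, start=1):
    est_a += (x - est_a) / t          # incorporate the t-th observation into the mean
    est_b += lr * (x - est_b)         # gradient step toward the observation
    if t in (1, 10, 100, 500):
        print(f"n={t:4d}  learner A: {est_a:6.2f}  learner B: {est_b:6.2f}")

# Neither starting model was "aligned" with the true value; what matters is
# that each update procedure keeps moving its model toward the same pattern.
```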
As far as I can tell, what you mean is that you’re envisioning an AI that learns about patterns in the world, and then matches those patterns to some collection of data that it’s been told to assume is “pointing to a pattern.” And there is no unique scheme for this—at the very least, you’ve got a choice of universal Turing machine, as well as a free parameter describing the expected human level of abstraction. And this isn’t a case where any choice will do, because we’re in the limited-data regime...
I definitely do not imagine that some random architecture would get it right with realistic amounts of data. Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem—it’s exactly the sort of thing that e.g. my work on abstraction will hopefully help with.
(Also, matching the patterns to some collection of data intended to point to the pattern is not the only way of doing things, or even a very good way given the difficulty of verification, though for purposes of this discussion it’s a fine approach to examine.)
That is not an assumption, it is an implication of the use of the concept “tree” to make predictions.
I would disagree in spirit—an AI can happily find a referent for the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
Picking an architecture which matches the structure of our universe closely enough to perform well with limited data is a key problem
I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
We either need to ground “performs well” in some theory of humans as approximate agents that doesn’t need to know about their values, or we need some theory that avoids the chicken-and-egg problem altogether by simultaneously learning human models and the standards to judge them by.
I hope this is where we can start agreeing. Because the problem isn’t just finding something that performs well according to a known scoring rule. We don’t quite know how to implement the notion “this method for learning human values performs well” on a computer without basically already referring to some notion of human values for “performs well.”
To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I intended to mean that the algorithm learns patterns which are actually present in the world—much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
Probably not the best choice of words on my part; sorry for causing a tangent.
I would disagree in spirit—an AI can happily find a referent for the “tree” token that depends on context in a way that works like a word with multiple possible definitions.
I’m sure it could, but I am claiming that such a thing would have worse predictive power. Roughly speaking: if there’s one notion of tree that includes saplings, and another that includes logs, then the model misses the ability to learn facts about saplings by examining logs. Conversely, if it doesn’t miss those sorts of things, then it isn’t actually behaving like a word with multiple possible referents. (I don’t actually think it’s that simple—the referent of “tree” is more than just a comparison class—but it hopefully suffices to make the point.)
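A toy version of that claim, again with fabricated numbers: a model that pools saplings and logs under one concept predicts scarce log data better than a model that treats “log” as a separate concept and can only learn from the few log samples available.

```python
import numpy as np

rng = np.random.default_rng(2)
TRUE_DENSITY = 0.6   # made-up property shared by the single underlying pattern
NOISE = 0.2

def trial():
    # Plenty of sapling measurements, very few log measurements.
    saplings = TRUE_DENSITY + rng.normal(0, NOISE, size=200)
    logs_train = TRUE_DENSITY + rng.normal(0, NOISE, size=2)
    logs_test = TRUE_DENSITY + rng.normal(0, NOISE, size=50)

    pooled = np.mean(np.concatenate([saplings, logs_train]))  # one concept covering both
    split = np.mean(logs_train)                               # "log" as a separate concept

    return (np.mean((logs_test - pooled) ** 2),
            np.mean((logs_test - split) ** 2))

results = np.array([trial() for _ in range(2000)])
print(f"avg held-out MSE on logs -- pooled concept: {results[:, 0].mean():.4f}, "
      f"split concept: {results[:, 1].mean():.4f}")
```

Averaged over many trials, the pooled estimate wins because it gets to borrow statistical strength from the sapling data; the split model pays for treating the two referents as unrelated.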
To clarify, when I said “performs well”, I did not mean “learns human values well”, nor did I have any sort of scoring rule in mind. I intended to mean that the algorithm learns patterns which are actually present in the world—much like earlier when I talked about “the human-labelling-algorithm ‘working correctly’”.
Ah well. I’ll probably argue with you more about this elsewhere, then :)