In this post I’ll explore each of these in turn. I’ll primarily aim to make the positive case in this post; if you have an objection that I don’t mention here, I may discuss it in the next post.
This is good to know, and I expect that this doesn’t mean you want commenters to withhold their objections until after you discuss a section of them in your next post. I personally find that writing out confusions and disagreements at the very moment they appear is most useful, instead of letting time pass and attention to the topic fade without them getting resolved.
Anyway, onto the object-level.
Premise 1 claims that, as AIs become more intelligent, their representations will become more compressed. Premise 2 claims that the simplest goals resemble squiggle-maximization. The relationship described in premise 1 may break down as AIs become arbitrarily intelligent—but if it doesn’t, then premise 2 suggests that their goals will converge toward some kind of squiggle-maximization. (Note that I’m eliding over some subtleties related to which representations exactly get compressed, which I’ll address in my next post.)
This is the key [1] paragraph in this post, and frankly, I don’t get it. Or rather, I think I get it, but it seems straightforwardly false on my reading of it, so I put serious probability on my having misunderstood it completely.
This reasoning seems to assume[2] that the “goal” of the AI is part of its “representations,” so that when the representations get more compressed, so does the goal. Put aside for now the (in my view rather compelling) case that the belief/goal distinction probably doesn’t ultimately make sense as something that carves reality at the joints into coherent, distinct concepts. The much more serious problem is that there is an implicit and rather subtle equivocation between different meanings of “representations” that makes either this inference or Premise 1 invalid.
The “representations,” in the relevant sense that makes Premise 1 worth taking seriously, are object-level, positive rather than normative internal representations of the underlying territory. But the “goal” lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive; it simply gets reinterpreted, as faithfully as possible, in the new ontology. This is one of the essences of the Orthogonality Thesis, is it not [3]? That the goal is independent of (i.e., orthogonal to, hence uncorrelated with) the factual beliefs about reality?
Put differently, the mapping from the initial ontology to the final, more “compressed” ontology does not shrink the representation of the goal before or after mapping it; it simply maps it. If it all (approximately) adds up to normality, meaning that the new ontology is capable of replicating (perhaps with more detail, granularity, or degrees of freedom) the observations of the old one [4], I expect the “relative measure” of the goal representation to stay approximately [5] the same. And more importantly, I expect the “inverse transformation” from the new ontology to the old one to map the new representation back to the old one (since the new representation is supposed to be more compressed, i.e. informationally richer than the old one, in mathematical terms I would expect the preimage of the new representation to be approximately the old one).
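To gesture at this with some (very informal, purely illustrative) notation of my own: write $\phi : S_{\text{old}} \to S_{\text{new}}$ for the translation from old-ontology descriptions to new-ontology descriptions, and $u_{\text{old}} : S_{\text{old}} \to \mathbb{R}$ for the goal as evaluated in the old ontology. The faithful reinterpretation of the goal is then whichever $u_{\text{new}}$ satisfies

$$u_{\text{new}} \circ \phi \;\approx\; u_{\text{old}},$$

i.e. the new goal agrees with the old one on everything the old ontology could already describe. Nothing in that condition shrinks the goal; it pins $u_{\text{new}}$ down to (approximately) the pullback of $u_{\text{old}}$, which is the sense in which I expect the preimage of the new representation to be roughly the old one.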
I don’t see the value systematization post as really bringing much relevant explanatory or argumentative power about this matter, either, perhaps because of its still-speculative nature?
[1] Or rather, one of the key paragraphs.

[2] I have a very hard time making heads or tails of it if it doesn’t assume what I talk about next.

[3] Well, actually, Steven Byrnes recently described the Orthogonality Thesis in his “Claim” in what seemed to me like a much more restricted, weaker form that made it more like the “Nonlinearity Thesis”, or the “Weak-correlation-at-most + tails coming apart thesis”.

[4] Such as how the small-mass, low-velocity limit of General Relativity replicates standard Newtonian mechanics.

[5] I say “approximately” because of potential issues due to stuff analogous to Wentworth’s “Pointers problem” and the way in which some (presumably small) parts of the goal in its old representation might be entirely incoherent and impossible to rescue in the new one.
I expect that this doesn’t mean you want commenters to withhold their objections until after you discuss a section of them in your next post
That’s right, please critique away. And thanks for the thoughtful comment!
This reasoning seems to assume[2] that the “goal” of the AI is part of its “representations,”
An AI needs to represent its goals somehow; therefore its goals are part of its representations. But this is just a semantic point; I dig into your substantive criticism next.
The “representations,” in the relevant sense that makes Premise 1 worth taking seriously, are object-level, positive rather than normative internal representations of the underlying territory. But the “goal” lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive; it simply gets reinterpreted, as faithfully as possible, in the new ontology.
This is one of the objections I’ll be covering in the next post. But you’re right to call me out on it, because I am assuming that goal representations also become compressed. Or, to put it another way: I am assuming that the pressure towards simplicity described in premise 1 doesn’t distinguish very much between goal representations and concept representations.
Why? Well, it’s related to a claim you make yourself: that “the belief/goal distinction probably doesn’t ultimately make sense as something that carves reality at the joints into coherent, distinct concepts”. In other words, I don’t think there is a fully clean distinction between goal representations and belief/concept representations. (In humans it often seems like the goal of achieving X is roughly equivalent to a deep-rooted belief that achieving X would be good, where “good” is a kinda fuzzy predicate that we typically don’t look at very hard.) And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.
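To illustrate the kind of non-discrimination I have in mind, here is a deliberately crude sketch (a toy setup of my own, not a claim about any particular training method or about brains): when the “goal” and the “beliefs” live in the same learned parameters, a generic simplicity pressure such as an L1 penalty or weight decay bears on both without knowing which is which.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAgent(nn.Module):
    """Toy agent whose belief-like and goal-like parameters sit in one network."""
    def __init__(self, obs_dim=16, hidden=32):
        super().__init__()
        # "belief/concept" representations
        self.world_model = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        # "goal" representation (how states get evaluated)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        return self.value_head(self.world_model(obs))

agent = ToyAgent()
obs, target = torch.randn(8, 16), torch.randn(8, 1)

task_loss = F.mse_loss(agent(obs), target)
# One simplicity penalty over *all* parameters; it does not distinguish
# which weights encode beliefs and which encode the goal.
simplicity_penalty = sum(p.abs().sum() for p in agent.parameters())
loss = task_loss + 1e-3 * simplicity_penalty
loss.backward()
```

Real goal representations are of course nothing like a single value head; the point is only that the simplicity pressure has no principled way of exempting them.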
This is one of the essences of the Orthogonality Thesis, is it not [3]? That the goal is independent of (i.e., orthogonal to, hence uncorrelated with) the factual beliefs about reality?
The orthogonality thesis was always about what agents were possible, not which were likely. It is consistent with the orthogonality thesis to say that increasingly intelligent agents have a very strong tendency to compress their representations, and that this tends to change their goals towards ones which are simple in their new ontologies (although it’s technically possible that they retain goals which are very complex in their new ontologies). Or, to put it another way: the orthogonality thesis doesn’t imply that goals and beliefs are uncorrelated (which is a very extreme claim—e.g. it implies that superintelligences are just as likely to have goals related to confused concepts like phlogiston or souls as humans are).
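Put semi-formally (my paraphrase, not a canonical statement): the thesis is roughly the existence claim

$$\forall\,(i, g) \in I \times G:\ \text{some possible agent has intelligence level } i \text{ and final goal } g \quad \text{(modulo the usual caveats)},$$

whereas the claim your objection needs is one about the distribution of agents that actually arise, something like $P(g \mid i) = P(g)$. The first does not imply the second.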
I don’t think there is a fully clean distinction between goal representations and belief/concept representations
Alright, this is making me excited for your upcoming post.
In humans it often seems like the goal of achieving X is roughly equivalent to a deep-rooted belief that achieving X would be good, where “good” is a kinda fuzzy predicate that we typically don’t look at very hard
I like this framing a lot. I like it so much, in fact, that I intend to use it in my upcoming long post compiling arguments against the theoretical soundness and practical viability of CEV.
The orthogonality thesis was always about what agents were possible, not which were likely. [...] the orthogonality thesis doesn’t imply that goals and beliefs are uncorrelated
This is related to what I wrote in footnote 3 of my previous comment to you. But anyway, would you agree that this is an absolutely terrible naming of this concept? When I type “orthogonal” into Google, the very first thing that pops up is a list of definitions containing “[2] STATISTICS (of variates) statistically independent.” And even if people aren’t meant to be familiar with this definition, the most common and basic use of orthogonal, namely in linear algebra, implies that two vectors are not only linearly independent, but also that they are as far from pointing in a “related” direction as mathematically possible!
It completely boggles the mind that “orthogonality” was the word chosen as the encoding of these ideas.
Anyway, I left the most substantive and important comment for last.
And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.
The thing about this is that you don’t seem to be currently undergoing the type of ontological crisis or massive shift in capabilities that would be analogous to an AI getting meaningfully more intelligent due to algorithmic improvements or increased compute or data (if you actually are, godspeed!).
So would you argue that this type of goal simplification and compression happens organically and continuously even in the absence of such a “phase transition”? I have a non-rigorous feeling that this argument would prove too much by implying more short-term modification of human desires than we actually observe in real life.
Relatedly, would you say that your moral goals are simpler now than they were, say, back when you were a child? I am pretty sure that the answer, at least for me, is “definitely not,” and that basically every single time I have grown “wiser” and had my belief system meaningfully altered, I came out of that process with a deeper appreciation for the complexity of life and for the intricacies and details of what I care about.
I’m generally confused by the argument here.

As we examine successively more intelligent agents and their representations, the representation of any particular thing will perhaps be more compressed, but also, and importantly, more intelligent agents represent things that less intelligent agents don’t represent at all. I’m more intelligent than a mouse, but I wouldn’t say I have a more compressed representation of differential calculus than a mouse does. Terry Tao is likely more intelligent than I am, and likely has a more compressed representation of differential calculus than I do, but he also has representations of a bunch of other mathematics I can’t represent at all, so the overall complexity of his representations in total is plausibly higher.
Why wouldn’t the same thing happen for goals? I’m perfectly willing to say I’m smarter than a dog and a dog is smarter than a paramecium, but it sure seems like the dog’s goals are more complex than the paramecium’s, and mine are more complex than the dog’s. Any given fixed goal might have a more compressed representation in the more intelligent animal (I’m not sure it does, but that’s the premise so let’s accept it), but the set of things being represented is also increasing in complexity across organisms. Driving the point home, Terry Tao seems to have goals of proving theorems I don’t even understand the statement of, and these seem like complex goals to me.
So overall I’m not following from the premises to the conclusions. I wish I could make this sharper. Help welcome.
I think what you’re saying just makes a lot of sense, honestly.
I suspect one possible counterargument is that, just as more intelligent agents with more compressed models can represent complex goals more compactly, they are also capable of drawing ever-finer distinctions that allow them to identify possible goals that have very short encodings in the new ontology, but which don’t make sense at all as stand-alone, mostly-coherent targets in the old ontology (because it is simply too weak to represent them). So it’s not just that goals get compressed, but also that new possible kinds of goals (many of them really simple) get added to the game.
But this process should also allow new goals to arise that have ~ any arbitrary encoding length in the new ontology, because it should be just as easy to draw new, subtle distinctions inside a complex goal (which outputs a new medium- or large-complexity goal) as it would be inside a really simple goal (which outputs the type of new super-small-complexity goal that the previous paragraph talks about). So I don’t think this counterargument ultimately works, and I suspect it shouldn’t change our expectations in any meaningful way.
I’m skeptical that the naming is bad; it fits with that definition and the common understanding of the word. The Orthogonality Thesis is saying that the two qualities (intelligence and goals/values) are not necessarily related, which may seem trivial nowadays, but there used to be plenty of people going “if the AI becomes smart, even if it is weird, it will be moral towards humans!” through reasoning of the form “smart → not dumb goals like paperclips”.
There’s structure imposed on what minds actually get created, based on what architectures are used, what humans train the AI on, etc. Just as two vectors can be orthogonal in R^2 while the actual points you plot in the space are correlated.
With what definition? The one most applicable here, dealing with random variables (relative to our subjective uncertainty), says “random variables that are independent”. Independence implies zero correlation, even if the converse doesn’t hold.
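For concreteness, the standard facts here: for random variables $X, Y$ with finite variances,

$$X \perp\!\!\!\perp Y \;\Longrightarrow\; \operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y] = 0,$$

but not conversely: the usual counterexample is $X \sim \mathcal{N}(0, 1)$ and $Y = X^2$, where $\operatorname{Cov}(X, Y) = \mathbb{E}[X^3] = 0$ even though $Y$ is a deterministic function of $X$.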
Just as two vectors can be orthogonal in R^2 while the actual points you plot in the space are correlated.
This is totally false as a matter of math if you use the most common definition of orthogonality in this context. I do agree that what you are saying could be correct if you do not think of orthogonality that way and instead simply look at it in terms of the entries of the vectors. But then you enter the realm of trying to capture the “goals” and “beliefs” as specific Euclidean vectors, and I don’t think that is the best idea for generating intuition: one of the points of the Orthogonality Thesis seems to be to abstract away from the specific representation you choose for intelligence and values (which can bias you one way or another) and to focus instead on the broad, somewhat informal conclusion.
Ah, I rewrote my comment a few times and lost what I was referencing. I was originally referencing the geometric meaning (as an alternative to your statistical definition): two vectors at a right angle to each other.
But the statistical understanding works, from what I can tell? You have your initial space with extreme uncertainty, and the orthogonality thesis simply states that (intelligence, goals) are not related: you can pair any level of intelligence with any goal. They are independent of each other at this most basic level. This is the orthogonality thesis.
Then, in practice, you condition your probability distribution over that space on your more specific knowledge about what minds will be created, and how they’ll be created. You can consider this as giving you a new space, moving probability around.
As an absurd example: suppose height and weight of creatures were uncorrelated in principle; once we update on “this is an athletic human”, they become correlated in that new distribution! This is what I was trying to get at with my R^2 example, but apologies that I was unclear, since I was still coming at it from a frame of normal geometry. (Think: each axis is an independent normal distribution, but then you condition on some knowledge that restricts them such that they become correlated.)
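A quick numerical sketch of that parenthetical (purely illustrative): two independent normals acquire a clear correlation once you condition on a selection criterion that couples them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # one axis, independent of the other by construction
y = rng.standard_normal(1_000_000)   # the other axis

print(np.corrcoef(x, y)[0, 1])       # approximately 0

# Condition on knowledge that couples the two axes (the "athletic human" move):
selected = x + y > 2.0
print(np.corrcoef(x[selected], y[selected])[0, 1])  # clearly nonzero (negative here)
```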
I agree that it is an informal argument and that pinning it down to very detailed specifics isn’t necessary or helpful at this level; I’m merely attempting to explain why orthogonality works. It is a statement about the basic state of minds before we consider details, and they are orthogonal there, because it is an argumentative response to assumptions of the form “smart → not dumb goals”.