Richard, I still don’t get it, and I think my objections in the comments on the initial post (1, 2), alongside those of rif a. sauros, remain correct. More specifically, there seems to be a very misleading equivocation going on regarding what “simpler” means. I think it’s crucial to emphasize that “simple” is a 2-place word, but your argument (at least when written in non-rigorous, non-mathematical terms) treats it as if it were a 1-place word, and this is what is causing the confusion.
Consider an agent that gets a “boost” f from an ontology O1, with the fuzzy-boundary representation of possible belief/goal pairs (B1,G1), to an ontology O2 with a new set of (still probably fuzzy-boundary) pairs (B2,G2), such that O2 corresponds to more “intelligence”, meaning it compresses map representations of the underlying territory, in accordance with Prediction = Compression.
The first section of this post argues that, despite the simplicity-speed tradeoff and other related problems, this change will nonetheless likely compress the beliefs, meaning that any belief B∈B1 will be mapped to a belief f(B)∈B2 that requires fewer bits for the agent to identify, which we can (roughly) think of as having a smaller (ontology-specific analogue of) K-complexity: Kϕ(f(B),O2)<Kϕ(B,O1). I think this is correct.
The second section argues that, because there is no clear belief/goal boundary and because the returns to compression remain as relevant for goals as they are for beliefs, the same will happen to the goals. This means that any goal G∈G1 will likely be mapped to a goal f(G)∈G2 that requires fewer bits for the agent to identify, which we can (roughly) think of as having a smaller (ontology-specific analogue of) K-complexity: Kϕ(f(G),O2)<Kϕ(G,O1). I think this is also correct.
Finally, the third section argues that this monotonically decreasing process will likely not get stuck in local optima and should instead converge to as small a representation size as possible. I’m not fully convinced of this, but I will accept it for now.
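(For reference, here is a rough LaTeX restatement of the two compression claims above, with explicit subscripts; it is just a transcription of the inequalities already stated, not an additional claim.)

```latex
% The two compression claims from the first two sections, written with
% explicit subscripts. f : O_1 -> O_2 is the "boost" mapping; K_phi(X, O) is
% the ontology-specific analogue of the K-complexity of X inside ontology O.
\begin{align*}
  \text{Beliefs:} \quad & \forall B \in B_1:\; K_{\phi}\!\left(f(B),\, O_2\right) < K_{\phi}\!\left(B,\, O_1\right) \\
  \text{Goals:}   \quad & \forall G \in G_1:\; K_{\phi}\!\left(f(G),\, O_2\right) < K_{\phi}\!\left(G,\, O_1\right)
\end{align*}
```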
Alright, so we’ve established that Kϕ(f(G),O2) will get really small, and this means that the goal is really compressed and simple. That is like a squiggle-maximizer (as you wrote, AIs that attempt to fill the universe with some very low-level pattern that’s meaningless to humans, e.g., “molecular squiggles” of a certain shape), right?
No. This is where the equivocation comes in. The simplicity of a goal is inherently dependent on the ontology you use to view it through: while Kϕ(f(G),O2)<Kϕ(G,O1) is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the “essence” of the goal simplifies; instead, it’s merely because it gets access to a more powerful ontology that has more detail, granularity, and degrees of freedom. If you try to view f(G) in O1 instead of O2, meaning you look at the preimage f−1[f(G)], this should approximately be the same as G: your argument establishes no reason for us to think that there is any force pulling the goal itself, as opposed to its representation, to be made smaller. As I wrote earlier:
The “representations,” in the relevant sense that makes Premise 1 worth taking seriously, are object-level, positive rather than normative internal representations of the underlying territory. But the “goal” lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive; it simply gets reinterpreted, as faithfully as possible, in the new ontology. [...] That the goal is independent (i.e., orthogonal, implying uncorrelated) of the factual beliefs about reality.
Put differently, the mapping from the initial ontology to the final, more “compressed” ontology does not shrink the representation of the goal before or after mapping it; it simply maps it. If it all (approximately) adds up to normality, meaning that the new ontology is capable of replicating (perhaps with more detail, granularity, or degrees of freedom) the observations of the old one [4], I expect the “relative measure” of the goal representation to stay approximately [5] the same. And more importantly, I expect the “inverse transformation” from the new ontology to the old one to map the new representation back to the old one (since the new representation is supposed to be more compressed, i.e. informationally richer than the old one, in mathematical terms I would expect the preimage of the new representation to be approximately the old one).
[4] Such as how the small-mass, low-velocity limit of General Relativity replicates standard Newtonian mechanics.
[5] I say “approximately” because of potential issues due to stuff analogous to Wentworth’s “Pointers problem” and the way in which some (presumably small) parts of the goal in its old representation might be entirely incoherent and impossible to rescue in the new one.
Imagine the following scenario for illustrative purposes: a (dumb) AI has in front of it the integers from 1 to 10, and its goal is to select a single number among them that is either 2, 4, 6, 8, or 10. Now the AI gets the “ontology boost” and its understanding of its goal gets more compressed and simpler: it needs to select one of the even numbers. Is this a simpler goal?
Well, from one perspective, yes: the boosted AI has in its world-model a representation of the goal that requires fewer bits. But from the more important perspective, no: the goal hasn’t changed, and if you map “evenness” back into the more primitive ontology of the unboosted AI, you get the same goal. So, from the perspective of the unboosted AI, the goal of the boosted one is not any simpler; it’s just smart enough to represent the goal with fewer bits.
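Here is a minimal Python sketch of this toy scenario (using description-string length as a crude stand-in for the ontology-specific K-complexity; the encodings are obviously just illustrative):

```python
# Toy model of the 1-to-10 example: the same goal, represented in two ontologies.
# "Representation size" is crudely proxied by the length of the description string.

DOMAIN = set(range(1, 11))          # the integers 1..10 the AI chooses from

# Unboosted ontology O1: the goal is an explicit enumeration.
goal_repr_o1 = "{2,4,6,8,10}"
goal_extension_o1 = {2, 4, 6, 8, 10}

# Boosted ontology O2: the agent now has the concept "even",
# so the same goal can be identified with far fewer symbols.
goal_repr_o2 = "even"

def satisfies_o2(n: int) -> bool:
    """The boosted agent's compressed test for the goal."""
    return n % 2 == 0

# The representation got simpler...
assert len(goal_repr_o2) < len(goal_repr_o1)

# ...but pulling "even" back into the primitive ontology (the preimage over the
# domain) recovers exactly the goal the unboosted AI already had.
preimage = {n for n in DOMAIN if satisfies_o2(n)}
assert preimage == goal_extension_o1
print("same goal, fewer bits:", sorted(preimage))
```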
So goals that seem simple to humans (in our faulty ontology), or goals that seem like they would be relatively simpler compared to the rest in a more advanced ontology (like the squiggle-maximizer), are of a completely different kind of “simple” than what your argument shows: the AI doesn’t look through the set of goals to pick the one that is simplest (beware the Orthogonality Thesis, as in our previous exchange); it just simplifies ~everything. That kind of goal simplification says more about the ontology than it does about the goal.
You also said earlier, in response to my comment:

And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.
This still equivocates in the same way between the different meanings of “simple”, but let’s set that aside for now. I would be curious what your response would be to what rif a. sauros and I said in reply:
sunwillrise: The thing about this is that you don’t seem to be currently undergoing the type of ontological crisis or massive shift in capabilities that would be analogous to an AI getting meaningfully more intelligent due to algorithmic improvements or increased compute or data (if you actually are, godspeed!).
So would you argue that this type of goal simplification and compression happens organically and continuously even in the absence of such a “phase transition”? I have a non-rigorous feeling that this argument would prove too much by implying more short-term modification of human desires than we actually observe in real life.
Relatedly, would you say that your moral goals are simpler now than they were, say, back when you were a child? I am pretty sure that the answer, at least for me, is “definitely not,” and that basically every single time I have grown “wiser” and had my belief system meaningfully altered, I came out of that process with a deeper appreciation for the complexity of life and for the intricacies and details of what I care about.
rif a. sauros: As we examine successively more intelligent agents and their representations, the representation of any particular thing will perhaps be more compressed, but also and importantly, more intelligent agents represent things that less intelligent agents don’t represent at all. I’m more intelligent than a mouse, but I wouldn’t say I have a more compressed representation of differential calculus than a mouse does. Terry Tao is likely more intelligent than I am, likely has a more compressed representation of differential calculus than I do, but he also has representations of a bunch of other mathematics I can’t represent at all, so the overall complexity of his representations in total is plausibly higher.
Why wouldn’t the same thing happen for goals? I’m perfectly willing to say I’m smarter than a dog and a dog is smarter than a paramecium, but it sure seems like the dog’s goals are more complex than the paramecium’s, and mine are more complex than the dog’s. Any given fixed goal might have a more compressed representation in the more intelligent animal (I’m not sure it does, but that’s the premise so let’s accept it), but the set of things being represented is also increasing in complexity across organisms. Driving the point home, Terry Tao seems to have goals of proving theorems I don’t even understand the statement of, and these seem like complex goals to me.
Thanks for the extensive comment! I’m finding this discussion valuable. Let me start by responding to the first half of your comment, and I’ll get to the rest later.
The simplicity of a goal is inherently dependent on the ontology you use to view it through: while Kϕ(f(G),O2)<Kϕ(G,O1) is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the “essence” of the goal simplifies; instead, it’s merely because it gets access to a more powerful ontology that has more detail, granularity, and degrees of freedom. If you try to view f(G) in O1 instead of O2, meaning you look at the preimage f−1[f(G)], this should approximately be the same as G: your argument establishes no reason for us to think that there is any force pulling the goal itself, as opposed to its representation, to be made smaller.
One way of framing our disagreement: I’m not convinced that the f operation makes sense as you’ve defined it. That is, I don’t think it can both be invertible and map to goals with low complexity in the new ontology.
Consider a goal that someone from the past used to have, which now makes no sense in your ontology—for example, the goal of reaching the edge of the earth, for someone who thought the earth was flat. What does this goal look like in your ontology? I submit that it looks very complicated, because your ontology is very hostile to the concept of the “edge of the earth”. As soon as you try to represent the hypothetical world in which the earth is flat (which you need to do in order to point to the concept of its “edge”), you now have to assume that the laws of physics as you know them are wrong; that all the photos from space were faked; that the government is run by a massive conspiracy; etc. Basically, in order to represent this goal, you have to set up a parallel hypothetical ontology (or in your terminology, f(G) needs to encode a lot of the content of O1). Very complicated!
I’m then claiming that whatever force pushes our ontologies to simplify also pushes us away from using this sort of complicated construction to represent our transformed goals. Instead, the most natural thing to do is to adapt the goal in some way that ends up being simple in your new ontology. For example, you might decide that the most natural adaptation of “reaching the edge of the earth” is “going into space”; or maybe it’s “reaching the poles”; or maybe it’s “pushing the frontiers of human exploration” in a more metaphorical sense. Importantly, under this type of transformation, many different goals from the old ontology will end up being mapped to simple concepts in the new ontology (like “going into space”), and so it doesn’t match your definition of f.
All of this still applies (but less strongly) to concepts that are not incoherent in the new ontology, but rather just messy. E.g. suppose you had a goal related to “air”, back when you thought air was a primitive substance. Now we know that air is about 78% nitrogen, 21% oxygen, and 0.93% argon. Okay, so that’s one way of defining “air” in our new ontology. But this definition of air has a lot of messy edge cases—what if the ratios are slightly off? What if you have the same ratios, but much different pressures or temperatures? Etc. If you have to arbitrarily classify all these edge cases in order to pursue your goal, then your goal has now become very complex. So maybe instead you’ll map your goal to the idea of a “gas”, rather than “gas that has specific composition X”. But then you discover a new ontology in which “gas” is a messy concept...
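To make the messiness concrete, here is a toy Python sketch (entirely my own illustration, with made-up thresholds, and not an attempt to translate this into your formalism): the faithful translation of “air” into the chemical ontology accumulates description length from arbitrary edge-case rulings, while the adapted goal “gas” stays short, at the cost of no longer being the old concept:

```python
# Toy illustration of how a faithfully-translated concept picks up description
# length from arbitrary edge-case rulings. All thresholds are made up.

def is_air_faithful(n2: float, o2: float, ar: float, pressure_atm: float,
                    temp_k: float) -> bool:
    """'Air' pinned down inside the chemical ontology: lots of arbitrary cutoffs."""
    return (
        0.75 <= n2 <= 0.81 and          # how far off may the nitrogen ratio be?
        0.19 <= o2 <= 0.23 and          # ...and the oxygen ratio?
        ar <= 0.02 and                  # how much argon still counts?
        0.5 <= pressure_atm <= 2.0 and  # does compressed air count?
        200 <= temp_k <= 330            # does very hot air count?
    )

def is_gas(phase: str) -> bool:
    """The adapted goal: short in the new ontology, but no longer the old concept."""
    return phase == "gas"

# The faithful translation carries all the edge-case baggage; the adapted one doesn't.
print(is_air_faithful(0.78, 0.21, 0.0093, 1.0, 293))  # True: ordinary air
print(is_air_faithful(0.78, 0.21, 0.0093, 5.0, 293))  # False: compressed air excluded, by an arbitrary ruling
print(is_gas("gas"))                                  # True, and cheap to state
```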
If helpful I could probably translate this argument into something closer to your ontology, but I’m being lazy for now because your ontology is a little foreign to me. Let me know if this makes sense.
One way of framing our disagreement: I’m not convinced that the f operation makes sense as you’ve defined it. That is, I don’t think it can both be invertible and map to a goal with low complexity in the new ontology.
To clarify, I don’t think f is invertible, and that is why I talked about the preimage and not the inverse. I find it very plausible that f is not injective, i.e. that in a more compact ontology coming from a more intelligent agent, ideas/configurations etc that were different in the old ontology get mapped to the same thing in the new ontology (because the more intelligent agent realizes that they are somehow the same on a deeper level). I also believe f would not be surjective, as I wrote in response to rif a. sauros:
I’d suspect one possible counterargument is that, just like how more intelligent agents with more compressed models can more compactly represent complex goals, they are also capable of drawing ever-finer distinctions that allow them to identify possible goals that have very short encodings in the new ontology, but which don’t make sense at all as stand-alone, mostly-coherent targets in the old ontology (because it is simply too weak to represent them). So it’s not just that goals get compressed, but also that new possible kinds of goals (many of them really simple) get added to the game.
But this process should also allow new goals to arise that have ~ any arbitrary encoding length in the new ontology, because it should be just as easy to draw new, subtle distinctions inside a complex goal (which outputs a new medium- or large-complexity goal) as it would be inside a really simple goal (which outputs the type of new super-small-complexity goal that the previous paragraph talks about). So I don’t think this counterargument ultimately works, and I suspect it shouldn’t change our expectations in any meaningful way.
Nonetheless, I still expect f−1[f(G)] (viewed as the preimage of f(G) under the f mapping) and G to only differ very slightly.
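As a minimal sketch of what I have in mind (a toy example of my own, nothing rigorous), f can be many-to-one and non-surjective while the preimage of f(G) still lands close to G:

```python
# Toy f from old-ontology concepts to new-ontology concepts: neither injective
# (two old concepts collapse into one) nor surjective (some new concepts have
# no old counterpart), yet the preimage of f(G) roughly recovers G.

f = {
    "water":     "H2O",
    "ice":       "H2O",                  # non-injective: same image as "water"
    "air":       "gas mixture",
    "lightning": "electric discharge",
}
new_ontology = {"H2O", "gas mixture", "electric discharge", "quark"}  # "quark" is unreachable

def preimage(new_concept: str) -> set:
    """All old-ontology concepts that f maps to new_concept."""
    return {old for old, new in f.items() if new == new_concept}

G = "water"
print(preimage(f[G]))                   # {'ice', 'water'}: approximately, but not exactly, G
print(set(f.values()) == new_ontology)  # False: f is not surjective
```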
To clarify, I don’t think f is invertible, and that is why I talked about the preimage and not the inverse. I find it very plausible that f is not injective, i.e. that in a more compact ontology coming from a more intelligent agent, ideas/configurations etc that were different in the old ontology get mapped to the same thing in the new ontology (because the more intelligent agent realizes that they are somehow the same on a deeper level). I also believe f would not be surjective, as I wrote in response to rif a. sauros:
Nonetheless, I still expect f−1[f(G)] (viewed as the preimage of f(G) under the f mapping) and G to only differ very slightly.
Ah, sorry for the carelessness on my end. But this still seems like a substantive disagreement: you expect f−1[f(G)]≈G, and I don’t, for the reasons in my comment.