Thanks for the extensive comment! I’m finding this discussion valuable. Let me start by responding to the first half of your comment, and I’ll get to the rest later.
The simplicity of a goal is inherently dependent on the ontology through which you view it: while $K_\phi(f(G), O_2) < K_\phi(G, O_1)$ is (likely) true, pay attention to how this changes the ontology! The goal of the agent is indeed very simple, but not because the “essence” of the goal simplifies; rather, it’s because the agent gets access to a more powerful ontology with more detail, granularity, and degrees of freedom. If you try to view $f(G)$ in $O_1$ instead of $O_2$, meaning you look at the preimage $f^{-1}[f(G)]$, this should be approximately the same as $G$: your argument gives us no reason to think there is any force pulling the goal itself, as opposed to its representation, toward becoming smaller.
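To make sure I’m engaging with the right claim, here’s a toy version of it (entirely my own construction: the two “ontologies” are just concept vocabularies over a shared set of world-states, and shortest-description length stands in for $K_\phi$). The richer vocabulary makes the goal’s description much shorter, but the goal’s extension, what it actually picks out, doesn’t change at all:

```python
from itertools import combinations

# Toy model (my own construction): the world is the integers 1..10, a goal is a
# set of world-states, and an "ontology" is just a vocabulary of named concepts
# (each concept = a set of states). Shortest-description length stands in for K.
WORLD = set(range(1, 11))

O1_concepts = {str(n): {n} for n in WORLD}             # old ontology: can only name individual states
O2_concepts = {**O1_concepts,                          # new ontology: same names plus more abstract concepts
               "prime": {2, 3, 5, 7},
               "even": {n for n in WORLD if n % 2 == 0}}

def complexity(goal, concepts):
    """Crude stand-in for K_phi: the fewest named concepts whose union is exactly the goal."""
    names = list(concepts)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            if set().union(*(concepts[c] for c in combo)) == goal:
                return size
    return float("inf")

G = {2, 3, 5, 7}  # the goal "reach a prime-numbered state"

print(complexity(G, O1_concepts))  # 4: O1 has to enumerate every state in the goal
print(complexity(G, O2_concepts))  # 1: O2 names the same extension with the single concept "prime"
```

If that’s the intended picture, then my pushback below is about whether the faithful translation is actually the one that stays simple.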
One way of framing our disagreement: I’m not convinced that the f operation makes sense as you’ve defined it. That is, I don’t think it can both be invertible and map to goals with low complexity in the new ontology.
Consider a goal that someone from the past used to have, which now makes no sense in your ontology: for example, the goal of reaching the edge of the earth, for someone who thought the earth was flat. What does this goal look like in your ontology? I submit that it looks very complicated, because your ontology is very hostile to the concept of the “edge of the earth”. As soon as you try to represent the hypothetical world in which the earth is flat (which you need to do in order to point to the concept of its “edge”), you now have to assume that the laws of physics as you know them are wrong; that all the photos from space were faked; that the government is run by a massive conspiracy; etc. Basically, in order to represent this goal, you have to set up a parallel hypothetical ontology (or in your terminology, $f(G)$ needs to encode a lot of the content of $O_1$). Very complicated!
I’m then claiming that whatever force pushes our ontologies to simplify also pushes us away from using this sort of complicated construction to represent our transformed goals. Instead, the most natural thing to do is to adapt the goal in some way that ends up being simple in your new ontology. For example, you might decide that the most natural adaptation of “reaching the edge of the earth” is “going into space”; or maybe it’s “reaching the poles”; or maybe it’s “pushing the frontiers of human exploration” in a more metaphorical sense. Importantly, under this type of transformation, many different goals from the old ontology will end up being mapped to simple concepts in the new ontology (like “going into space”), and so it doesn’t match your definition of f.
All of this still applies (but less strongly) to concepts that are not incoherent in the new ontology, but rather just messy. E.g. suppose you had a goal related to “air”, back when you thought air was a primitive substance. Now we know that air is about 78% nitrogen, 21% oxygen, and 0.93% argon. Okay, so that’s one way of defining “air” in our new ontology. But this definition of air has a lot of messy edge cases: what if the ratios are slightly off? What if you have the same ratios, but very different pressures or temperatures? Etc. If you have to arbitrarily classify all these edge cases in order to pursue your goal, then your goal has now become very complex. So maybe instead you’ll map your goal to the idea of a “gas”, rather than “gas that has specific composition X”. But then you discover a new ontology in which “gas” is a messy concept...
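To make the “air” point concrete, here’s a crude sketch (every number, threshold, and name below is made up for illustration): a faithful translation of the old goal has to take a stand on each edge case, and every arbitrary threshold it commits to is extra complexity the goal never used to carry, whereas the adapted goal just rides on the simple nearby concept of a gas.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    fractions: dict     # e.g. {"N2": 0.78, "O2": 0.21, "Ar": 0.0093}
    pressure_atm: float
    temperature_k: float
    phase: str          # "solid" | "liquid" | "gas"

def is_air_faithful(s: Sample) -> bool:
    # Faithful translation of "air": must pin down every messy edge case.
    # Each numeric threshold below is an arbitrary choice the goal now depends on.
    return (s.phase == "gas"
            and abs(s.fractions.get("N2", 0) - 0.78) < 0.02    # how far off may the ratio be?
            and abs(s.fractions.get("O2", 0) - 0.21) < 0.02
            and abs(s.fractions.get("Ar", 0) - 0.0093) < 0.005
            and 0.5 < s.pressure_atm < 2.0                     # does high-pressure "air" count?
            and 180.0 < s.temperature_k < 330.0)               # does very hot "air" count?

def is_air_adapted(s: Sample) -> bool:
    # Adapted goal: map "air" onto the simple nearby concept "gas".
    return s.phase == "gas"

ordinary_air = Sample({"N2": 0.78, "O2": 0.21, "Ar": 0.0093}, 1.0, 293.0, "gas")
hot_dense_air = Sample({"N2": 0.78, "O2": 0.21, "Ar": 0.0093}, 5.0, 600.0, "gas")

print(is_air_faithful(ordinary_air), is_air_adapted(ordinary_air))    # True True
print(is_air_faithful(hot_dense_air), is_air_adapted(hot_dense_air))  # False True: an edge case the
                                                                      # faithful version had to
                                                                      # legislate arbitrarily
```

The particular thresholds don’t matter; what matters is that the faithful version can’t avoid having some thresholds, and each one is an arbitrary commitment the original goal never had to make.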
If helpful I could probably translate this argument into something closer to your ontology, but I’m being lazy for now because your ontology is a little foreign to me. Let me know if this makes sense.
One way of framing our disagreement: I’m not convinced that the f operation makes sense as you’ve defined it. That is, I don’t think it can both be invertible and map to a goal with low complexity in the new ontology.
To clarify, I don’t think f is invertible, and that is why I talked about the preimage and not the inverse. I find it very plausible that f is not injective, i.e. that in the more compact ontology of a more intelligent agent, ideas, configurations, etc. that were distinct in the old ontology get mapped to the same thing in the new one (because the more intelligent agent realizes that they are somehow the same on a deeper level). I also believe f would not be surjective, as I wrote in response to rif a. sauros:
I suspect one possible counterargument is that, just as more intelligent agents with more compressed models can represent complex goals more compactly, they are also capable of drawing ever-finer distinctions that allow them to identify possible goals which have very short encodings in the new ontology, but which don’t make sense at all as stand-alone, mostly-coherent targets in the old ontology (because it is simply too weak to represent them). So it’s not just that goals get compressed, but also that new possible kinds of goals (many of them really simple) get added to the game.
But this process should also allow new goals to arise with essentially any encoding length in the new ontology, because it should be just as easy to draw new, subtle distinctions inside a complex goal (which yields a new medium- or high-complexity goal) as inside a really simple goal (which yields the kind of new very-low-complexity goal that the previous paragraph talks about). So I don’t think this counterargument ultimately works, and I suspect it shouldn’t change our expectations in any meaningful way.
Nonetheless, I still expect $f^{-1}[f(G)]$ (the preimage of $f(G)$ under the $f$ mapping) and $G$ to differ only very slightly.
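For concreteness, here’s a minimal sketch of the picture I have in mind (the configurations and the mapping are made up): when $f$ is not injective, the preimage $f^{-1}[f(G)]$ can pick up extra configurations, but only ones that $f$ identifies with configurations already in $G$:

```python
# Toy model (made up for illustration): ontologies as finite sets of
# configurations, with f translating old configurations into new ones.
O1 = {"a1", "a2", "b1", "b2", "c"}   # old ontology
O2 = {"A", "B", "C"}                 # new, more compressed ontology

# f is not injective: a1 and a2 are recognized as "the same on a deeper level".
f = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c": "C"}

G = {"a1", "b1", "b2"}               # a goal, as a set of old-ontology configurations

f_G = {f[x] for x in G}                      # f(G) = {"A", "B"}
preimage = {x for x in O1 if f[x] in f_G}    # f^{-1}[f(G)] = {"a1", "a2", "b1", "b2"}

print(sorted(preimage - G))  # ['a2']: the only extra element is the one f merged with a1
```

In this toy picture, the difference between $f^{-1}[f(G)]$ and $G$ is exactly the set of non-goal configurations that the new ontology merges with goal configurations, which is why I expect it to stay small.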
To clarify, I don’t think f is invertible, and that is why I talked about the preimage and not the inverse. I find it very plausible that f is not injective, i.e. that in the more compact ontology of a more intelligent agent, ideas, configurations, etc. that were distinct in the old ontology get mapped to the same thing in the new one (because the more intelligent agent realizes that they are somehow the same on a deeper level). I also believe f would not be surjective, as I wrote in response to rif a. sauros:
Nonetheless, I still expect $f^{-1}[f(G)]$ (the preimage of $f(G)$ under the $f$ mapping) and $G$ to differ only very slightly.
Ah, sorry for the carelessness on my end. But this still seems like a substantive disagreement: you expect $f^{-1}[f(G)] \approx G$, and I don’t, for the reasons in my comment.