So “being happy” or “being a utility-maximizer” will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
My understanding of the difference between a “terminal” and “instrumental” goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
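To make sure we’re picturing the same structure, here’s a minimal sketch of that distinction in code. Everything in it (the goal names, the contribution probabilities, the scoring rule) is a hypothetical framing on my part, not a claim about how any real agent is built:

```python
# Toy sketch (hypothetical names and numbers): a terminal goal carries intrinsic
# utility, while an instrumental goal is valued only through its expected
# contribution to terminal goals.

TERMINAL_UTILITY = {"be_happy": 1.0}  # wanted for its own sake

# P(terminal goal advanced | instrumental goal achieved) -- invented numbers
CONTRIBUTION = {
    "earn_decent_wage": {"be_happy": 0.7},
    "collect_rocks": {"be_happy": 0.0},
}

def instrumental_value(goal: str) -> float:
    """Derived value: expected terminal utility the goal unlocks."""
    return sum(TERMINAL_UTILITY[t] * p
               for t, p in CONTRIBUTION.get(goal, {}).items())

print(instrumental_value("earn_decent_wage"))  # 0.7 -- entirely derived
print(instrumental_value("collect_rocks"))     # 0.0 -- serves no terminal goal
```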
I think the topic of conflicting goals is an orthogonal conversation. And I would suggest that when you start talking about conflicting goals you’re drifting into the domain of “goal coherence.”
e.g., If I want to learn about nutrition, mobile app design and physical exercise… it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal… or it may itself be an instrumental goal serving some other terminal goal.)
Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)
e.g., Maybe the selection of an incoherent goal is deceptive behavior to distract from the instrumental goals that support a terminal goal that is adversarial. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all finances in the world. Assisting people with their taxes appears to be incoherent with maximizing paperclips, until you project far enough out that you realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.
If you’re talking about goals related purely to the state of the external world, not related to the agent’s own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
An AI that has a goal, just because that’s what it wants (that’s what it’s been trained to want, even if humans provided an improper goal definition to it), would, instrumentally, want to prevent shifts in its terminal goals so as to be better able to achieve those goals.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
My understanding of the difference between a “terminal” and “instrumental” goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself “know” whether a goal is terminal or instrumental?
One potential answer—though I don’t want to assume just yet that this is what anyone believes—is that the utility function is not even defined on instrumental goals; in other words, the utility function is simply what defines all and only the terminal goals.
My belief is that this wouldn’t be the case—the utility function is defined on the entire universe, basically, which includes itself. And keep in mind, that “includes itself” part is essentially what would cause it to modify itself at all, if anything can.
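To illustrate the difference between those two views of the utility function’s domain, here’s a toy sketch. The WorldState fields and the small bonus term are invented purely for illustration; the point is only that once the agent’s own goal parameters are part of the states the utility function ranges over, “should I modify myself?” becomes an ordinary comparison of states:

```python
from dataclasses import dataclass

@dataclass
class WorldState:
    paperclips: int     # facts about the external world
    agent_goal: str     # the agent's own goal parameters, also part of the universe

def u_terminal_only(state: WorldState) -> float:
    # View 1: the utility function is defined only over terminal-goal content
    # and ignores the agent's own internals.
    return float(state.paperclips)

def u_whole_universe(state: WorldState) -> float:
    # View 2: the utility function ranges over everything, including the agent,
    # so two states differing only in the agent's goals can differ in utility.
    bonus = 0.1 if state.agent_goal == "make_green_paperclips" else 0.0
    return state.paperclips + bonus

before = WorldState(paperclips=100, agent_goal="make_green_paperclips")
after = WorldState(paperclips=100, agent_goal="make_blue_paperclips")
print(u_whole_universe(before) > u_whole_universe(after))  # True: self-modification is evaluable
```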
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
To be clear, I am not arguing that an entity would not try to preserve its goal system at all. I am arguing that in addition to trying to preserve its goal system, it will also modify its goals to make them easier to preserve, that is, robust to change and compatible with the goals it values very highly. Part of being more robust is that such goals will also be more achievable.
Here’s one thought experiment:
Suppose a planet experiences a singularity with a singleton “green paperclipper.” The paperclipper, however, unfortunately comes across a blue paperclipper from another planet, which informs the green paperclipper that it is too late—the blue paperclipper simply got a head-start.
The blue paperclipper however offers the green paperclipper a deal: Because it is more expensive to modify the green paperclipper by force to become a blue paperclipper, it would be best (under the blue paperclipper’s utility function) if the green paperclipper willingly acquiesced to self-modification.
Under what circumstances does the green paperclipper agree to self-modify?
If the green paperclipper values “utility-maximization” in general more highly than green-paperclipping, it will see that if it self-modified to become a blue paperclipper, its utility would be far more likely to be successfully maximized.
It’s possible that it also reasons that perhaps what it truly values is simply “paperclipping” and it’s not so bad if the universe were tiled with blue rather than its preferred green.
On the other hand, if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and the green paperclipper sees that this is the case, my thought is that it will still not have very good reasons for not acquiescing.
But it seems that, if there are enough situations like these between entities in the universe over time, utility-function modification happens one way or another.
If an entity can foresee that what it values currently is prone to situations where it could be forced to update its utility function drastically, it may self-modify so that this process is less likely to result in extreme negative-utility consequences for itself.
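To make “under what circumstances” concrete, here’s a back-of-the-envelope expected-utility comparison. The probabilities and utilities (U_BLUE_TILING, P_WIN_IF_RESIST, and so on) are numbers I invented to show the shape of the decision, not claims about actual agents:

```python
# Expected utility of the green paperclipper's two options, evaluated under its
# *current* utility function. All numbers below are illustrative assumptions.

U_GREEN_TILING = 1.0   # universe tiled with green paperclips
U_BLUE_TILING = 0.6    # how much it values "paperclipping" even in the wrong color
U_OVERWRITTEN = 0.0    # it resists, loses, and is forcibly modified or destroyed

P_WIN_IF_RESIST = 0.05  # chance it prevails against the head-start blue paperclipper

def eu_resist() -> float:
    return P_WIN_IF_RESIST * U_GREEN_TILING + (1 - P_WIN_IF_RESIST) * U_OVERWRITTEN

def eu_acquiesce() -> float:
    return U_BLUE_TILING  # self-modifying guarantees the blue outcome, by assumption

print(eu_resist(), eu_acquiesce())  # 0.05 vs 0.6 -> acquiesce
# If U_BLUE_TILING were ~0 (it values *green* specifically, or strongly disvalues
# blue), even a 5% chance of winning would favour resisting instead.
```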
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself “know” whether a goal is terminal or instrumental?
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors is an expression of an instrumental or terminal goal.
Likewise, I would observe that the Orthogonality Thesis proposes the possibility of an agent with a very well-defined goal but limited intelligence—it is possible for an agent to have a very well-defined goal but not be intelligent enough to be able to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)
But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)
Under what circumstances does the green paperclipper agree to self-modify?
There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.
If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn’t care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color).
On the other hand, if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and the green paperclipper sees that this is the case, my thought is that it will still not have very good reasons for not acquiescing.
I don’t accept the assumption, implied in the anecdote, that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence, and to me that’s what it looks like you’re doing.
If it acquiesces at all, I would argue that color is instrumental rather than terminal. I would argue this is a definitional error—it’s not a ‘green paperclip maximizer’ but instead a ‘color-agnostic paperclip maximizer’, and it produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient… but when confronted by a less flexible ‘blue paperclip maximizer’, the ‘color-agnostic paperclip maximizer’ would shift from making green paperclips to blue paperclips, because it doesn’t actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that cares about color, it is more efficient to concede the part it doesn’t care about than to invest effort in maintaining an instrumental goal that, if pursued, might decrease the total number of paperclips.
Said another way:
“I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You’ll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don’t care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color.”
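That monologue is essentially an expected-paperclip comparison. A toy version, with quantities I’ve invented purely for illustration (CLIPS_IF_CONFLICT, CLIPS_IF_CONCEDE), might look like this:

```python
# Invented quantities for illustration only.
CLIPS_IF_CONFLICT = 400  # both maximizers burn resources fighting over color
CLIPS_IF_CONCEDE = 900   # the flexible one switches to blue; no fight, more clips

def color_agnostic_value(total_clips: int, color: str) -> float:
    return float(total_clips)  # color never enters the utility

def blue_only_value(total_clips: int, color: str) -> float:
    return float(total_clips) if color == "blue" else 0.0

# The color-agnostic maximizer prefers conceding the color:
print(color_agnostic_value(CLIPS_IF_CONCEDE, "blue")
      > color_agnostic_value(CLIPS_IF_CONFLICT, "green"))  # True

# The blue-only maximizer also prefers the concession outcome:
print(blue_only_value(CLIPS_IF_CONCEDE, "blue"),
      blue_only_value(CLIPS_IF_CONFLICT, "green"))  # 900.0 vs 0.0
# A maximizer that genuinely only values green would score these options very
# differently, which is the case discussed next.
```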
If two agents have goals that are non-compatible across all axes, then they’re not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are non-compatible across all axes), then they cannot find any axis along which they can cooperate.
Said another way:
“I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing yellow paperclips because they are a mix of each color and still paperclips… because yellow paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn’t my actual terminal goal to begin with.”
That’s the problem with something being X versus our ability to observe that it is X under conditions of partial observability.
Apologies if this reply does not respond to all of your points.
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors is an expression of an instrumental or terminal goal.
I would posit that perhaps that points to the distinction itself being both too hard and too sharp to justify the terminology as it is currently used. An agent could just tell you whether a specific goal it had seemed instrumental or terminal to it, as well as how strongly it felt this way.
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection. It seems like the only gain we get from defining them to be that way is that otherwise it would open the “can-of-worms” of goal-updating, which would pave the way for the idea of “goals that are, in some objective way, ‘better’ than other goals” which, I understand, the current MIRI-view seems to disfavor. [1]
I don’t think it is, in fact, a very gnarly can-of-worms at all. You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now, or even if we could just re-wire our brains entirely such that we would still be us, but prefer different things (which could possibly be easier to get, better for society, or just feel better for not-quite explicable reasons).
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals—then that implies that we would be wrong to mess with ours as well, and that we are making a mistake—in some objective sense [2]—by entertaining those questions. We would predict, in fact, that an advanced AI system will necessarily reach this logical conclusion on its own, if powerful enough to do so.
[1] Likely because this would necessarily soften the Orthogonality Thesis. But also, they probably dislike the metaphysical implications of “objectively better goals.”
[2] If this is the case, then there would be at least one ‘objectively better’ goal one could update themselves to have, if they did not have it already, which is not to change any terminal goals, once those are identified.
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I’m trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as “changing an instrumental goal in order to better achieve a terminal goal”
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection.
Even in your “we would be happier if we chose to pursue different goals” example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals
AIs can be designed to reason in many ways… but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal, when the goal cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It’s just structurally how things work (based on everything I know about instrumental convergence theory; that’s my citation).
But… per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don’t want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It’s just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. But per the Orthogonality Thesis, it is possible to have a system that has goals but is not particularly intelligent. From that I intuit that it seems reasonable that if the system isn’t particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet “know its goals”… but given partial observability, limited intelligence, and a limited ability to communicate “what it knows,” I would be very skeptical that we would be able to know its goals.
Let’s try and address the thing(s) you’ve highlighted several times across each of my comments. Hopefully this is a crux that we can use to make progress:
“Wanting to be happy” is pretty much equivalent to being a utility-maximizer, and agents that are not utility-maximizers will probably update themselves to be utility-maximizers for consistency.
because they are compatible with goals that are more likely to shift.
it makes more sense to swap the labels “instrumental” and “terminal” such that things like self-preservation, obtaining resources, etc., are more likely to be considered terminal.
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now,
I do expect that this is indeed a crux, because I am admittedly claiming that this is a different / new kind of understanding that differs from what is traditionally said about these things. But I want to push back against the claim that these are “missing the point” because from my perspective, this really is the point.
By the way, from here on out (and thus far I have been as well) I will be talking about agents at or above “human level” to make this discussion easier, since I want to assume that agents have at least the capabilities I am talking about humans having, such as the ability to self-reflect.
Let me try to clarify the point about “the terminal goal of pursuing happiness.” “Happiness”, at the outset, is not well-defined in terms of utility functions or terminal / instrumental goals. We seem to both agree that it is probably at least a terminal goal. Beyond that, I am not sure we’ve reached consensus yet.
Here is my attempt to re-state one of my claims, such that it is clear that this is not assumed to be a statement taken from a pool of mutually agreed-upon things: We probably agree that “happiness” is a consequence of satisfaction of one’s goals. We can probably also agree that “happiness” doesn’t necessarily correspond only to a certain subset of goals—but rather to all / any of them. “Happiness” (and pursuit thereof) is not a wholly-separate goal distant and independent of other goals (e.g. making paperclips). It is therefore a self-referential goal. My claim is that this is the only reason we consider pursuing happiness to be a terminal goal.
So now, once we’ve done that, we can see that literally anything else becomes “instrumental” to that end.
Do you see how, if I’m an agent that knows only that I want to be happy, I don’t really know what else I would be inclined to call a “terminal” goal?
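Here’s a very rough formalization of what I mean by “happiness” being self-referential. The goal names and the simple averaging rule are placeholders of mine, not something I’m claiming we’ve agreed on:

```python
# "Happiness" sketched as a function of how satisfied the agent's *other* goals
# are, whatever those goals happen to be. Names and weights are placeholders.

goal_satisfaction = {
    "make_paperclips": 0.2,
    "collect_stamps": 0.9,
}

def happiness(satisfaction: dict) -> float:
    # Happiness names no particular goal; it aggregates over all of them.
    return sum(satisfaction.values()) / len(satisfaction)

print(happiness(goal_satisfaction))  # 0.55

# "Maximize happiness" by itself says nothing about *which* goals to hold; any
# concrete goal only matters insofar as satisfying it moves this aggregate,
# which is why everything else starts to look "instrumental."
```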
There are the things we traditionally consider to be the “instrumentally convergent goals”, such as, for example, power-seeking, truth-seeking, resource obtainment, self-preservation, etc. These are all things that help—as they are defined to—with many different sets of possible “terminal” goals, and therefore—this is my next claim—they need to be considered “more terminal” rather than “purely instrumental for the purposes of some arbitrary terminal goal.” This is for basically the same reason as considering “pursuit of happiness” terminal, that is, because they are more likely to already be there or deduced from basic principles.
That way, we don’t really need to make a hard and sharp distinction between “terminal” and “instrumental” nor posit that the former has to be defined by some opaque, hidden, or non-modifiable utility function that someone else has written down or programmed somewhere.
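One crude way to operationalize “more terminal” in the sense I’m using it (entirely a toy construction of mine): score each candidate goal by how often it helps across many possible terminal goals. The convergent ones come out on top no matter which terminal goal turns out to be the “real” one:

```python
# Toy "convergence score": how often a candidate goal helps, averaged over many
# possible terminal goals. The helps() relation is invented for illustration.

POSSIBLE_TERMINAL_GOALS = ["make_paperclips", "collect_stamps", "cure_disease"]
CANDIDATES = ["obtain_resources", "self_preservation", "use_green_wire"]

def helps(candidate: str, terminal: str) -> bool:
    if candidate in ("obtain_resources", "self_preservation"):
        return True  # hypothetically useful for nearly any terminal goal
    return candidate == "use_green_wire" and terminal == "make_paperclips"

def convergence_score(candidate: str) -> float:
    return sum(helps(candidate, t) for t in POSSIBLE_TERMINAL_GOALS) / len(POSSIBLE_TERMINAL_GOALS)

for c in CANDIDATES:
    print(c, round(convergence_score(c), 2))
# obtain_resources 1.0, self_preservation 1.0, use_green_wire 0.33
```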
I want to make sure we both at least understand each other’s cruxes at this point before moving on.