Rob Bensinger answers why assume AGIs will optimize for fixed goals?

Rob Bensinger 10 Jun 2022 3:54 UTC
LW: 76 AF: 26
0
AF
It has been argued that if you already have the fixed-terminal-goal-directed wrapper structure, then you will prefer to avoid outside influences that will modify your goal. This is true, but does not explain why the structure would emerge in the first place.
I think Eliezer usually assumes that goals start off not stable, and then some not-necessarily-stable optimization process (e.g., the agent modifying itself to do stuff, or a gradient-descent-ish or evolution-ish process iterating over mesa-optimizers) makes the unstable goals more stable over time, because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future.
(I don’t need a temporally stable goal in order to self-modify toward stability, because all of my time-slices will tend to agree that stability is globally optimal, though they’ll disagree about which time-slice’s goal ought to be the one stably optimized.)
E.g., quoting Eliezer:
So what actually happens as near as I can figure (predicting future = hard) is that somebody is trying to teach their research AI to, god knows what, maybe just obey human orders in a safe way, and it seems to be doing that, and a mix of things goes wrong like:

The preferences not being really readable because it’s a system of neural nets acting on a world-representation built up by other neural nets, parts of the system are self-modifying and the self-modifiers are being trained by gradient descent in Tensorflow, there’s a bunch of people in the company trying to work on a safer version but it’s way less powerful than the one that does unrestricted self-modification, they’re really excited when the system seems to be substantially improving multiple components, there’s a social and cognitive conflict I find hard to empathize with because I personally would be running screaming in the other direction two years earlier, there’s a lot of false alarms and suggested or attempted misbehavior that the creators all patch successfully, some instrumental strategies pass this filter because they arose in places that were harder to see and less transparent, the system at some point seems to finally “get it” and lock in to good behavior which is the point at which it has a good enough human model to predict what gets the supervised rewards and what the humans don’t want to hear, they scale the system further, it goes past the point of real strategic understanding and having a little agent inside plotting, the programmers shut down six visibly formulated goals to develop cognitive steganography and the seventh one slips through, somebody says “slow down” and somebody else observes that China and Russia both managed to steal a copy of the code from six months ago and while China might proceed cautiously Russia probably won’t, the agent starts to conceal some capability gains, it builds an environmental subagent, the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.

Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.

That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.
One way of thinking about this is that a temporally unstable agent is similar to a group of agents that exist at the same time, and are fighting over resources.
In the case where a group of agents exist at the same time, each with different utility functions, there will be a tendency (once the agents become strong enough and have a varied enough option space) for the strongest agent to try to seize control from the other agents, so that the strongest agent can get everything it wants.
A similar dynamic exists for (sufficiently capable) temporally unstable agents. Alice turns into a werewolf every time the moon is full; since human-Alice and werewolf-Alice have very different goals, human-Alice will tend (once she’s strong enough) to want to chain up werewolf-Alice, or cure herself of lycanthropy, or brainwash her werewolf self, or otherwise ensure that human-Alice’s goals are met more reliably.
Another way this can shake out is that human-Alice and werewolf-Alice make an agreement to self-modify into a new coherent optimizer that optimizes some compromise of the two utility functions. Both sides will tend to prefer this over, e.g., the scenario where human-Alice keeps turning on a switch and then werewolf-Alice keeps turning the switch back off, forcing both of them to burn resources in a tug-of-war.
What links here?
- nostalgebraist 10 Jun 2022 23:13 UTC
  LW: 91 AF: 32
  0
  AF Parent
  because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future
  I personally doubt that this is true, which is maybe the crux here.
  This seems like a possibly common assumption, and I’d like to see a more fleshed-out argument for it. I remember Scott making this same assumption in a recent conversation:
  I agree humans aren’t like that, and that this is surprising.
  Maybe this is because humans aren’t real consequentialists, they’re perceptual control theory agents trying to satisfy finite drives? [...] Might gradient descent produce a PCT agent instead of a mesa-optimizer? I don’t know. My guess is maybe, but that optimizers would be more, well, optimal [...]
  But is it true that “optimizers are more optimal”?
  When I’m designing systems or processes, I tend to find that the opposite is true—for reasons that are basically the same reasons we’re talking about AI safety in the first place.
  A powerful optimizer, with no checks or moderating influences on it, will tend to make extreme Goodharted choices that look good according to its exact value function, and very bad (because extreme) according to almost any other value function.
  Long before things reach the point where the outer optimizer is developing a superintelligent inner optimizer, it has plenty of chances to learn the general design principle that “putting all the capabilities inside an optimizing outer loop ~always does something very far from what you want.”
  Some concrete examples from real life:
  - Using gradient descent. I use gradient descent to make things literally every day. But gradient descent is never the outermost loop of what I’m doing.
    
    That would look like “setting up a single training run, running it, and then using the model artifact that results, without giving yourself freedom to go back and do it over again (unless you can find a way to automate that process itself with gradient descent).” This is a peculiar policy which no one follows. The individual artifacts resulting from individual training runs are quite often bad—they’re overfit, or underfit, or training diverged, or they got great val metrics but the output sucks and it turns out your val set has problems, or they got great val metrics but the output isn’t meaningfully better and the model is 10x slower than the last one and the improvement isn’t worth it, or they are legitimately the best thing you can get on your dataset but that causes you to realize you really need to go gather more data, or whatever.
    
    All the impressive ML artifacts made “by gradient descent” are really outputs of this sort of process of repeated experimentation, refining of targets, data gathering and curation, reframing of the problem, etc. We could argue over whether this process is itself a form of “optimization,” but in any case we have in our hands a (truly) powerful thing that very clearly is optimization, and yet to leverage it effectively without getting Goodharted, we have to wrap it inside some other thing.
  - Delegating to other people. To quote myself from here:
    
    “How would I want people to behave if I – as in actual me, not a toy character like Alice or Bob – were managing a team of people on some project? I wouldn’t want them to be ruthless global optimizers; I wouldn’t want them to formalize the project goals, derive their paperclip-analogue, and go off and do that. I would want them to take local iterative steps, check in with me and with each other a lot, stay mostly relatively close to things already known to work but with some fraction of time devoted to far-out exploration, etc.”
    
    There are of course many Goodhart horror stories about organizations that focus too hard on metrics. The way around this doesn’t seem to be “find the really truly correct metrics,” since optimization will always find a way to trick you. Instead, it seems crucial to include some mitigating checks on the process of optimizing for whatever metrics you pick.
  - Checks against dictatorship as a principle of government design, as opposed to the alternative of just trying to find a really good dictator.
    
    Mostly self-explanatory. Admittedly a dictator is not likely to be a coherent optimizer, but I expect a dictatorship to behave more like one than a parliamentary democracy.
    
    If coherence is a convergent goal, why don’t all political sides come together and build a system that coherently does something, whatever that might be? In this context, at least, it seems intuitive enough that no one really wants this outcome.
  In brief, I don’t see how to reconcile
  1. “in the general case, coherent optimizers always end up doing some bad, extreme Goodharted thing” (which matches both my practical experience and a common argument in AI safety), and
  2. “outer optimizers / deliberating agents will tend to converge on building (more) coherent (inner) optimizers, because they expect this to better satisfy their own goals,” i.e. the “optimizers are more optimal” assumption.
  EDIT: an additional consideration applies in the situation where the AI is already at least as smart as us, and can modify itself to become more coherent. Because I’d expect that AI to notice the existence of the alignment problem just as much as we do (why wouldn’t it?). I mean, would you modify yourself into a coherent EU-maximizing superintelligence with no alignment guarantees? If that option became available in real life, would you take it? Of course not. And our hypothetical capable-but-not-coherent AI is facing the exact same question.
  What links here?
  - Yitz 19 Jun 2022 4:14 UTC
    16 points
    0
    Parent
    This is a really high-quality comment, and I hope that at least some expert can take the time to either convincingly argue against it, or help confirm it somehow.
  - Vladimir_Nesov 19 Jun 2022 19:10 UTC
    9 points
    Parent
    
    I mean, would you modify yourself into a coherent EU-maximizing superintelligence with no alignment guarantees? If that option became available in real life, would you take it? Of course not. And our hypothetical capable-but-not-coherent AI is facing the exact same question.
    
    Why no alignment guarantees and why modify yourself and not build separately? The concern is that even if a non-coherent AGI solves its own alignment problem correctly, builds an EU-maximizing superintelligence aligned with the non-coherent AGI, the utility of the resulting superintelligence is still not aligned with humanity.
    
    So the less convenient question should be, “Would you build a coherent optimizer if you had all the alignment guarantees you would want, all the time in the world to make sure it’s done right?” A positive answer to that question given by first non-coherent AGIs supports relevance of coherent optimizers and their alignment.
  - alexlyzhov 19 Jun 2022 11:45 UTC
    9 points
    Parent
    When you say that coherent optimizers are doing some bad thing, do you imply that it would always be a bad decision for the AI to make the goal stable? But wouldn’t it heavily depend on what other options it thinks it has, and in some cases maybe worth the shot? If such a decision problem is presented to the AI even once, it doesn’t seem good.
    The stability of the value function seems like something multidimensional, so perhaps it doesn’t immediately turn into a 100% hardcore explicit optimizer forever, but there is at least some stabilization. In particular, bottom-up signals that change the value function most drastically may be blocked.
    AI can make its value function more stable to external changes, but it can also make it more malleable internally to partially compensate for Goodharting. The end result for outside actors though is that it only gets harder to change anything.
    
    Edit: BTW, I’ve read some LW articles on Goodharting but I’m also not yet convinced it will be such a huge problem at superhuman capability levels—seems uncertain to me. Some factors may make it worse as you get there (complexity of the domain, dimensionality of the space of solutions), and some factors may make it better (the better you model the world, the better you can optimize for the true target). For instance, as the model gets smarter, the problems from your examples seem to be eliminated: in 1, it would optimize end-to-end, and in 2, the quality of the decisions would grow (if the model had access to the ground truth value function all along, then it would grow because of better world models and better tree search for decision-making). If the model has to check-in and use feedback from the external process (human values) to not stray off course, then as it’s smarter it’s discovering a more efficient way to collect the feedback, has better priors, etc.
  - leogao 17 Jun 2022 4:28 UTC
    6 points
    Parent
    One possible reconciliation: outer optimizers converge on building more coherent inner optimizers because the outer objective is only over a restricted domain, and making the coherent inner optimizer not blow up inside that domain is much much easier than making it not blow up at all, and potentially easier than just learning all the adaptations to do the thing. Concretely, for instance, with SGD, the restricted domain is the training distribution, and getting your coherent optimizer to act nice on the training distribution isn’t that hard, the hard part of fully aligning it is getting from objectives that shake out as [act nice on the training distribution but then kill everyone when you get a chance] to an objective that’s actually aligned, and SGD doesn’t really care about the hard part.
  - Jeremy Gillen 13 Dec 2024 15:35 UTC
    2 points
    0
    Parent
    because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future
    I personally doubt that this is true, which is maybe the crux here.
    Would you like to do a dialogue about this? To me it seems clearly true in exactly the same way that having more time to pursue a goal makes it more likely you will achieve that goal.
    It’s possible another crux is related to the danger of Goodharting, which I think you are exaggerating the danger of. When an agent actually understand what it wants, and/or understands the limits of its understanding, then Goodhart is easy to mitigate, and it should try hard to achieve its goals (i.e. optimize a metric).
- Gunnar_Zarncke 10 Jun 2022 21:23 UTC
  7 points
  Parent
  Do we have evidence about more intelligent beings being more stable or getting more stable over time? Are more intelligent humans more stable/get more stable/get stable more quickly?
- RHollerith 10 Jun 2022 20:42 UTC
  1 point
  Parent
  I agree with this comment. I would add that there is an important sense in which the typical human is not a temporally unstable agent.
  
  It will help to have an example: the typical 9-year-old boy is uninterested in how much the girls in his environment like him and doesn’t necessarily wish to spend time with girls (unless those girls are acting like boys). It is tempting to say that the boy will probably undergo a change in his utility function over the next 5 or so years, but if you want to use the concept of expected utility (defined as the sum of the utility of the various outcome weighted by their probability) then to keep the math simple you must assume that the boy’s utility function does not change with time with the result that you must define the utility function to be not the boy’s current preferences, but rather his current preferences (conscious and unconscious) plus the process by which those preference will change over time.
  
  Humans are even worse at perceiving the process that changes their preferences over time than they are at perceiving their current preferences. (The example of the 9-year-old boy is an exception to that general rule: even the 9-year-old boys tend to know that their preferences around girls are probably going to change in not too many years.) The author of the OP seems to have conflated the goals that the human knows that he has with the human’s utility function whereas they are quite different.
  
  It might be that there is some subtle point the OP is making about temporally unstable agents that I have not addressed in my comment, but if he expects me to hear him out on it, he should write it up in such a way as to make to clear that he not just confused about how the concept of the utility function is being applied to AGIs.
  
  I haven’t explained or shown how or why the assumption that the AGI’s utility function is constant over time simplifies the math—and simplifies an analysis that does not delve into actual math. Briefly, if you want to create a model in which the utility function evolves over time, you have to specify how it evolves—and to keep the model accurate, you have to specify how evidence coming in from the AGI’s senses influences the evolution. But of course, sensory information is not the only things influencing the evolution; we might call the other influence an “outer utility function”. But then why not keep the model simple and assume (define) the goals that the human is aware of to be not terms (terminology?) in a utility function, but rather subgoals? Any intelligent agent will need some machinery to identify and track subgoals. That machinery must modify the priorities of the subgoals in response to evidence coming in from the senses. Why not just require our model to include a model of the subgoal-updating machinery, then equate the things the human perceives as his current goals with subgoals?
  
  Here is another way of seeing it. Since a human being is “implemented” using only deterministic laws of physics, the “seed” of all of the human’s behaviors, choices and actions over a lifetime are already present in the human being at birth! Actually that is not true: maybe the human’s brain is hit by a cosmic ray when the human is 7 years old with the result that the human grows up to like boys whereas if it weren’t for the cosmic ray, he would like girls. (Humans have evolved to be resistant to such “random” influences, but such influences nevertheless do occasionally happen.) But it is true that the “seed” of all of the human’s behaviors, choices and actions over a lifetime are already present at birth! (That sentence is just a copy of a previous sentence omitting the words “in the human being” to take into account the possibility that the “seed” includes a cosmic ray light years away from Earth at the time of the person’s birth.) So for us to assume that the human’s utility function does not vary over time not only simplifies the math, but also is more physically realistic.
  
  If you define the utility function of a human being the way I have recommended above that you do, you must realize that there are many ways in which humans are unaware or uncertain about their own utility function and that the function is very complex (incorporating for example the processes that produce cosmic rays) although maybe all you need is an approximation. Still, that is better than defining your model such that utility function vary over time.