GPT-X, Paperclip Maximizer? Analyzing AGI and Final Goals
Epistemic status: Low confidence; I feel I may be missing something important about how to “put goals into” intelligent systems.
In the artificial intelligence literature, it's common to see researchers examining the question of “what is the right final goal to give an AGI?”. Researchers frequently offer examples of what might happen if we give a superintelligent AGI the wrong final goal; Nick Bostrom, for example, zeros in on this question in his book Superintelligence with the scenario of a superintelligent AGI, deployed by a paperclip factory, whose final goal is to maximize paperclips. The AGI interprets its goal literally (as computers tend to do) and turns the entire universe into paperclips. Bostrom shares this story to show that we must be careful with the final goal we put into an AGI system; we'll want whatever final goal we do put in to be fully representative of human values (and much work has gone into figuring out how to specify them). While Bostrom and others are right to call out the slipperiness of language and the importance of exact specification, there's a deeper question that often goes unaddressed: whether we'll be able to give an AGI any final goals at all.
The assumed answer to this question is “yes”, and it's easy to see why: computers and final goals have gone hand-in-hand since the days of Babbage. We come up with a problem, then structure the computer in a way that solves it, plain and simple. This holds true even for the more complex problems we've recently made progress on, where we don't know the exact path the computer will take (e.g. training a computer to play Go). For these types of programs, we still formally pose the problem (e.g. minimize Go losses), then run an algorithm (e.g. reinforcement learning with gradient descent) which ensures the program's behavior targets the given final goal. These programs are “narrow”, in that they target an exactly specified goal in a limited domain. We have every reason to believe that these “narrow” programs will continue to improve (and they're already quite powerful, outclassing humans at tasks like Go and object recognition), but they're in a separate class from generally intelligent programs. To achieve the world-changing impacts that Bostrom and others talk about, we'll need systems with general intelligence; specifically, the ability to understand and reason about the extended domain of the natural world (much as humans and other animals do). While we don't yet know what the architecture of these systems will look like, there seem to be two categories of systems which might arise:
1. We may find that our current “narrow” systems naturally progress into generally intelligent systems by expanding their input domain (to include the natural world) and adding more computing power. These systems are structured around a single final goal, and they learn by using gradient descent to move towards configurations which better accomplish that goal. The goal anchors the system; progress (or lack thereof) on the goal tells the system how to adapt.
2. We may build systems which work in a manner comparable to the brain, eschewing goal-based anchoring (and gradient descent towards a task-specific goal) and instead relying on unsupervised lower-level algorithms which build a world model from inputs, without any task-specific goal driving the process. We don't yet have a good sense of the kinds of algorithms which produce this behavior, although our neocortical algorithms seem to be an existence proof. Think about how a baby learns a model of the world: by a couple of years of age, a child has a fairly robust model, understanding objects, concepts, and abstract reference (language). There's no “task-level goal” in the baby's brain to anchor the learning process; rather, the neocortical algorithms serve to create internal representations which match up increasingly well with external stimuli. We can imagine carrying this idea further and building AGI systems where adaptations take the form of progress towards better world representations. (A toy sketch contrasting these two kinds of update rule follows this list.)
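To make the contrast concrete, here's a toy sketch of the two update rules described above. Everything in it is illustrative: the linear “model”, the scalar observations, and the function names are stand-ins I've made up, not a proposal for how either kind of system would actually be built.

```python
import numpy as np

def type1_step(params, observation, goal_grad, lr=0.01):
    # Type 1: the gradient of an explicit final goal (goal_grad) drives
    # every update; the goal is what anchors the system.
    return params - lr * goal_grad(params, observation)

def type2_step(params, history, next_obs, lr=0.01):
    # Type 2: no task-level goal; the only learning signal is prediction
    # error against incoming stimuli ("make the internal model match the world").
    prediction = params @ history          # predict the next observation
    error = prediction - next_obs
    return params - lr * error * history   # reduce prediction error, nothing more
```

The only point of the sketch is the shape of the signal: in the first rule the goal shows up in every update, while in the second it never appears at all.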
Getting back to our original question, it seems type 1 systems are easy; we can clearly give them final goals. The architecture of these systems could resemble that of today’s systems, like an AlphaGo but with a domain of the natural world (instead of a Go board) and a final goal of making paperclips (instead of playing Go well). The question for these systems is not how to build them or how to give them final goals, but whether they can handle the vast, uncertain domain of the natural world. Is the singular pressure to “maximize paperclips” sufficient for a system to develop concepts? It feels like the domain is too broad, and the goal too narrow, for any real understanding to develop.
We can push our intuitions further by thinking about the training a different way. AlphaGo's training can be visualized by imagining that its final goal (play Go well) defines a many-dimensional landscape (an axis for each variable in the system). Every possible configuration corresponds to a point on this landscape, and the lower the point, the better the system plays Go. We can't see this landscape (if we could, optimization would be easy!), but for any individual point, we can simulate games and figure out the way “down” (i.e. the gradient). Using this technique, we can take incremental steps down, moving toward a more optimal way of playing. This training works well because of the definiteness of the game: there's a (relatively) straightforward path down. It also works well because the goal is directly related to nearly every aspect of the environment (“playing Go well” depends directly on every potential move and existing stone position). Neither of these conditions seems to hold when we extend the domain to the natural world. A final goal like “maximize paperclips” defines a landscape, but one that is jagged and uneven (think of all the ways to maximize paperclips!). There is, by definition, a point in the landscape that maximizes paperclips, but the shape of the landscape leaves no feasible path toward it. Most of the system's inputs would be only indirectly related to the goal (e.g. dogs are mostly unrelated to paperclip maximization), so attempts to advance by anchoring to the goal will be limited in power (i.e. the system would not develop the concept “dog”). We may eventually find a way to tweak type 1 systems to get them to the point of general intelligence, but it seems an unlikely path.
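To see why the shape of the landscape matters, here's a toy one-dimensional illustration. The two functions are made-up stand-ins for a “definite” objective and a “jagged, uneven” one; nothing here is a real AI objective, it just shows how the same gradient-descent procedure behaves on each.

```python
import numpy as np

def smooth(x):
    # A well-behaved landscape: one basin, one clear way "down".
    return (x - 3.0) ** 2

def jagged(x):
    # A rugged landscape: the same basin, plus many local dips along the way.
    return (x - 3.0) ** 2 + 5.0 * np.sin(5.0 * x)

def descend(f, x0, lr=0.01, steps=2000, eps=1e-4):
    x = x0
    for _ in range(steps):
        grad = (f(x + eps) - f(x - eps)) / (2 * eps)   # numerical gradient
        x -= lr * grad
    return x

print(descend(smooth, x0=-5.0))   # ends up near the true optimum at x = 3
print(descend(jagged, x0=-5.0))   # stalls in a local dip, far from x = 3
```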
Type 2 systems, on the other hand, are a bit trickier to reason about. They're different from the systems we're used to building, as there's no task-specific goal to form a solution landscape. Instead, these systems are centered around a “world model” landscape, moving towards a better representation of the world, and then using that representation to solve problems (in the general sense). These systems can still be structured to have goals, just not final goals. Again, we can look to the human brain for intuition. Simplifying greatly, the brain consists of the neocortex, which models the world, and the subcortex, which “pressures” the neocortex towards certain goals. These goals include things like eating, drinking, and having sex: all important parts of human nature, but distinctly different from “final goals”, as they do not anchor the neocortex's update process. We can see this played out at the macro level by observing how people act: while people generally pursue these goals, they aren't compelled to, and can choose among an infinity of other actions. Though looking at human behavior provides some intuition about type 2 systems, we must be careful to avoid anthropomorphizing AGI, as the systems we build may have completely different architectures and “pressures”. The behavioral space of these systems is hard to reason about, since we haven't made nearly as much progress building them, but we're getting closer. Specifically, it seems GPT-3, a natural language prediction model, can help us reason about the behavior of type 2 systems and develop an understanding of what it means for these types of systems to have “final goals”. (For those unfamiliar with GPT-3, this post provides a more in-depth overview.)
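As a rough way to hold the distinction in mind, here's a toy sketch separating the two roles. All the names and structures below are hypothetical (this is not a claim about how the brain or any real system works): the world model updates only on prediction error, while “pressures” merely weight action selection and never enter the learning rule.

```python
import numpy as np

def update_world_model(weights, history, next_obs, lr=0.01):
    # Learning is anchored to "match the world": reduce prediction error,
    # with no drive or goal appearing anywhere in the update.
    error = weights @ history - next_obs
    return weights - lr * error * history

def choose_action(actions, predicted_outcomes, pressures):
    # Pressures (hunger, thirst, ...) bias which action looks best, but they
    # don't rewrite the model, and nothing forces the agent to obey them.
    scores = [sum(pressures.get(need, 0.0) * amount
                  for need, amount in predicted_outcomes(a).items())
              for a in actions]
    return actions[int(np.argmax(scores))]
```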
GPT-3 has not yet achieved fully general intelligence (language is still a more limited domain than the natural world), but it has come closer than any other system built so far. The key to GPT-3's generality is its type 2 structure: rather than having a task-specific final goal, the system is set up to model its domain (by predicting the next word, given a series of words). GPT-3 can write stories, create poetry, or even answer math problems (albeit poorly) because all of these activities are encompassed within the world model it builds. With this power, however, comes a potential issue: there's no way to enforce particular desired behaviors for GPT-3. For instance, let's say we want GPT-3 to always portray humans in a favorable light in the text it composes. We don't have a way of specifying this up front, because encoding that goal requires concepts which the system doesn't yet have (before training, the system has no concept of “human” or “favorable light”). Likewise, we won't be able to specify it after training, as we don't fully understand how the system's connections work together (GPT-3 has 175 billion parameters), and so we can't “fit in” the constraint. We could try to control the portrayal of humans by limiting the input data (e.g. leaving out WWII), but this technique doesn't scale (we can't curate the natural world in the same way), and still can't ensure a positive portrayal. With a type 1 system, constraints like this are easy to enforce; you simply build them into the final goal. Type 2 systems, however, present significantly more difficulty, especially as we approach the scale necessary for human-level intelligence.
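For concreteness, here is the shape of the objective a GPT-style model is trained on, in toy numpy form (the real training loop is vastly larger, and `model` below is just a placeholder for whatever network we train). The thing to notice is what's absent: there's no term anywhere for “portray humans favorably”, or for any other downstream constraint; the only signal is how well the model predicts its domain.

```python
import numpy as np

def next_token_loss(logits, target_id):
    # Cross-entropy between the model's predicted distribution over the
    # vocabulary and the token that actually came next.
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]

def sequence_loss(model, token_ids):
    # model(context) -> logits over the vocabulary; 'model' is a placeholder
    # for the network being trained (a transformer, in GPT-3's case).
    return sum(next_token_loss(model(token_ids[:t]), token_ids[t])
               for t in range(1, len(token_ids)))
```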
One way we could look to control type 2 systems is through “pressures” rather than final goals. For example, we could identify a list of positive words and structure GPT-3 to form stronger connections when the word “human” appears near those words, or we could add a filter on the final output for certain negative words paired with “human”. Techniques like these can nudge the system in a particular direction, but there's no guarantee of the desired behavior (there are plenty of ways unfavorable portrayals of humans could slip through the cracks). We can view these techniques as analogous to the “pressures” humans feel to eat and drink: we're not guaranteed to, but in the vast majority of cases we will.
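As an illustration of how blunt such a filter is, here's a minimal sketch (the word list, helper names, and the `generate` function are all hypothetical). It can only catch the patterns we thought to list; paraphrases, sarcasm, and anything off-list slip straight through, which is exactly the “no guarantee” problem described above.

```python
# Illustrative only: a crude output filter as one possible "pressure".
NEGATIVE_WORDS = {"evil", "worthless", "vile"}   # whatever list we choose

def portrays_humans_negatively(text):
    lowered = text.lower()
    return "human" in lowered and any(w in lowered for w in NEGATIVE_WORDS)

def filtered_generate(generate, prompt, max_tries=10):
    # 'generate' stands in for sampling text from the trained model.
    for _ in range(max_tries):
        text = generate(prompt)
        if not portrays_humans_negatively(text):
            return text
    return ""   # give up; unlike a final goal, the pressure guarantees nothing
```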
As we look to scale up type 2 systems (for example, a GPT-type system with light and sound sensors, coupled with various effectors), we will need to think about the right “pressures” to apply. The problem is not quite as simple as figuring out the right final goal to “put in” the system; rather, we'll have to try to understand the complex interplay between world-modeling algorithms and blunt, inexact constraints. On the bright side, however, we have 8,000,000,000+ working models to learn from, all of which manage (with varying success) to coexist!
You might be interested in Shaping Safer Goals.