Many arguments about AI risk depend on the notion of “optimizing”, but so far it has eluded a good definition. One natural approach is to say that an optimizer causes the world to have higher values according to some reasonable utility function, but this seems insufficient, as then a <@bottle cap would be an optimizer@>(@Bottle Caps Aren’t Optimisers@) for keeping water in the bottle.
This post provides a new definition of optimization, by taking a page from <@Embedded Agents@> and analyzing a system as a whole instead of separating the agent and environment. An **optimizing system** is then one which tends to evolve toward some special configurations (called the **target configuration set**), when starting anywhere in some larger set of configurations (called the **basin of attraction**), _even if_ the system is perturbed.
For example, in gradient descent, we start with some initial guess at the parameters θ, and then continually compute loss gradients and move θ in the appropriate direction. The target configuration set is all the local minima of the loss landscape. Such a program has a very special property: while it is running, you can change the value of θ (e.g. via a debugger), and the program will probably _still work_. This is quite impressive: certainly most programs would not work if you arbitrarily changed the value of one of the variables in the middle of execution. Thus, this is an optimizing system that is robust to perturbations in θ. Of course, it isn’t robust to arbitrary perturbations: if you change any other variable in the program, it will probably stop working. In general, we can quantify how powerful an optimizing system is by how robust it is to perturbations, and how small the target configuration set is.
The bottle cap example is _not_ an optimizing system because there is no broad basin of configurations from which we get to the bottle being full of water. The bottle cap doesn’t cause the bottle to be full of water when it didn’t start out full of water.
Optimizing systems are a superset of goal-directed agentic systems, which require a separation between the optimizer and the thing being optimized. For example, a tree is certainly an optimizing system (the target is to be a fully grown tree, and it is robust to perturbations of soil quality, or if you cut off a branch, etc). However, it does not seem to be a goal-directed agentic system, as it would be hard to separate into an “optimizer” and a “thing being optimized”.
This does mean that we can no longer ask “what is doing the optimization” in an optimizing system. This is a feature, not a bug: if you expect to always be able to answer this question, you typically get confusing results. For example, you might say that your liver is optimizing for making money, since without it you would die and fail to make money.
The full post has several other examples that help make the concept clearer.
Planned opinion:
I’ve <@previously argued@>(@Intuitions about goal-directed behavior@) that we need to take generalization into account in a definition of optimization or goal-directed behavior. This definition achieves that by primarily analyzing the robustness of the optimizing system to perturbations. While this does rely on a notion of counterfactuals, it still seems significantly better than any previous attempt to ground optimization.
I particularly like that the concept doesn’t force us to have a separate agent and environment, as that distinction does seem quite leaky upon close inspection. I gave a shot at explaining several other concepts from AI alignment within this framework in this comment, and it worked quite well. In particular, a computer program is a goal-directed AI system if there is an environment such that adding the computer program to the environment transforms it into a optimizing system for some “interesting” target configuration states (with one caveat explained in the comment).
Planned summary for the Alignment Newsletter:
Planned opinion: