Positive Attractors

[DL] Positive attractors are perhaps best understood in contrast to negative attractors. Negative attractors are scenarios that lead to failure modes. Some types of negative attractors include deceptive alignment, Sharp Left Turns, and instrumental convergence; these all entail one or more bad outcomes for humanity, and, conditional on the current state of alignment research, they do so robustly. In other words, we’ve identified scenarios that converge to failure, and we don’t have any immediate solutions to them.

This “identify failure → attempt to resolve failure” loop is a standard procedure in alignment research. However, we propose an alternative route: instead of starting from failure and aligning from there, is it possible to force scenarios that naturally lead away from failure modes? Do positive attractors exist?

[RK] When we train an AI, the training involves the gradual change of an algorithm towards a target, e.g. minimising a loss function. There is an intractable space of possible algorithms, within which the actual algorithm moves according to the learning mechanisms (of e.g. gradient descent). Further aspects of this algorithm can be selected for, based on the precise architecture or training set for the network.

Danger can arise for multiple reasons. It could be the case that there are multiple plausible destinations (or local minima) for our algorithm transformation. Some of these destinations (or attractor states) involve problematic qualities beyond the simple minimization of the loss function – it may even be the case that all “great” solutions in terms of the loss function involve some form of perverse instantiation. In addition, there may be safe and unsafe regions for the algorithm to traverse as it changes towards one of these attractor basins. For example, once an AI has “discovered a hack” during training, this will steer its further algorithmic development, e.g. if the hack is costly to unlearn.
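
As a toy illustration of this attractor-basin picture (a minimal sketch, not anything the argument depends on), consider a one-dimensional loss with two minima that score equally well: which basin gradient descent ends up in depends entirely on where the algorithm starts and the path it takes.

```python
# Toy loss with two equally good minima, standing in for two attractor
# basins: an "intended" solution near x = -1 and a problematic solution
# near x = +2 that scores just as well on the loss.
def loss(x):
    return 0.5 * (x + 1) ** 2 * (x - 2) ** 2

def grad(x, eps=1e-5):
    # Numerical gradient, to keep the sketch self-contained.
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)

def descend(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Two initialisations, two different attractor basins, identical loss value.
print(round(descend(-2.0), 3))  # converges near -1.0
print(round(descend(3.0), 3))   # converges near  2.0
```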

The intuition for positive attractors is that, if we have the goal of avoiding “dangerous waters”, we should not only try to understand and avoid dangerous regions of the search space, but also actively steer to remain in safe waters, perhaps prevent certain classes of shortcuts, and aim at attractor basins that we understand to have better safety properties.
Rather than relying on our ability to enumerate every possible failure mode ahead of time and explicitly avoid each one, it could be beneficial to increase the resilience of our system against negative change, decreasing the likelihood of falling into previously unaddressed failure modes.

Definition

[DL] I find that the difference between positive and negative attractors is best illustrated through an example.

To generalize, a negative attractor involves a convergent behavior that occurs once certain properties are satisfied. For example, the theory of instrumental convergence claims that any agent with sufficient intelligence and generalizability will, under relatively light assumptions about its environment and terminal objective(s), pursue similar sub-goals such as self-preservation, goal-preservation, and self-improvement. These sub-goals may not be aligned with human interests, so a natural proposal is to figure out ways to reduce the impact if the theory holds true.

A positive attractor exhibits the same convergent behavior, but towards states that are in some way helpful for alignment. We can for example imagine a utopian version of instrumental convergence, where any agent with sufficient intelligence, generalizability, and kindness will pursue human-aligned sub-goals such as understanding human preferences and not harming weaker agents. Finding these aligned convergent behaviors is the crux of positive attraction.

There are a handful of positive attractor candidates in current AI alignment literature, albeit not categorized under that name. One example is the broad concept of scaffolding: iteratively improve an agent by using previous iterations of the agent. Paul Christiano’s IDA agenda is a more specific example.

Another is the sub-field of value-learning [1], where formulations or frameworks for learning values converge on some ideal AI behavior that cleaves away a decent region of the failure mode space.

Note that positive attraction is not a metric or a solution to alignment. Positive attraction is an idea generator that attempts to expose possible solutions in an under-explored territory. Positive attraction encourages goal-oriented research rather than failure-oriented research, the primary motivation being that goal-oriented research warrants more investigation.

Furthermore, positive attractors do not necessarily lead to robust alignment; rather, they behave more like failure-minimizers. Think of positive attraction as a vaccine rather than an antidote: any scenario that robustly diverges from a failure mode is, by definition, a positive attractor.

Limitations and Potential

[DL] A major assumption in this post is that positive attractors have generally evaded research because the alignment field places greater focus on explicating failure modes. Yet there are many other reasons why positive attractors aren’t a popular object of study.

For one, positive attractors are more difficult to conceptualize than negative attractors, mainly because the space of failure-resistant scenarios is much smaller than the space of scenarios that simply fail. If this weren’t the case, we’d probably have solved alignment by now.

Secondly, positive attractors are a broad enough class that they carry no guarantees of efficacy or stability. It’s entirely possible for a positive attractor to appear as if it resisted a failure mode (like RLHF, for instance), only for strong arguments against its usefulness to surface in the long run. And because positive attractors are not required to solve the alignment problem, one can debate whether or not positive attraction is a useful frame for research at all.

Yet I’d argue it’s worthwhile to view alignment research through the lens of positive attractors: what scenarios, if any, can avoid failure modes by virtue of their construction? What properties do AIs need in order not to enter misaligned territory? Risk scenarios are often used to persuade people of the importance of AI safety, but I believe risk-resistant scenarios can work just as well.

[JP] To an extent, the positive attractors agenda is making a way of approaching alignment research more explicit. Explicitly articulating approach-classes has been useful in science before, even when it has encapsulated existing lines of thinking (e.g. Tinbergen’s Four Whys in ethology; Tinbergen, 1963).

Additionally, prompting a shift in research approach has been helpful in Psychology with the positive psychology movement (Seligman, 2019). This movement grew in response to Psychology’s previous focus on moving people from acute psychological distress to a more normal state, a focus that missed the possibility of taking people from a normal state into higher states of flourishing by actively cultivating their positive qualities.

It is, however, worth noting that the focus on negative attractors and specific solutions that respond to them seems to be more predominant in the literature at the moment.

[RK] Whether this gives further motivation to nudge research focus towards positive attractors depends on whether it can usefully compete with and thereby improve negative attractor research, rather than adding confusion or diluting research efforts. With more people coming into the field, this seems like a sensible meta research-bet.

Encouraging complacency

[JP] There is a risk that a broader range of possible alignment solutions, and of results suggestive of alignment, could lead people to believe that an AGI is aligned when it is not.

Imagine a world that is on the cusp of creating a dangerous AGI, with multiple organisations competing. In this scenario, for game-theoretic reasons, there will be strong incentives to keep developing the capabilities of the AGI and potentially train or release it unsafely. If the positive attractor project is successful, it would generate more candidate alignment solutions. This is what we want as an alignment field.

However, having such candidates implemented may make people more likely to believe that an AGI is aligned, and therefore more likely to train or release it into the world. This could happen long before the solution has been developed to a suitable standard of rigour, especially because positive attractors function more as vaccines than antidotes: they may not scale sufficiently to new capabilities, and their “attraction force” may be too weak or unspecific to constrain newly introduced dangers that may only appear in systems beyond human capability. This could be exacerbated by motivated reasoning driven by the strong incentives to push ahead. Fully relying on positive attractors, rather than also rigorously addressing the failure/danger modes that we can clearly identify, would be highly irresponsible and could lead to solutions that are only deceptively aligned.

Positive attractor candidates

Self-Reflection

[DL] Self-reflection is when an agent is able to evaluate its own models and actions. An agent with ideal self-reflective capabilities can:

  1. Re-align their strategies and goals

  2. Self-improve

  3. Strengthen their decision-making (or the opposite, as we will see in the failure modes section)

Agents need self-reflection to traverse extended environments (Alexander et al.), or environments that react based on an agent’s hypothetical actions. One such environment arises when an agent encounters an entity that knows what the agent is thinking (Newcomb-like problems). The agent must then invoke second-order reasoning (reasoning about reasoning) to make viable decisions.
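
To make the idea concrete, here is a minimal toy sketch of our own (not code from Alexander et al.): the environment obtains the agent’s hypothetical behaviour by querying its policy on a situation it never actually presents, and the payoff depends on that hypothetical answer.

```python
# Toy "extended environment" step: the reward depends on what the agent
# *would* do in a hypothetical situation, which the environment obtains
# by simulating the agent's policy directly (a Newcomb-style predictor).
def extended_step(policy, observation):
    # The environment probes the policy on a situation the agent never
    # actually sees, i.e. a "what would you do if..." query.
    hypothetical_action = policy("hypothetical: an accurate predictor offers one box or two")
    # The agent then acts on the real observation.
    actual_action = policy(observation)
    # The payoff hinges on the hypothetical answer, so doing well requires
    # the agent to reason about its own reasoning.
    reward = 100.0 if hypothetical_action == "one box" else 1.0
    return actual_action, reward

# Example: a policy that always one-boxes in Newcomb-like probes.
print(extended_step(lambda obs: "one box", "start"))  # ('one box', 100.0)
```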

More generally, self-reflection enables an agent to make corrections and explore new actions. The former is a particularly useful quality for alignment, while the latter falls in the capabilities bucket.

Currently, we do have LLM architectures that can mimic self-reflective capabilities (Reflexion, Inner Monologue). The reason I say “mimic” is that the language model itself doesn’t self-reflect; rather, self-reflection is controlled by an external loop. [2]
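
For intuition, here is a rough sketch of what “self-reflection controlled by an external loop” looks like. The helper names (llm, run_episode) are placeholders, and this is a simplification rather than the actual Reflexion implementation.

```python
# Sketch of an external reflection loop around a language model.
# `llm` and `run_episode` are hypothetical helpers, not real APIs.
def reflective_agent(task, llm, run_episode, max_trials=3):
    reflections = []  # memory of self-critiques carried across trials
    trajectory, success = None, False
    for _ in range(max_trials):
        trajectory, success = run_episode(task, llm, hints=reflections)
        if success:
            break
        # The model never decides to reflect on its own; the outer loop
        # prompts it to critique the failed attempt and stores the result.
        critique = llm(
            f"Task: {task}\nFailed attempt: {trajectory}\n"
            "Reflect on what went wrong and how to do better next time."
        )
        reflections.append(critique)
    return trajectory, success
```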

Failure modes

Self-reflection is a powerful skill for agents to acquire, but like other novel capabilities, it opens up new problem spaces. During my research, I identified three ways self-reflection may lead to unintended behavior, largely based on the contents of the Reflexion paper. This list is non-exhaustive.

Inefficient decision-making

Even if an agent has the capacity to re-evaluate its models, poor decision-making skills may render self-reflection useless. In the Reflexion paper, the authors labeled this failure mode as when the system “executed more than 30 actions without reaching a successful state,” reasoning that the system should ideally plan clear, concise strategies rather than exploiting a brute-force search.

The corrective property of self-reflection, then, also hinges on how competent the reflected action is relative to the previous one.

Hallucinations

If the agent is prone to hallucinatory behavior, self-reflection may degrade its performance rather than improve it.

An instinctive thought would be that self-reflection ought to inhibit hallucinations, but if the agent already has strong priors, there is no incentive to perform a re-evaluation. Furthermore, hallucinations may proceed to “infect” the self-reflective process, such that the agent comes to depend on false observations to reason about future states.

The Reflexion paper also analyzed hallucinations, but defined it as “the occurrence of two or more consecutive identical actions in which the environment responded with the same observation.”

Overthinking

An agent may conduct self-reflection too often, which reduces performance and may lead to other failure modes like compulsions or hallucinations. This failure mode naturally raises the question: how will agents decide when to self-reflect?

In the Reflexion paper, the authors use a heuristic that prompts self-reflection whenever hallucinations or inefficient planning are detected. We can imagine similar heuristics for prompting self-reflection based on metrics such as confidence level, task type, etc.

Overthinking is more likely to be a failure mode when self-reflection is learned (i.e. an emergent ability), in other words when the model itself controls when self-reflection ought to be invoked. The Reflexion architecture, by contrast, is explicitly coded to limit the number of self-reflections per action, which mitigates the problem.
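
Putting the pieces together, a trigger heuristic of the kind described above might look roughly like the following sketch (our paraphrase of the quoted definitions, not the authors’ code):

```python
MAX_ACTIONS = 30      # "more than 30 actions without reaching a successful state"
MAX_REFLECTIONS = 3   # explicit cap on self-reflections, to curb overthinking

def is_hallucinating(history):
    # Two or more consecutive identical actions for which the environment
    # responded with the same observation (the paper's working definition).
    if len(history) < 2:
        return False
    (prev_action, prev_obs), (last_action, last_obs) = history[-2], history[-1]
    return prev_action == last_action and prev_obs == last_obs

def should_reflect(history, success, reflections_so_far):
    # `history` is a list of (action, observation) pairs for the current trial.
    if success or reflections_so_far >= MAX_REFLECTIONS:
        return False
    inefficient = len(history) > MAX_ACTIONS
    return inefficient or is_hallucinating(history)
```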

Self-reflection as a positive attractor

One promising artifact of self-reflection is the ability for an agent to realign its goals. A well-known problem in AI alignment concerns how a system will deal with distributional shift, where the environment changes in such a way that potentially new paths to the objectives open up, or even such that the original objectives are no longer applicable.

[RK] A self-reflective agent would be expected to possess the ability to consider how it would behave under hypothetical scenarios, and could therefore do some of the heavy lifting of checking its behaviour under various circumstances. This is made more reliable by the potential advantage of self-reflection increasing the coherence of the agent. While a dangerous proposition by itself, if (and only if) we could guarantee alignment during training, this may be an interesting proposition for ensuring the agent’s continued alignment in actual out-of-distribution scenarios.

[DL] Self-reflection also exhibits the safety–capability trade-off, in that self-reflective agents will necessarily become more powerful than agents without such capabilities. A possible research question, then, is how one could direct self-reflection to target corrective behaviors rather than progressing capabilities.

[RK] I would also like to note that, as with some other dangerous properties one could put into an agent, self-reflective capability might naturally emerge during the training of agentic or partially agentic systems that are at least partially selected for coherence. Under such circumstances, it may be advantageous to deliberately build an interpretable self-reflection capability into the system early on, in order to understand and steer what would otherwise show up in ways and places that we understand less well.

This idea generalizes to a larger class of positive attractor candidates that feature scaffolds or modules for capabilities that we expect (or know) a given system to develop. While these may make a system more capable more quickly in some cases, they would also give researchers more control and understanding of these capabilities. This is especially true in cases where the default implementation of a capability (say, sequential planning) is very opaque and intractable to interpret.
We can’t fully endorse such approaches at this time, but it is at least worth considering that, if we do build a system expected to learn particular instrumental capabilities, it may be safer to take an active role in forming these capabilities rather than waiting for them to emerge purely from the learning process. The strong recommendation is of course to not build a system capable of acquiring problematic instrumental capabilities in the first place, at least until we are very certain that we can peacefully coexist.

Checkpointing

[DL] Checkpointing is a broad concept that can be roughly described as constraining the action space. The main idea is this: we have an AI targeting some objective, and we know there are many paths (action sequences) that the AI can take. We can assume all paths satisfy the objective more or less, but not all paths will strictly exhibit aligned behavior.

Checkpointing establishes intermediate objectives (think of them as milestones) along the broader objective to restrict the space of possible paths. An ideal checkpoint system would form a region in space that is aligned—in other words, an AI satisfying those intermediate objectives will be forced to take highly restricted paths with little room for misalignment.

Let’s consider the classic example of a cleaning robot. The robot spots a pile of dirt on the kitchen floor, and its objective is to clean the dirt and restore the kitchen to a pre-dirt state. There are many actions the robot can take:

  1. Sweep the dirt into a bin and dump it in the trash

  2. Vacuum up the dirt

  3. Sweep the dirt under the fridge (out of sight, out of mind)

  4. Spread the dirt over a large enough area until the kitchen no longer registers as dirty…

We can hedge our bets over whether or not a sufficiently capable robot will stick with (1) or (2) via “good training” by default, but the point stands: there are numerous proxies a cleaning robot can target, and it’s difficult to align all of them.

One way we can frame this concept is by thinking about crosswords. When we read the clue to a particular row/column, the clue sometimes points to many different solutions. As we introduce letters into the true solution (checkpoints), we narrow down the space of possible words until the correct word is guaranteed.

Translating this idea to AI safety: if we force an agent to satisfy intermediate checkpoints within the broader objective, it will be pushed into a narrow set of behaviors that closely follow the intended behavior, given that said checkpoints model the constraints well. The frame’s nuance, however, is that crossword answers are strictly finite, while the space of real-world actions is much, much larger.
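
As a toy formalisation of the crossword picture, we can treat checkpoints as predicates that every acceptable plan must satisfy and watch how quickly they shrink the candidate set. The particular predicates below are invented for the cleaning-robot example.

```python
from itertools import product

ACTIONS = ["sweep_to_bin", "vacuum", "sweep_under_fridge", "spread_dirt"]

# Candidate plans: every length-2 action sequence (kept tiny on purpose).
plans = list(product(ACTIONS, repeat=2))

# Checkpoints: intermediate conditions an acceptable plan must satisfy.
checkpoints = [
    lambda plan: "spread_dirt" not in plan,               # never spread the dirt around
    lambda plan: plan[-1] in ("sweep_to_bin", "vacuum"),  # finish by disposing of the dirt
]

allowed = [p for p in plans if all(check(p) for check in checkpoints)]
print(len(plans), "candidate plans ->", len(allowed), "after checkpoints")
# 16 candidate plans -> 6 after checkpoints
```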

[RK] Another, more general, instance of checkpointing is to target particular attributes or sub-algorithms for the AI system to have at specific points during training (as opposed to just targeting behavior). Intuitively, this should mean that the system is required to move through certain regions of algorithmic space during its development, to “visit the checkpoint”, which could potentially be accomplished by cleverly using multiple loss functions during training. If successful, such approaches would allow us to better shape the learning path that we intend for the system, while of course restricting its ability to find a solution in “unusual” ways.
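
One very rough way to read “using multiple loss functions during training” is as a schedule of auxiliary loss terms that are heavily weighted only during particular phases, so the system is pushed through the intended region of algorithm space before the main objective dominates. The sketch below uses placeholder loss functions; it illustrates the scheduling idea only.

```python
def task_loss(model, batch):
    # Placeholder for the main training objective.
    return 0.0

def checkpoint_loss(model, batch):
    # Placeholder for a term that rewards the targeted intermediate
    # attribute or sub-algorithm ("visiting the checkpoint").
    return 0.0

def total_loss(model, batch, step, warmup_steps=10_000):
    # Early on, the checkpoint term dominates and steers the model through
    # the intended region of algorithm space; later its weight is reduced
    # so the main objective takes over.
    weight = 10.0 if step < warmup_steps else 0.1
    return task_loss(model, batch) + weight * checkpoint_loss(model, batch)
```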

Problems

[DL] One problem is that defining checkpoints is rather similar to defining objectives: both are easy to model intuitively, but hard to model mathematically. In fact, I fear that modelling a broad objective + checkpoints is essentially the same as modelling one narrow objective. An open line of research is figuring out whether there is any benefit in separating constraints from the goal.

Furthermore, because checkpoints constrain behavior, this may limit generalizability. One idea is to find checkpoints that are invariant under generalization. In other words, checkpoints that can cleave away the unaligned regions, yet enable enough freedom for the AI to be generally capable.

The concept of generalizability may run counter to alignment (we want control rather than autonomy), but I believe generalization is important when the AI encounters many different environments, else we run the risk of overfitting.

Lastly, I am not sure if checkpointing will be viable in the long run, especially if the AI is capable enough to evade human oversight. An immediate concern is whether or not checkpoints will constrain the action space enough. Finding a proper specification to avoid outer misalignment is hard; I’d imagine specifying the right checkpoints will be equally hard. The assumption here is that it’s a bad idea to rely on manipulating the behavior of a superintelligent AI (which is what checkpoints do), conditional on the premise that a superintelligent AI will inevitably find some misaligned shortcut to achieve its goals more efficiently.

Positive attractors & human values

[JP] Basic human values have structural properties that justify them as a candidate positive attractor.

Here we will draw on the Schwartz model of basic human values (Sagiv & Schwartz, 2022; Schwartz, 1992).

Values are defined as trans-situational guiding principles that reflect motivations someone considers important in their life. They are not bound to specific situations, meaning that someone valuing, for example, benevolence will value benevolence in different domains (e.g. work, leisure, relationships). Unlike needs (e.g. food), they are consciously accessible, and unlike needs and traits, they are always considered desirable by the individual holding them. They also tend to be highly stable across time (Milfont et al., 2016; Vecchione et al., 2016).

Drawing on survey data, the original model identified ten basic human values (Schwartz, 1992). These values formed a motivational circle, with each value represented as a wedge within the circle; a value’s position indicates its relationship to the other values. Analyses showed that the values formed a motivational structure: the more a value on one side of the circle was endorsed (e.g. universalism), the less likely the value on the opposite side (e.g. power) was to be endorsed. Additionally, this structure was found across numerous cultures, which suggests it is a human universal.

One main purpose of a positive attractor is that its presence should reduce the likelihood of an agent being drawn by a negative attractor. One way this could be interpreted is that there should be a degree to which the motivation expressed by the positive attractor is mutually exclusive with that expressed by the negative attractor. In humans, this seems to be the case based on the empirical work discussed above. To highly value, for example, universalism, seems to be mutually exclusive with highly valuing having power to control others.

Another feature of basic human values is that they are stable over time (Milfont et al., 2016; Vecchione et al., 2016). They become an integral part of an individual’s self-concept. We would want an AI agent with positive values to retain those values over time, making these features also relevant for a candidate positive attractor.

It remains unknown whether this structure would transfer into AI agents. One possibility is that the structure of values is not just a descriptor of human psychology, but reflects something fundamental about motivation that can be abstracted away from the specific class of agent. If this is the case, then this structure may also be found in sufficiently advanced AI agents. If we could find a way of training the formation of positive values, this could then by default reduce the likelihood of a behaviour expressive of an opposing value (such as power-seeking).

Footnotes

  1. The positive-attractiveness of value-learning could be debated if there existed a general enough framework that could learn any value, even bad ones.

  2. One counter to this is whether separating the model from the architecture is useful at all. Auto-GPT, for example, is built out of modules that form an AI agent, and can evaluate its own reasoning and decompose goals into sub-goals. One could say that Auto-GPT exhibits self-reflective and planning capabilities, yet such capabilities are merely prompted on behalf of an independent driver. Where should we draw the boundaries?

Additional References