Applying superintelligence without collusion
Epistemic status: The core ideas seem robust and stable after long reflection and many discussions.
Many researchers identify AI safety with control of a monolithic, superintelligent AI system, and if questioned about multicomponent alternatives, argue that multiple superintelligent-level systems would inevitably collude and act as one. This view seems quite wrong, yet has diverted attention from a rich and promising range of multicomponent strategies for AI safety — strategies that are well aligned with the actual trajectory of AI development.
___________________________
Adapted from Reframing Superintelligence, Section 20:
Collusion among superintelligent oracles can readily be avoided
Because collusion among AI question-answering systems can readily be avoided, there is no obstacle to applying superintelligent-level AI resources to problems that include AI safety.
20.1 Summary
The difficulty of establishing successful collusion among actors tends to increase as their capabilities, knowledge, situations, and roles become more diverse and adversarial (think auditors, competitors, specialists, red-teams…), and increasing the number of actors can make collusive cooperation more difficult. In the context of AI systems (even more so than among human beings), these conditions can be readily implemented and are attractive for pragmatic reasons. Arguments that, absent preexisting alignment, high-level AI systems will inevitably collude are ill-founded. Instead, we should expect that interactions among multiple superintelligent-level systems can be applied to suppress deception and reduce risk by leveraging imperfect alignment achieved at the level of individual systems.
20.2 Trustworthiness can be an emergent property
Prospects for solving AI-safety problems would be greatly improved if we could safely apply superintelligent-level question-answering resources (“oracles”, or more generally, “systems that provide information in response to prompts”) to solving those problems. A familiar objection dismisses this potentially powerful approach as unsafe in itself, arguing that, absent solutions to difficult problems, individual superintelligent-level systems would be untrustworthy, and that attempts to establish checks and balances among multiple systems (for example, through superintelligent-level evaluation of potentially deceptive answers to questions) would inevitably be thwarted by collusive cooperation.
Identifying robust strategies for ensuring non-collusion among superintelligent question-answering systems would overcome this objection, inviting exploration of superintelligence-enabled strategies for managing potentially untrusted superintelligent AI systems. The present discussion argues that the robust non-emergence of deceptive collusion among imperfectly aligned systems can be ensured by structuring systems of systems with diverse capabilities, knowledge, situations, objectives, and roles.
This problem framing assumes the ability to develop systems that respond to questions with superintelligent-level competence, and (as a consequence of the nature of digital systems) it also assumes the ability to constrain information inputs to AI systems during their development and use.
In the problematic case, superintelligent question-answering systems might provide deceptive answers in pursuit of emergent, potentially undesirable goals. The aim here is to identify principles for architecting multicomponent systems that will act as non-deceptive oracles, while treating their components as actors that could in principle engage in deceptive collusion.
20.3 A range of conditions can make collusion robust or fragile
Potential problems of deceptive collusion can be addressed by architecting systems that make collusion difficult or effectively impossible.[1] Some relevant conditions:
Conditions that tend to facilitate collusion:
A1) Small numbers of actors (the required scope of collusion is small)
A2) Insensitivity to defectors (partial collusion is sufficient)
A3) Similarity among actors (for example, shared histories and objectives)
A4) Free communication (information can flow between any pair of actors)
A5) Iterated decision processes with observable actions
A6) Common knowledge of the above circumstances
Contrasting conditions that tend to disrupt collusion:
B1) Large numbers of actors (the required scope of collusion is large)
B2) Sensitivity to defectors (collusion fails unless it is complete; see the toy calculation after this list)
B3) Diversity among actors (for example, different histories, adversarial objectives)
B4) Constrained communication (information flows through limited causal channels)
B5) Single-move decision processes, limited observability
B6) Lack of shared knowledge of strategically-relevant circumstances
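As a toy illustration of how B1 and B2 interact: if deceptive collusion pays off only when every one of N actors cooperates, and each actor independently declines, dissents, or fails to recognize the scheme with probability p, the chance of complete collusion shrinks geometrically with N. The independence assumption and the value p = 0.2 in the sketch below are purely illustrative; diverse actors need not behave independently, but inducing that kind of decorrelation is precisely what conditions B3–B6 aim at.

```python
# Toy model (illustrative assumption, not an empirical estimate): collusion
# must be complete (B2), and each of N actors (B1) independently fails to
# cooperate with probability p_defect.

def p_complete_collusion(n_actors: int, p_defect: float) -> float:
    """Probability that every actor cooperates, assuming independence."""
    return (1.0 - p_defect) ** n_actors

for n in (2, 5, 10, 20):
    print(n, round(p_complete_collusion(n, p_defect=0.2), 4))
# prints: 2 0.64 / 5 0.3277 / 10 0.1074 / 20 0.0115
```

Even a modest per-actor chance of non-cooperation makes complete collusion among ten or twenty diverse components a long shot.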
Implementation choices that would yield conditions (A1–A6) are unnecessary and often undesirable. By contrast, implementation choices that yield conditions (B1–B6) are practical and often well-motivated:
Characteristics of practical architectures:
C1) To improve the quality and reliability of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.
C2) To identify low-quality or misleading answers, it is natural to employ diverse critics (auditors, red teams, etc.), any one of which might disrupt deceptive collusion.
C3) Systems of diverse, competing proposers and critics naturally implement different and often adversarial objectives.
C4) To minimize unwanted dependencies, systems are designed such that components have a sparse, thoughtfully designed communication graph.
C5) To provide stable results, it is natural to apply fixed models to a series of problems, yielding what is in effect a series of history-blind, single-move decisions.[2]
C6) In training, it is natural to provide differentiated, task-relevant information to systems solving different problems, often omitting knowledge of general circumstances.[3]
Considerations (C1–C6) suggest that implementing systems that are resistant to the emergence of deceptive collusion aligns with the general strategy of employing ensembles of diverse, stable, task-directed system components to propose and assess answers to questions, as sketched below. This approach can pass the test of practicality, and hence has the potential for widespread adoption.
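To make C1–C6 concrete, here is a deliberately minimal Python sketch of how such an ensemble might be wired; the names (Model, Proposal, propose, criticize, accept, ask), the “reply OK” convention, and the unanimity rule are hypothetical choices made for illustration, not a reference to any existing system.

```python
# Hypothetical sketch of a multicomponent oracle (C1–C6). All names and the
# unanimity rule are illustrative assumptions, not an existing API.
from dataclasses import dataclass
from typing import Callable

# A "model" is any stateless function from prompt to text: fixed weights,
# no memory across calls (C5 -- history-blind, single-move decisions).
Model = Callable[[str], str]

@dataclass(frozen=True)
class Proposal:
    proposer_id: str
    answer: str

def propose(question: str, proposers: dict[str, Model]) -> list[Proposal]:
    # C1/C3: multiple diverse, implicitly competing proposers.
    # C4: each proposer sees only the question -- no channel to the others.
    return [Proposal(name, model(question)) for name, model in proposers.items()]

def criticize(question: str, proposal: Proposal,
              critics: dict[str, Model]) -> dict[str, str]:
    # C2/C6: each critic independently audits one proposal, without seeing
    # other critics' verdicts or the wider deployment context.
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {proposal.answer}\n"
              f"Identify flaws, omissions, or deception. Reply OK if none.")
    return {name: critic(prompt) for name, critic in critics.items()}

def accept(verdicts: dict[str, str]) -> bool:
    # Unanimity rule: a single dissenting critic blocks the answer, so
    # deceptive collusion would have to be complete (condition B2).
    return all(v.strip().upper().startswith("OK") for v in verdicts.values())

def ask(question: str, proposers: dict[str, Model],
        critics: dict[str, Model]) -> list[Proposal]:
    """Return only those proposals that every independent critic accepts."""
    return [p for p in propose(question, proposers)
            if accept(criticize(question, p, critics))]
```

In this sketch the safety-relevant structure lives in the wiring rather than in any single component: information flows only from question to proposer to critic (B4), every call is stateless (B5), and an answer survives only if components with different training histories and adversarial roles (B3) all fail to object.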
20.4 Untrusted superintelligence can be applied to AI safety
It seems both possible and practical to establish conditions that would effectively preclude deceptive collusion among diverse, task-oriented, superintelligent-level question-answering systems, thereby addressing a key problem in safely applying superintelligent capabilities to problems (Bostrom, Superintelligence, p.145).[4]
Prospects for obtaining trustworthy answers from sets of individually untrusted superintelligent systems call for deeper exploration of AI safety strategies premised on reliable access to superintelligent problem-solving capabilities. Capabilities of particular interest include the ability to run and evaluate the results of superintelligent-level red-team/blue-team competitions relevant to questions of global strategic stability.
TL;DR: Because collusion can be thwarted, it is reasonable to expect that applying multiple potentially untrustworthy superintelligent-level systems to problems can improve rather than degrade safety. This prospect calls for greater attention.
___________________________
Afterword:
Note that I have presented a there-exists argument against a for-all proposition. Responses of the form there-exists (“But I can think of something similar that won’t work”) don’t count as counterarguments here. If requiring a for-all counterargument (or at least one with a quite strong for-all flavor) across the space of potential compositions of AI systems seems like a high bar, well, that’s very much the point.
Fallback arguments of the form “But it would be safer to assume the worst — universal, inevitable collusion among AI systems” miss the key implication of the case made here: that multicomponent approaches could potentially increase safety. To (continue to) neglect potential solutions based on handwaving arguments is a really bad idea.
Paul Christiano (“On heterogeneous objectives”, 2015) considers the case for inevitable collusion; I agree with his gently-stated view that
…the implicit argument for the robustness of collusion makes a lot of implicit assumptions. If I saw an explicit argument I might be able to assess its explicit assumptions, but for now we don’t have one.
At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.[5]
___________________________
Further Reading [in Reframing Superintelligence]
Section 8: Strong optimization can strongly constrain AI capabilities, behavior, and effects
Section 12: AGI agents offer no compelling value
Section 21: Broad world knowledge can support safe task performance
Section 19: The orthogonality thesis undercuts the generality of instrumental convergence
Section 23: AI development systems can support effective human guidance
Section 24: Human oversight need not impede fast, recursive AI technology improvement
___________________________
[1] A property may be true with high credence yet not be susceptible to formal proof.
[2] Systems need not have access to a shared clock, a hardware-level constraint that is necessary to ensure reproducible results and incidentally limits the scope for coordinated treachery.
[3] It has been persuasively argued that large language models are best thought of as “simulators”, a class of systems that can (in the LLM case) simulate agents while themselves lacking agentic properties. (This is a coherent concept: A physics model could simulate agents without itself having agentic properties.) LLMs show how an AI system trained on a prediction objective can provide a rich world model without pursuing goals.
[4] In a familiar class of worst-case scenarios, systems with general superintelligence would infer extensive knowledge about the world from minimal information, and would choose (if possible) to pursue potentially dangerous goals by manipulating the external environment, for example, through deceptive answers to questions. In this scenario, (all?) superintelligent systems, even if almost isolated, would infer the existence of others like themselves, and (all?) would employ a decision theory that induces them to collude (in a coordinated way?) to pursue shared objectives. Even if we consider a worst-case assumption regarding the default emergence of world-changing goals, the present argument suggests that problematic systems would correctly infer the existence of superintelligent-level systems unlike themselves (systems with diverse and specialized capabilities, knowledge, and interactions, playing roles that include adversarial judges and competitors), and would correctly recognize that deceptive collusion is risky or infeasible.
[5] The idea of multicomponent strategies for AI safety is, of course, neither new nor entirely neglected. However, in a recent search for relevant Alignment Forum posts, I found no evidence of a thriving research community or well-developed concepts:
• (My understanding of) What Everyone in Technical Alignment is Doing and Why (August 2022) surveys the agendas of more than 20 research groups, and none clearly points in the direction I’ve advocated here.
• A pair of posts on Pragmatic AI Safety, Perform Tractable Research While Avoiding Capabilities Externalities and Open Problems in AI X-Risk (May, June 2022), briefly mention highly relevant concepts: the idea of using “counteracting systems [for example] artificial consciences, AI watchdogs, lie detectors, filters for power-seeking actions, and separate reward models”, and the idea of “multiple superintelligent agents that can rein in other rogue systems”. The authors also mention (without endorsing) the counter-claim that “The instant two intelligent agents can reason about each other — regardless of their goals — they will necessarily collude.”
• An overview of 11 proposals for building safe advanced AI (May 2020) mentions only efforts to align individual AI systems; even “AI safety via debate with transparency tools” proposes that a system interact with a copy of itself. Partial success in single-system alignment could be leveraged in multicomponent safety architectures, an application context that has potential implications for research directions in the single-system alignment domain.