I see three distinct reasons for the (non-)existence of terminal goals:
I. Disjoint proxy objectives
A scenario in which there seems to be reason to expect no global, single, terminal goal:
1. Outer loop pressure converges on multiple proxy objectives specialized to different sub-environments in a sufficiently diverse environment.
2. These proxy objectives will be activated in disjoint subsets of the environment.
3. Activation of proxy objectives is hard-coded by the outer loop. Information about when to activate a given proxy objective is under-determined at the inner loop level.
In this case, even if there is a goal-directed wrapper, it will face optimization pressure to leave the activation of the proxy objectives described by 1-3 alone. Instead it will restrict itself to controlling other proxy objectives which do not fit assumptions 1-3.
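To make the setup concrete, here is a minimal Python sketch, under assumptions of my own (the objective names, the weighting scheme, and the `protected` flag are all hypothetical illustrations, not a claim about any particular architecture): several proxy objectives whose activation conditions are hard-coded by the outer loop, and a wrapper that can only reweight the objectives whose activation information is available to it.

```python
# Toy model of assumptions 1-3 (all names and the weighting scheme are
# hypothetical): proxy objectives whose activation is hard-coded by the
# outer loop, and a wrapper that can only reweight the unprotected ones.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ProxyObjective:
    name: str
    score: Callable[[dict], float]    # reward proxy for one sub-environment
    activate: Callable[[dict], bool]  # activation condition set by the outer loop
    protected: bool                   # True if activation info is unavailable to the inner loop


def act(objectives: Dict[str, ProxyObjective],
        wrapper_weights: Dict[str, float],
        obs: dict) -> float:
    """Combine whichever proxy objectives fire on this observation.

    The wrapper may only reweight unprotected objectives; protected
    activations are left exactly as the outer loop wired them.
    """
    total = 0.0
    for obj in objectives.values():
        if not obj.activate(obs):
            continue
        weight = 1.0 if obj.protected else wrapper_weights.get(obj.name, 1.0)
        total += weight * obj.score(obs)
    return total


# Example: the wrapper down-weights 'explore' but cannot touch 'forage'.
objectives = {
    "forage": ProxyObjective("forage", lambda o: o["food"], lambda o: o["hungry"], protected=True),
    "explore": ProxyObjective("explore", lambda o: o["novelty"], lambda o: True, protected=False),
}
act(objectives, {"explore": 0.3}, {"food": 1.0, "hungry": True, "novelty": 0.5})  # -> 1.15
```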
Reasons why this argument may fail:
As capabilities increase, the goal-directed wrapper comes to recognize when it lacks information relative to the information used in the outer loop. The optimization pressure for the wrapper not to interact with these ‘protected’ proxy objectives then dissipates, because the wrapper can interact with them intelligently by accounting for its own limitations.
As capabilities increase, one particular subroutine learns to self-modify and override the original wrapper’s commands; where the original wrapper was content with multiple goals, this new subroutine was optimized to pursue only a single proxy objective.
Conclusion: I’d expect a system described by points 1-3 to emerge before the counterarguments come into play. This initial system may already gradient hack to prevent further outer-loop pressure. In that case, the capabilities increase assumed in the two counterargument bullets may never occur. Hence it seems perfectly coherent to me to believe both (A) that the first transformative AI is unlikely to have a single terminal goal, and (B) that a sufficiently advanced AI would have a single terminal goal.
II. AI as market
If an AI is decentralized because of hardware constraints, or because decentralized/modular cognitive architectures are for some reason more efficient, then perhaps the AI will develop a sort of internal market for cognitive resources. In such a case, there need not be any pressure to converge to a coherent utility function. I am not familiar with this body of work, but John Wentworth claims that there are relevant theorems in the literature here: https://www.lesswrong.com/posts/L896Fp8hLSbh8Ryei/axrp-episode-15-natural-abstractions-with-john-wentworth#Agency_in_financial_markets_
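As a loose illustration of what an internal market for cognitive resources could look like, here is a toy Python sketch; the module names and the bid-half-your-budget rule are assumptions of mine for illustration, not anything taken from the linked discussion. The point is only that allocation emerges from local bids and budgets rather than from a shared utility function.

```python
# Toy 'internal market' for compute (the bidding rule and module names are
# illustrative assumptions): each module bids half of its remaining budget
# for one unit of compute at a time; no shared utility function exists.
from typing import Dict


def allocate_compute(budgets: Dict[str, float], compute_units: int) -> Dict[str, int]:
    """Auction compute one unit at a time: the highest bidder wins and pays its bid."""
    allocation = {m: 0 for m in budgets}
    for _ in range(compute_units):
        bids = {m: 0.5 * b for m, b in budgets.items()}  # stand-in bidding policy
        winner = max(bids, key=bids.get)
        if bids[winner] <= 0:
            break
        allocation[winner] += 1
        budgets[winner] -= bids[winner]
    return allocation


# Example: richer modules win early rounds but their budgets shrink, so
# compute ends up spread across modules without any global optimizer.
allocate_compute({"planner": 10.0, "perception": 6.0, "memory": 4.0}, compute_units=5)
# -> {'planner': 2, 'perception': 2, 'memory': 1}
```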
III. Meta-preferences for self-modification (lowest confidence; I’m not sure whether this is confused, and it may simply be a reframing of reason I)
Usually we imagine subagents as having conflicting preferences and no meta-preferences. Instead, imagine a system in which each subagent has developed meta-preferences to prefer being displaced by other subagents under certain conditions.
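A minimal sketch of what such meta-preferences might look like, under a hypothetical setup of my own (subagent names, the fatigue observation, and the thresholds are all illustrative): each subagent carries object-level preferences plus a rule naming the subagent it prefers to be displaced by, and under what conditions.

```python
# Toy subagents with meta-preferences (names and thresholds are illustrative):
# each subagent has object-level preferences plus a rule naming the subagent
# it prefers to be displaced by, and under what conditions.
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Subagent:
    name: str
    choose_action: Callable[[dict], str]            # object-level preferences
    yield_control: Callable[[dict], Optional[str]]  # meta-preference: successor, or None


def step(subagents: Dict[str, Subagent], active: str, obs: dict) -> Tuple[str, str]:
    """The active subagent acts, then its meta-preference decides who is active next."""
    agent = subagents[active]
    action = agent.choose_action(obs)
    successor = agent.yield_control(obs)
    next_active = successor if successor in subagents else active
    return action, next_active


# Example: a 'work' subagent prefers to be displaced by 'rest' once fatigue is high.
subagents = {
    "work": Subagent("work", lambda o: "keep_working",
                     lambda o: "rest" if o.get("fatigue", 0) > 0.8 else None),
    "rest": Subagent("rest", lambda o: "sleep",
                     lambda o: "work" if o.get("fatigue", 0) < 0.2 else None),
}
step(subagents, "work", {"fatigue": 0.9})  # -> ("keep_working", "rest")
```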
In fact, we humans are probably examples of all I-III.