The Contrarian ‘AI Alignment’ Agenda
Overall Thesis: technical alignment is generally irrelevant to outcomes, yet almost everyone in the AI Alignment field is stuck on the opposite, incorrect assumption and keeps working on the technical alignment of LLMs
(1) aligned superintelligence is provably logically realizable [already proved]
(2) aligned superintelligence is not just logically but also physically realizable [TBD]
(3) ML interpretability/mechanistic interpretability cannot possibly be logically necessary for aligned superintelligence [TBD]
(4) ML interpretability/mechanistic interpretability cannot possibly be logically sufficient for aligned superintelligence [TBD]
(5) given certain minimal intelligence, minimal emulation ability of humans by AI (e.g. the AI understands common-sense morality and cause and effect) and of AI by humans (humans can do multiplication, etc.), the internal details of AI models cannot possibly make a difference to the set of realizable good outcomes, though they can make a difference to the ease/efficiency of realizing them [TBD]
(6) given near-perfect or perfect technical alignment (= the AI will do what its creators ask of it, with their intent correctly understood), awful outcomes are still a Nash equilibrium for rational agents [TBD]
(7) small or even large alignment deviations make no fundamental difference to outcomes: the boundary between good and bad outcomes is determined by game theory, mechanism design and initial conditions, plus only a satisficing condition on alignment fidelity, a bar that sits below the level of alignment of current humans (and AIs) [TBD]
(8) there is no such thing as superintelligence anyway, because intelligence factors into many specific expert systems rather than one all-encompassing general-purpose thinker. No human has a job as a “thinker”; we are all quite specialized. Thus it doesn’t make sense to talk about “aligning superintelligence”, but rather about “aligning civilization” (or some other entity which has the ability to control outcomes) [TBD]
No human has a job as a scribe, because literacy is above 90%.
I don’t think that unipolar/multipolar scenarios differ greatly in outcomes.
Yes, correct. But people have jobs as copywriters, secretaries, etc. People specialize, because that is the optimal way to get stuff done.
Yes, but that doesn’t mean that an unspecialized AGI is going to be worse than a specialized human.
When did I say that?
The key insight here is that
(1) “Entities which do in fact control outcomes”
and
(2) “Entities which are near-optimal at solving the specific problem of grabbing power and wielding it”
and
(3) “Entities which are good at correctly solving a broad range of information processing/optimization problems”
are three distinct sets of entities which the Yudkowsky/Bostrom/Russell paradigm of AI risk has smooshed into one (“The Godlike AI will be (3) so therefore it will be (2) so therefore it will be (1)!”). But reality may simply not work like that, and if you look at the real world, (1), (2) and (3) are all distinct sets.
The gap between (3) and (2) is the advantage of specialization. Problem-solving is not a linear scale of goodness; it’s an expanding cone where advances in some directions are irrelevant to other directions.
The gap between (1) and (2), the difference between being best at getting power and actually having the most power, is the advantage of the incumbent. Powerful incumbents can be highly suboptimal and still win because of things like network effects, agglomeration effects, defender’s advantage and so on.
There is also another gap here. It’s the gap between making entities that are generically obedient, and making a power-structure that produces good outcomes. What is that gap? Well, entities can be generically obedient but still end up producing bad outcomes because of:
(a) coordination problems (see World War I)
(b) information problems (see things like the promotion of lobotomies or HRT for middle-aged women)
(c) political economy problems (see things like NIMBYism, banning plastic straws, TurboTax corruption)
Problems of type (a) happen when everyone wants a good outcome, but they can’t coordinate on it; defection strategies are dominant, so people end up at the bad Nash equilibrium.
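To make the type (a) failure concrete, here is a minimal sketch (with illustrative, made-up payoff numbers) of a symmetric game where defection is each player’s best response no matter what the other does, so the only Nash equilibrium is the mutually bad outcome even though both players prefer mutual cooperation. Note that this bites even when each “player” is a perfectly obedient AI faithfully executing its principal’s instructions, which is what claim (6) above is about.

```python
# A minimal sketch of a type (a) coordination problem: a symmetric two-player game
# where defection is dominant, so the only Nash equilibrium is the mutually bad
# outcome, even though both players prefer (C, C). Payoff numbers are illustrative.
from itertools import product

# PAYOFF[(my_move, their_move)] = my payoff
PAYOFF = {
    ("C", "C"): 3,  # everyone cooperates: good outcome
    ("C", "D"): 0,  # I cooperate, they defect: I get exploited
    ("D", "C"): 4,  # I defect, they cooperate: short-term gain
    ("D", "D"): 1,  # mutual defection: the bad equilibrium
}

def best_responses(their_move):
    """Moves that maximize my payoff given the other player's move."""
    best = max(PAYOFF[(m, their_move)] for m in "CD")
    return {m for m in "CD" if PAYOFF[(m, their_move)] == best}

def nash_equilibria():
    """Profiles where each player is already best-responding to the other."""
    return [
        (a, b)
        for a, b in product("CD", repeat=2)
        if a in best_responses(b) and b in best_responses(a)
    ]

print(nash_equilibria())                          # [('D', 'D')] -- only the bad outcome is stable
print(best_responses("C"), best_responses("D"))   # {'D'} {'D'}: defection is dominant either way
```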
Problems of type (b) happen when everyone obediently walks off a cliff together. Supporting things like HRT for middle-aged women, or a glass of red wine per week, was backed by science, but the science was actually bunk. People like to copy each other, and obedience makes this worse because dissenters are punished more. They’re being disobedient, you see!
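To see how type (b) failures can emerge from nothing but copying, here is a toy information-cascade simulation (a sketch with assumed parameters, not a model of any real episode): each agent gets a weakly informative private signal, sees earlier public choices, and follows the majority once it leads by two, so an unlucky early run can lock everyone into the wrong answer.

```python
# Toy information-cascade simulation: weak private signals plus copying the
# visible majority can lock a whole group into the wrong answer.
import random

def run_cascade(n_agents=50, signal_accuracy=0.6, truth=1):
    choices = []
    for _ in range(n_agents):
        signal = truth if random.random() < signal_accuracy else 1 - truth
        lead = sum(1 for c in choices if c == 1) - sum(1 for c in choices if c == 0)
        if lead >= 2:
            choice = 1          # majority is decisive: copy it, ignore own signal
        elif lead <= -2:
            choice = 0
        else:
            choice = signal     # no decisive majority yet: use own signal
        choices.append(choice)
    return choices[-1] == truth  # did the group end up on the right answer?

random.seed(0)
runs = 10_000
correct = sum(run_cascade() for _ in range(runs))
print(f"group ends up correct in {correct / runs:.0%} of runs")  # noticeably below 100%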
Problems of type (c) happen because a small group of people actually benefit from making the world worse, and it often turns out that that small group are the ones who get to decide whether to perpetuate that particular way of making the world worse!
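For type (c), the arithmetic of concentrated benefits and diffuse costs is the whole story. A back-of-the-envelope sketch with made-up numbers in the TurboTax spirit:

```python
# Back-of-the-envelope sketch of a type (c) problem (illustrative numbers, not real data):
# a policy with hugely negative total welfare survives because its benefits are
# concentrated on the small group that effectively controls the decision.
filers = 100_000_000                    # people who bear the cost
cost_per_filer = 60.0                   # dollars of time/hassle per person (assumed)
beneficiaries_gain = 1_000_000_000.0    # revenue to the incumbent firm (assumed)
lobbying_spend = 5_000_000.0            # spent to keep the rule in place (assumed)

total_cost = filers * cost_per_filer
net_social_welfare = beneficiaries_gain - total_cost
print(f"net social welfare: {net_social_welfare:,.0f}")    # about -5,000,000,000: society loses

# Per-person stakes explain who shows up to fight for the rule:
print(f"stake per filer: ${cost_per_filer:,.0f}")                           # too small to organize around
print(f"stake for the firm: ${beneficiaries_gain - lobbying_spend:,.0f}")   # worth lobbying hard for
```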
For an example of the crushing advantage of specialization, see this tweet about how a tiny LLM with specialized training for multiplication of large numbers is better at it than cutting-edge general-purpose LLMs:
https://twitter.com/yuntiandeng/status/1836114419480166585
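If you wanted to reproduce that comparison yourself, a sketch of the harness is below. The `answer_fn` callable is a hypothetical wrapper around whichever model you test (a small specialized model or a general-purpose LLM API); Python’s exact integers serve as ground truth.

```python
# Sketch of a multiplication-accuracy harness. `answer_fn` is a hypothetical
# callable: it takes a question string and returns the model's reply as a string.
import random

def make_problems(n_digits=10, n_problems=100, seed=0):
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n_problems)]

def exact_match_accuracy(answer_fn, problems):
    """Fraction of problems where the model's answer equals the true product."""
    correct = 0
    for a, b in problems:
        reply = answer_fn(f"What is {a} * {b}? Answer with the number only.")
        digits = "".join(ch for ch in reply if ch.isdigit())  # tolerate commas/spaces
        correct += digits == str(a * b)                        # Python ints are the ground truth
    return correct / len(problems)

def specialized_solver(question):
    # Toy stand-in for a tiny system built only for this task: parse and multiply exactly.
    a, b = (int(tok) for tok in question.replace("?", " ").split() if tok.isdigit())
    return str(a * b)

problems = make_problems()
print(exact_match_accuracy(specialized_solver, problems))  # 1.0 by construction -- that's the point:
# the task rewards a narrow exact procedure, not broad general-purpose ability.
```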