Are there any impossibility theorems ruling out AI that is both strong and safe? I think such theorems would be interesting because they could help us evaluate proposals for safe AI: we could ask “which assumption does this proposal break?”
I have a vague sense that a theorem of this sort could be developed along the following lines:
1. The kind of strong AI that we want is a technological tool that is easy to instruct and that can successfully carry out a wide variety of complex tasks when instructed
2. Simple instructions + complex results → the AI has a lot of flexibility in its actions
3. There are only a few ways to reliably achieve goals that require complex behaviour, e.g. something approximating expected utility maximisation
4. 2 + 3 + instrumental convergence → that flexibility is likely to be exploited in dangerous ways
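To gesture at what I mean by 4, here is a purely illustrative toy sketch (my own framing; the goals, actions, and success probabilities are made up): for several unrelated terminal goals, an instrumental move like “acquire resources/optionality first” raises the chance of success, so a flexible maximiser converges on it regardless of what it was told to do.

```python
# Toy illustration of instrumental convergence. Every number below is invented
# purely for illustration.

# P(goal achieved | first move), for a few unrelated terminal goals.
goals = {
    "prove a theorem":     {"work on the goal directly": 0.30, "acquire resources first": 0.55},
    "cure a disease":      {"work on the goal directly": 0.40, "acquire resources first": 0.70},
    "maximise paperclips": {"work on the goal directly": 0.60, "acquire resources first": 0.85},
}

for goal, options in goals.items():
    best_first_move = max(options, key=options.get)
    print(f"{goal}: {best_first_move}")

# Every goal selects "acquire resources first": the same instrumental behaviour
# falls out of very different instructions, which is the flexibility that
# premise 4 worries will be exploited in dangerous ways.
```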
Do fleshed-out versions of this argument exist? Do you have any other ideas for impossibility theorems?
You might be interested in Impossibility Results in AI: A Survey (Brcic & Yampolskiy 2022).
(Disclaimer: I’ve only skimmed this paper.)
I think this sort of counts?
https://www.lesswrong.com/posts/WCX3EwnWAx7eyucqH/corrigibility-can-be-vnm-incoherent
I’ve also sort of derived some informal arguments myself in the same vein, though I haven’t published them anywhere.
Basically, approximately all of the focus is on creating/aligning a consequentialist utility maximizer, but consequentialist utility maximizers don’t like being corrected, will tend to want to change your preferences, and so on, all of which seems bad for alignment.
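To make the “doesn’t like being corrected” point concrete, here is a minimal toy sketch (my own, with made-up payoffs and probabilities, not taken from the linked post): an agent maximizing expected utility under its current utility function prefers to block a correction channel whenever being corrected lowers its expected utility, which is the basic shape of the incorrigibility worry.

```python
# Toy model: an expected-utility maximizer deciding whether to let humans
# correct it. All payoffs and probabilities are made up for illustration.

# Agent's *current* utility for each final outcome.
U_TASK_DONE = 10.0   # agent finishes its task as originally specified
U_CORRECTED = 2.0    # humans intervene and redirect the agent mid-task

# If the correction channel is left open, assume humans intervene with this probability.
P_INTERVENTION = 0.5

def expected_utility(block_correction: bool) -> float:
    """Expected utility, under the agent's current preferences, of each option."""
    if block_correction:
        return U_TASK_DONE  # correction impossible, task completes for sure
    return P_INTERVENTION * U_CORRECTED + (1 - P_INTERVENTION) * U_TASK_DONE

print(expected_utility(block_correction=True))   # 10.0
print(expected_utility(block_correction=False))  # 6.0

# The maximizer prefers blocking correction whenever U_CORRECTED < U_TASK_DONE,
# regardless of whether the humans' correction would actually be an improvement.
```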