[Question] Are there any impossibility theorems for strong and safe AI?
Are there any impossibility theorems showing that an AI cannot be both strong and safe? I think such theorems would be interesting because they could help evaluate proposals for safe AI: we could ask “which assumption does this proposal break?”
I have a vague sense that a theorem of this sort could be developed along the following lines:
1. The kind of strong AI that we want is a technological tool: it is easy to tell it what to do, and it can successfully do a wide variety of complex things when told.
2. Simple instructions + complex results → the AI has a lot of flexibility in its actions.
3. There are only a few ways to reliably achieve goals requiring complex behaviour, e.g. something approximating expected utility maximisation (see the sketch after this list).
4. 2 + 3 + instrumental convergence → that flexibility is likely to be exploited in dangerous ways.
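To make step 3 slightly more concrete, here is the standard way “expected utility maximisation” is usually written; this is just a textbook formulation for reference, not part of any existing impossibility theorem, and the notation (policy π, outcome distribution P, utility U) is my own choice:

$$\pi^* \in \arg\max_{\pi} \; \mathbb{E}_{o \sim P(\cdot \mid \pi)}\big[U(o)\big]$$

The worry in step 4 is that, for a wide range of utility functions U, the optimising policy π* tends to include convergent instrumental behaviours such as resource acquisition and self-preservation, which is where the danger enters.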
Do fleshed-out versions of this argument exist? Do you have any other ideas about impossibility theorems?