On the note of There Ain’t No Such Thing As A Free Buck:
Philosophy buck: One might want a seed FAI to provably self-modify into an FAI with an above-humanly-possible level of philosophical ability/expressive power. But such a proof might require a significant amount of philosophical progress/expressive power from humans beforehand; we cannot rely on a seed FAI that will FOOM to help us prove its own philosophical ability. Other, more preliminary artifacts and computations (e.g. computers, calculators) can assist us, though.
Thanks, Knave! I’ll use ‘artificial superintelligence’ (ASI, or just SI) here, to distinguish this kind of AGI from non-superintelligent AGIs (including seed AIs, superintelligences-in-the-making that haven’t yet gone through a takeoff). Chalmers’ ‘AI++’ also works for singling out the SI from other kinds of AGI. ‘FAI’ doesn’t help, because it’s ambiguous whether we mean a Friendly SI, or a Friendly seed (i.e., a seed that will reliably produce a Friendly SI).
The dilemma is that we can safely use low-intelligence AGIs to help with Friendliness Theory, but they may not be smart enough to get the right answer; whereas high-intelligence AGIs will be more useful for Friendliness Theory, but also more dangerous given that we haven’t already solved Friendliness Theory.
In general, ‘provably Friendly’ might mean any of three different things:
1. Humans can prove, without using an SI, that the SI is (or will be) Friendly. (Other AGIs, ones that are stupid and non-explosive, may be useful here.)
2. The SI can prove to itself that it is Friendly. (This proof may be unintelligible to humans. This is important during self-modifications; any Friendly AI that’s about to enact a major edit to its own source code will first confirm that the edit will not make it Unfriendly.)
3. The SI can prove to humans that it is Friendly. (This doesn’t just mean persuading humans of its Friendliness, nor does it just mean giving them a sound proof; it means giving them a sound proof that persuades them specifically because of its soundness. So humans must be able to understand and verify the proof with an enormous amount of confidence.)
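To make the three readings concrete, here’s a rough notational sketch (F, ⊢_H, and ⊢_SI are just my shorthand, not standard machinery): let F(x) abbreviate ‘x is Friendly’, let ⊢_H mark proofs that humans can construct and check unaided, and let ⊢_SI mark proofs carried out inside the SI’s own formal system. Then, very roughly:
Provability-1: ⊢_H F(SI).
Provability-2: ⊢_SI F(SI); and for self-modification, the current system A needs ⊢_A F(A′) before rewriting itself into A′.
Provability-3: the SI exhibits a proof π of F(SI) that humans can independently verify, and they accept F(SI) because π is sound.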
You’re talking about the first kind of provability. It might be that Friendliness decomposes into several different problems, some that are easy enough for humans to solve, and others that require an SI but are safe to hand off to the SI, given that we’ve solved the humanly possible component. Indirect normativity is an example of this; we solve the core of Friendliness ourselves, but leave most of the FAI’s specific behaviors and goals for the FAI itself to figure out. I don’t think most proponents of indirect normativity expect provability-3 to realistically play a major role in verifying that a system is FAI-complete.
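As a toy illustration of the prove-before-modify step in the second kind of provability, here is a minimal sketch in Python. Everything here is a placeholder assumption on my part; formal_friendliness_spec, propose_edit, and prove don’t exist, and whether such a spec and prover can be built at all is exactly the open problem:

def safe_self_modify(current_source, formal_friendliness_spec, propose_edit, prove):
    """Adopt a self-edit only if Friendliness of the successor is provable."""
    candidate_source = propose_edit(current_source)
    # Try to prove, within the system's own logic, that the successor still
    # satisfies the Friendliness specification; `prove` is assumed to return
    # a proof object on success and None on failure.
    proof = prove(formal_friendliness_spec, candidate_source)
    if proof is None:
        return current_source   # edit not verified safe; keep the old code
    return candidate_source     # adopt the edit only once the proof is in hand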
Brilliant post.