I mentioned that I expect proof-level guarantees will be easy once the conceptual problems are worked out. Strong interpretability is part of that: if we know how to “see whether the AI runs a check for whether it can deceive humans”, then I expect systems which provably don’t do that won’t be much extra work. So we might disagree less on that front than it first seemed.
The question of whether to model the AI as an open-ended optimizer is one I figured would come up. I don’t think we need to think of it as truly open-ended in order to use any of the above arguments, especially the wish-granting analogy. The relevant point is that limited optimization implies limited wish-granting ability. In order to grant more “difficult” wishes, the AI needs to steer the universe into a smaller chunk of state-space; in other words, it needs to perform stronger optimization. So AIs with limited optimization capability will be safe to exactly the extent that they are unable to grant unsafe wishes—i.e. the chunks of state-space they can access just don’t contain really bad outcomes.
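One way to make “stronger optimization means a smaller chunk of state-space” precise is to count bits of optimization. This is a hedged sketch of one standard formalization, not something from the discussion itself; the function name and the numbers are illustrative:

```python
import math

def optimization_bits(total_states, target_states):
    """Bits of optimization needed to steer outcomes into a target
    region of state-space: log2 of how much the accessible chunk
    shrinks relative to the whole space.
    (Illustrative formalization, not from the discussion itself.)"""
    return math.log2(total_states / target_states)

# Each halving of the target region costs one more bit, so a very
# "difficult" wish (tiny target) demands much stronger optimization:
easy_wish = optimization_bits(1024, 512)  # 1.0 bit
hard_wish = optimization_bits(1024, 1)    # 10.0 bits
```

An optimizer capped at a few bits simply cannot reach the tiny regions of state-space, safe or unsafe, which is the sense in which limited optimization limits wish-granting.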
Perhaps the disagreement is in how hard it is to prove things vs. test them. I pretty strongly disagree with
if we know how to “see whether the AI runs a check for whether it can deceive humans”, then I expect systems which provably don’t do that won’t be much extra work.
The version based on testing has to look at a single input scenario to the AI, whereas the proof has to quantify over all possible scenarios. These seem wildly different. Compare to e.g. telling whether Alice is being manipulated by Bob by looking at interactions between Alice and Bob, vs. trying to prove that Bob will never be manipulative. The former seems possible, the latter doesn’t.
Three possibly-relevant points here.
First, when I say “proof-level guarantees will be easy”, I mean “team of experts can predictably and reliably do it in a year or two”, not “hacker can do it over the weekend”.
Second, suppose we want to prove that a sorting algorithm always returns sorted output. We don’t do that by explicitly quantifying over all possible outputs. Rather, we do that using some insights into what it means for something to be sorted—e.g. expressing it in terms of a relatively small set of pairwise comparisons. Indeed, the insights needed for the proof are often exactly the same insights needed to design the algorithm. Once you’ve got the insights and the sorting algorithm in hand, the proof isn’t actually that much extra work, although it will still take some experts chewing on it a bit to make sure it’s correct.
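The sorting case can be made concrete. A minimal sketch (the function names are illustrative): the specification of “sorted” quantifies over all pairs of elements, but transitivity of ≤ is exactly the insight that collapses it to a small set of adjacent comparisons:

```python
def is_sorted_spec(xs):
    """Direct reading of the spec: quantify over all pairs i < j."""
    return all(xs[i] <= xs[j]
               for i in range(len(xs))
               for j in range(i + 1, len(xs)))

def is_sorted(xs):
    """Equivalent check using only adjacent pairs. Transitivity of <=
    is the insight that collapses the quadratic quantifier to a
    linear scan -- the same insight a correctness proof leans on."""
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))
```

The proof that the two are equivalent is short once you have the insight, and that same insight is what makes comparison-based sorting algorithms work in the first place.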
That’s the sort of thing I expect to happen for friendly AI: we are missing some fundamental insights into what it means to be “aligned”. Once those are figured out, I don’t expect proofs to be much harder than algorithms. Coming back to the “see whether the AI runs a check for whether it can deceive humans” example, the proof wouldn’t involve writing the checker and then quantifying over all possible inputs. Rather, it would involve writing the AI in such a way that it always passes the check, by construction—just like we write sorting algorithms so that they will always pass an is_sorted() check by construction.
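To illustrate the by-construction point with sorting (a hedged sketch; insertion sort is just one convenient example): the algorithm maintains “the accumulator is sorted” as an invariant at every step, so the output passes an is_sorted() check by construction rather than by after-the-fact quantification over inputs.

```python
def insert(x, xs):
    """Insert x into an already-sorted list, preserving sortedness."""
    for i, y in enumerate(xs):
        if x <= y:
            return xs[:i] + [x] + xs[i:]
    return xs + [x]

def insertion_sort(xs):
    """Sorted-by-construction: the accumulator is sorted before and
    after every call to insert(), so the final output is sorted
    without ever enumerating possible inputs."""
    out = []
    for x in xs:
        out = insert(x, out)  # invariant: out is always sorted
    return out
```

The correctness proof is just the invariant argument written down carefully, which is the sense in which the proof isn’t much extra work once the algorithm exists.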
Third, continuing from the previous point: the question is not how hard it is to prove compared to test. The question is how hard it is to build a provably-correct algorithm, compared to an algorithm which happens to be correct even though we don’t have a proof.
First, when I say “proof-level guarantees will be easy”, I mean “team of experts can predictably and reliably do it in a year or two”, not “hacker can do it over the weekend”.
This was also what I was imagining. (Well, actually, I was also considering more than two years.)
we are missing some fundamental insights into what it means to be “aligned”.
It sounds like our disagreement is the one highlighted in Realism about rationality. When I say we could check whether the AI is deceiving humans, I don’t mean that we have a check that succeeds literally 100% of the time because we have formalized a definition of “deception” that gives us a perfect checker. I don’t think notions like “deception”, “aligned”, “want”, “optimize”, etc. have clean formal definitions that admit a 100% successful checker. I do think these notions tend to have extremes that can be reliably identified, even if there are edge cases where it is unclear. This makes testing easy, while proofs remain very difficult.
Jumping back to the original question: it sounds like the reason you think we are doomed without proofs is that, conditional on us not having proofs, we must not have any other methods of gaining confidence (such as testing), and so we must be flying blind. Is that right?
If so, how do you square this with other engineering disciplines, which typically place most of the confidence in safety on comprehensive, expensive testing (think wind tunnels for rockets or crash tests for cars)? Perhaps this is also explained by realism about rationality—maybe physical phenomena aren’t amenable to crisp formal definitions, but “alignment” is.
It does sound like our disagreement is the same thing outlined in Realism about Rationality (although I disagree with almost all of the “realism about rationality” examples in that post—e.g. I don’t think AGI will necessarily be an “agent”, I don’t think Turing machines or Kolmogorov complexity are useful foundations for epistemology, I’m not bothered by moral intuitions containing contradictions, etc).
I would also describe my “no proofs ⇒ doomed” view, not as the proofs being causally important, but as the proofs being evidence of understanding. If we don’t have the proofs, it’s highly unlikely that we understand the system well enough to usefully predict whether it is safe—but the proofs themselves play a relatively minor role.
I do not know of any engineering discipline which places most of its confidence in safety on comprehensive, expensive testing. Every engineering discipline I have ever studied starts by understanding the system under design and the principles which govern its function, and then designs a system which is expected to be safe based on that understanding. As long as those underlying principles are understood, the most likely errors are either simple mistakes (e.g. a metric/imperial units mix-up) or missing some fundamental phenomenon (e.g. the aerodynamics of a bridge). Those are the sorts of problems which testing is good at catching. Testing is a double-check that we haven’t missed something critical; it is not the primary basis for thinking the system is safe.
A simple example, in contrast to AI: every engineering discipline I know of uses “safety factors”—e.g. make a beam twice as strong as it needs to be, give a wire twice the current capacity it needs, etc. A safety factor of 2 is typical across a wide variety of engineering fields. In AI, we cannot use safety factors, because we do not even know what number we could double to make the AI safer. Today, given any particular aspect of an AI system, we do not know whether adjusting any particular parameter will make the AI more or less reliable/risky.
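To spell out what a safety factor buys you (illustrative numbers only, for the structural-engineering case): the designer knows which scalar to multiply, and doubling it reliably adds margin.

```python
# Illustrative numbers only: how a safety factor works in, say,
# structural engineering. The point in the text is that AI has no
# known analogue -- no scalar we can simply double for extra margin.
expected_peak_load_kn = 50.0   # hypothetical peak load on a beam
safety_factor = 2.0            # typical across many fields
required_capacity_kn = expected_peak_load_kn * safety_factor  # design target
```

The calculation is trivial precisely because the governing principle (load vs. capacity) is understood; the absence of any such knob is the disanalogy with AI.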