I think most people’s intuitions come from more everyday experiences like:
It’s easier to review papers than to write them.
Fraud is often caught using a tiny fraction of the effort required to perpetrate it.
I can tell that a piece of software is useful for me more easily than I can write it.
These observations seem relevant to questions like “can we delegate work to AI” because they are ubiquitous in everyday situations where we want to delegate work.
The claim in this post seems to be: sometimes it’s easier to create an object with property P than to decide whether a borderline instance satisfies property P. You chose a complicated example, but you could just as well have used something very mundane like “Make a pile of sand more than 6 inches tall.” I can do the task by making a 12 inch pile of sand, but if someone gives me a pile of sand that is 6.0000001 inches tall, I’m going to need very precise measurement devices and philosophical clarification about what “tall” means.
I don’t think this observation undermines the claim that “it is easier to verify that someone has made a tall pile of sand than to do it yourself.” If someone gives me a 6.0000001 inch tall pile of sand I can say “could you make it taller?” And if I ask for a program that halts and someone gives me a program that looks for a proof of false in PA, I can just say “try again.”
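To make that concrete, here is a toy sketch of the dynamic (the threshold, margin, and function names are all mine, purely illustrative): the generator overshoots the threshold, and the verifier bounces anything near the borderline instead of adjudicating it.

```python
THRESHOLD_INCHES = 6.0
MARGIN_INCHES = 1.0  # hypothetical safety margin chosen by the verifier

def build_pile() -> float:
    # Generation: aim well above the threshold, so no precise
    # measurement or philosophical clarification is ever needed.
    return 12.0

def review_pile(height: float) -> str:
    # Verification: accept clear passes, reject clear failures, and
    # send borderline piles back rather than adjudicating them.
    if height >= THRESHOLD_INCHES + MARGIN_INCHES:
        return "accept"
    if height < THRESHOLD_INCHES - MARGIN_INCHES:
        return "reject"
    return "could you make it taller?"

print(review_pile(build_pile()))  # accept
print(review_pile(6.0000001))     # could you make it taller?
```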
I do think there are plenty of examples where verification is not easier than generation (and certainly where verification is non-trivial). It’s less clear what the relevance of that is.
I don’t think the generalization of the OP is quite “sometimes it’s easier to create an object with property P than to decide whether a borderline instance satisfies property P”. Rather, the halting example suggests that verification is likely to be harder than generation specifically when there is some (possibly implicit) adversary. What makes verification potentially hard is the part where we have to quantify over all possible inputs—the verifier must work for any input.
Borderline cases are an issue for that quantifier, but more generally any sort of adversarial pressure is a potential problem.
Under that perspective, the “just ask it to try again on borderline cases” strategy doesn’t look so promising, because an adversary is potentially optimizing against me—i.e. looking for cases which will fool not only my verifier, but myself.
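As a minimal illustration of how a “for all inputs” quantifier interacts with an adversary (a toy example of my own, not from the OP): a verifier that only spot-checks a property is easily fooled by a generator that optimizes against those spot checks.

```python
import random

def is_sorted(xs):
    # Ground truth: the property quantifies over *every* adjacent pair.
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def lazy_verifier(xs, spot_checks=5):
    # Only inspects a handful of random adjacent pairs.
    if len(xs) < 2:
        return True
    idxs = random.sample(range(len(xs) - 1), min(spot_checks, len(xs) - 1))
    return all(xs[i] <= xs[i + 1] for i in idxs)

def adversarial_generator(n):
    # Optimizes against the verifier: sorted almost everywhere, with a
    # single defect the spot checks are very unlikely to land on.
    xs = sorted(random.random() for _ in range(n))
    xs[n // 2] = -1.0
    return xs

xs = adversarial_generator(10_000)
print(lazy_verifier(xs))  # almost always True: the verifier is fooled
print(is_sorted(xs))      # False: the property actually fails
```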
As for the everyday experiences you list: I agree that such experiences seem to be where people’s intuitions on the matter often come from. Much like the case in the OP, I think people select for problems for which verification is easy—after all, those are the problems which are most legible, easiest to outsource (and therefore most likely to be economically profitable), etc. On the other hand, once we actively look for cases where adversarial pressure makes verification hard, or where there’s a “for all” quantifier, it’s easy to find such cases. For instance, riffing on your own examples:
It’s only easier to review papers than to write them because reviewers do not actually need to catch all problems. If missing a problem in a paper review resulted in a death sentence, I expect almost nobody would consider themselves competent to review papers.
Likewise with fraud: it’s only easier to catch than to perpetrate if we’re not expected to catch all of it, or even most of it.
It’s easier to write secure software than to verify that a piece of software is secure.
If including an error in a paper resulted in a death sentence, no one would be competent to write papers either.
For fraud, I agree that “tractable fraud has a meaningful probability of being caught,” and not “tractable fraud has a very high probability of being caught.” But “meaningful probability of being caught” is just what we need for AI delegation.
Verifying that arbitrary software is secure (even if it’s actually secure) is much harder than writing secure software. But verifiable and delegatable work is still extremely useful for the process of writing secure software.
To the extent that any of these problems are hard to verify, I think it’s almost entirely because of the “position of the interior” where an attacker can focus their effort on hiding an attack in a single place but a defender needs to spread their effort out over the whole attack surface.
But in that case we just apply verification vs generation again. It’s extremely hard to tell if code has a security problem, but in practice it’s quite easy to verify a correct claim that code has a security problem. And that’s what’s relevant to AI delegation, since in fact we will be using AI systems to help oversee in this way.
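A minimal sketch of what I mean by “verify a correct claim that code has a security problem” (the sanitizer and exploit string here are a standard toy example, not anyone’s real code): the claim arrives with a demonstrating input, so checking it is just a matter of running the code.

```python
def strip_script(html: str) -> str:
    # Naive sanitizer: removes literal <script> tags in a single pass.
    return html.replace("<script>", "").replace("</script>", "")

# The claim "strip_script can be bypassed" comes packaged with a
# demonstrating input, so verifying it only requires running the code:
claimed_exploit = "<scr<script>ipt>alert(1)</scr</script>ipt>"
output = strip_script(claimed_exploit)
assert "<script>" in output  # the stripped-out tag reassembles itself
print("claim verified:", output)
```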
If you want to argue that e.g. writing secure software is fundamentally hard to verify, I think it would be much more interesting and helpful to exhibit a case of software with a vulnerability where it’s really hard for someone to verify the claim that the vulnerability exists.
Rather, the halting example suggests that verification is likely to be harder than generation specifically when there is some (possibly implicit) adversary.
Rice’s theorem says there are a lot of programs where you can’t tell if they will halt. But if I want to write a program that will/won’t halt, I’m just going to write a program for which it’s obvious. And if I asked you to write a program that will/won’t halt and you write the kind of program where I can’t tell, I’m just going to send it back.
Now that could still be hard. You could put a subtle problem in your code that makes it so it halts eventually even though it looks like it obviously doesn’t. But Rice’s theorem doesn’t say anything about that.
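For a toy illustration of that distinction (my own example, not from the discussion above): one program where the halting behavior is obvious, and one that reads like an infinite loop but halts because of a subtle floating-point detail.

```python
def obviously_never_halts():
    # Nothing subtle here: this is an infinite loop on its face.
    while True:
        pass

def looks_like_it_never_halts():
    # At a glance "x + 1 > x" seems always true, so this reads as an
    # infinite loop. It actually halts: once x reaches 2**53, float
    # rounding makes x + 1 == x and the loop condition fails.
    x = 1.0
    while x + 1 > x:
        x *= 2
    return x

print(looks_like_it_never_halts())  # 9007199254740992.0
```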
And reiterating the previous point, if someone wrote down a program that looks like it obviously doesn’t halt, but secretly it does because of an adversarial trick, then I would very strongly expect that someone could point out my mistake to me and I would conclude that it is no longer obvious whether it halts. Counterexamples to this kind of optimism would be way more impactful.
But in that case we just apply verification vs generation again. It’s extremely hard to tell if code has a security problem, but in practice it’s quite easy to verify a correct claim that code has a security problem. And that’s what’s relevant to AI delegation, since in fact we will be using AI systems to help oversee in this way.
I know you said that you’re not going to respond, but in case you feel like giving a clarification, I’d like to point out that I’m confused here.
Yes, it is usually easy to verify that a specific problem exists if the exact problem is pointed out to you[1].
But it’s much harder to verify the claim that there are no problems and the code is doing exactly what you want.
And AFAIK staying in a loop:
1) AI tells us “here’s a specific problem”
2) We fix the problem then
3) Go back to step 1)
Doesn’t help with anything? We want to be in a state where AI says “This is doing exactly what you want” and we have reasons to trust that (and that is hard to verify).
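Concretely, the loop I have in mind looks something like this (the untrusted_ai and apply_fix helpers and the “ok” reply are hypothetical placeholders): it terminates exactly when the untrusted model says everything is fine, which is the claim we don’t know how to verify.

```python
def oversight_loop(code, untrusted_ai, apply_fix, max_rounds=100):
    # 1) the AI names a specific problem, 2) we fix it, 3) repeat.
    for _ in range(max_rounds):
        report = untrusted_ai(code)      # e.g. "off-by-one in parse()" or "ok"
        if report == "ok":
            # Termination rests entirely on trusting this "ok" --
            # exactly the claim we don't know how to verify.
            return code
        code = apply_fix(code, report)   # the named problem is easy to check
    return code
```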
EDIT to add: I think I didn’t make it clear enough what clarification I’m asking for.
Do you think it’s possible to use AI which will point out problems (but which we can’t trust when it says everything is ok) to “win”? It would be very interesting if you did and I’d love to learn more.
Do you think that we could trust AI when it says that everything is ok? Again that’d be very interesting.
Did I miss something? I’m curious to learn what, but in that case it’s just me being wrong (which isn’t interesting in the “new path to win” sense).
Also it’s possible that there are two problems, each problem is easy to fix on its own but it’s really hard to fix them both at the same time (simple example: it’s trivial to have 0 false positives or 0 false negatives when testing for a disease; it’s much harder to eliminate both at the same time).
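A tiny numeric sketch of that disease-testing example (all scores and thresholds made up): because the score distributions overlap, either extreme threshold eliminates one error type, but no threshold eliminates both.

```python
def classify(score, threshold):
    return score >= threshold  # True means "flagged as sick"

sick_scores    = [0.4, 0.6, 0.9]  # made-up test scores for sick patients
healthy_scores = [0.1, 0.5, 0.7]  # made-up test scores for healthy patients

# threshold 0.0: flag everyone -> zero false negatives, several false positives
# threshold 1.1: flag no one   -> zero false positives, several false negatives
# anything in between: the overlapping scores force at least one of each
for threshold in (0.0, 1.1, 0.55):
    false_neg = sum(not classify(s, threshold) for s in sick_scores)
    false_pos = sum(classify(s, threshold) for s in healthy_scores)
    print(f"threshold={threshold}: FN={false_neg}, FP={false_pos}")
```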
[1] Well, it can be hard to reliably reproduce a problem, even if you know exactly what the problem is (I know because I couldn’t write e2e tests to verify some bug fixes).
I think it would be much more interesting and helpful to exhibit a case of software with a vulnerability where it’s really hard for someone to verify the claim that the vulnerability exists.
Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors.
I get the impression that you have noticed the lack of observed counterexamples, and updated that counterexamples are rare, without noticing that you would also mostly not observe counterexamples even if they were common. (Though of course this is subject to the usual qualifiers about how it’s difficult to guess other people’s mental processes, you have better information than I do about whether you indeed updated in such a way, etc.)
That said, if I were to actively look for such counterexamples in the context of software, the obfuscated C code competition would be one natural target.
We can also get indirect bits of evidence on the matter. For instance, we can look at jury trials, and notice that they are notoriously wildly unreliable in practice. That suggests that, relative to the cognition of a median-ish human, there must exist situations in which one lawyer can point out the problem in another’s logic/evidence, and the median-ish human will not be able to verify it. Now, one could argue that this is merely because median-ish humans are not very bright (a claim I’d agree with), but then it’s rather a large jump to claim that e.g. you or I are so smart that analogous problems are not common for us.
For instance, we can look at jury trials, and notice that they are notoriously wildly unreliable in practice. That suggests that, relative to the cognition of a median-ish human, there must exist situations in which one lawyer can point out the problem in another’s logic/evidence, and the median-ish human will not be able to verify it.
This is something of a tangent, but juries’ unreliability does not particularly suggest that conclusion to me. I immediately see three possible reasons for juries to be unreliable:
The courts may not reliably communicate to juries the criteria by which they are supposed to decide the case
The jurors may decide to ignore the official criteria and do something else instead
The jurors may know the official criteria and make a sincere attempt to follow them, but fail in some way
You’re supposing that the third reason dominates. I haven’t made a serious study of how juries work in practice, but my priors say the third reason is probably the least significant, so this is not very convincing to me.
(I also note that you’d need to claim that juries are inconsistent relative to the lawyers’ arguments, not merely inconsistent relative to the factual details of the case, and it’s not at all obvious to me that juries’ reputation for unreliability is actually controlled in that way.)
Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors.
You can check whether there are examples where it takes an hour to notice a problem, or 10 hours, or 100 hours… You can check whether there are examples that require lots of expertise to evaluate. And so on. The question isn’t whether there is some kind of magical example that is literally impossible to notice, it’s whether there are cases where verification is hard relative to generation!
You can check whether you can generate examples, or whether other people believe that they can generate examples. The question is about whether a slightly superhuman AI can find examples, not whether they exist (and indeed whether they exist is more unfalsifiable, not because of the difficulty of recognizing them but because of the difficulty of finding them).
You can look for examples in domains where the ground truth is available. E.g. we can debate about the existence of bugs or vulnerabilities in software, and then ultimately settle the question by running the code and having someone demonstrate a vulnerability. If Alice claims something is a vulnerability but I can’t verify her reasoning, then she can still demonstrate that it was correct by going and attacking the system.
I’ve looked at e.g. some results from the underhanded C competition and they are relatively easy for laypeople to recognize in a short amount of time when the attack is pointed out. I have not seen examples of attacks that are hard to recognize as plausible attacks without significant expertise or time, and I am legitimately interested in them.
I’m bowing out here, you are welcome to the last word.
What makes verification potentially hard is the part where we have to quantify over all possible inputs—the verifier must work for any input.
I feel like it should be possible to somehow factor this discussion into two orthogonal claims, where one of the claims is something like “doing something for all inputs is harder than doing it for just one input” and the other is something like “it is harder to identify one particular example as noteworthy than it is to verify the noteworthiness of that particular example”.
And it seems likely to me that both claims are true if you separate them like that.