If including an error in a paper resulted in a death sentence, no one would be competent to write papers either.
For fraud, I agree that “tractable fraud has a meaningful probability of being caught,” and not “tractable fraud has a very high probability of being caught.” But “meaningful probability of being caught” is just what we need for AI delegation.
Verifying that arbitrary software is secure (even if it’s actually secure) is much harder than writing secure software. But verifiable and delegatable work is still extremely useful for the process of writing secure software.
To the extent that any of these problems are hard to verify, I think it’s almost entirely because of the “position of the interior” where an attacker can focus their effort on hiding an attack in a single place but a defender needs to spread their effort out over the whole attack surface.
But in that case we just apply verification vs generation again. It’s extremely hard to tell if code has a security problem, but in practice it’s quite easy to verify a correct claim that code has a security problem. And that’s what’s relevant to AI delegation, since in fact we will be using AI systems to help oversee in this way.
If you want to argue that e.g. writing secure software is fundamentally hard to verify, I think it would be much more interesting and helpful to exhibit a case of software with a vulnerability where it’s really hard for someone to verify the claim that the vulnerability exists.
Rather, the halting example suggests that verification is likely to be harder than generation specifically when there is some (possibly implicit) adversary.
Rice’s theorem says there are a lot of programs where you can’t tell if they will halt. But if I want to write a program that will/won’t halt, I’m just going to write a program for which it’s obvious. And if I asked you to write a program that will/won’t halt and you write the kind of program where I can’t tell, I’m just going to send it back.
Now that could still be hard. You could put a subtle problem in your code that makes it so it halts eventually even though it looks like it obviously doesn’t. But Rice’s theorem doesn’t say anything about that.
And reiterating the previous point, if someone wrote down a program that looks like it obviously doesn’t halt, but secretly it does because of an adversarial trick, then I would very strongly expect that someone could point out my mistake to me and I would conclude that it is no longer obvious whether it halts. Counterexamples to this kind of optimism would be way more impactful.
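For concreteness, a toy illustration of the kind of trick I mean (my own made-up example): the loop below reads as though it obviously never terminates, but floating-point rounding makes it halt after a few dozen iterations.

```python
x = 1.0
# Reads as "keep doubling a positive number": mathematically x + 1 is never
# equal to x, so the loop looks like it obviously never terminates.
while x + 1 != x:
    x *= 2
# But in 64-bit floating point, once x reaches 2**53, x + 1 rounds back to x,
# so the loop actually halts after 53 iterations.
print("halted with x =", x)
```

Once the rounding behavior is pointed out, verifying the claim that it halts takes a minute; that's the asymmetry I'm relying on.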
But in that case we just apply verification vs generation again. It’s extremely hard to tell if code has a security problem, but in practice it’s quite easy to verify a correct claim that code has a security problem. And that’s what’s relevant to AI delegation, since in fact we will be using AI systems to help oversee in this way.
I know you said that you’re not going to respond but in case you feel like giving a clarification I’d like to point out that I’m confused here.
Yes, it is usually easy to verify that a specific problem exists if the exact problem is pointed out to you[1].
But it's much harder to verify the claim that there are no problems and that the code is doing exactly what you want.
And AFAIK staying in a loop like this (sketched below):
1) AI tells us “here’s a specific problem”
2) We fix the problem then
3) Go back to step 1)
Doesn't help with anything? We want to end up in a state where the AI says “This is doing exactly what you want” and we have reasons to trust that (and that claim is hard to verify).
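To make the loop concrete, here's a minimal sketch; ai_find_problem and apply_fix are made-up names, just for illustration:

```python
# Hypothetical oversight loop; ai_find_problem and apply_fix are made-up helpers.
def flag_and_fix(code, ai_find_problem, apply_fix):
    while True:
        problem = ai_find_problem(code)  # a concrete reported problem is easy to verify
        if problem is None:
            # The loop only exits on the claim "no problems left" --
            # exactly the claim we don't know how to trust.
            return code
        code = apply_fix(code, problem)  # each individual fix is checkable
```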
EDIT to add: I think I didn’t make it clear enough what clarification I’m asking for.
Do you think it's possible to use an AI which will point out problems (but which we can't trust when it says everything is ok) to “win”? It would be very interesting if you did and I'd love to learn more.
Do you think that we could trust an AI when it says that everything is ok? Again, that'd be very interesting.
Did I miss something? I'm curious to learn what, but then that's just me being wrong (which is not as interesting as a new path to winning).
Also, it's possible that there are two problems where each one is easy to fix on its own, but it's really hard to fix them both at the same time (a simple example: it's trivial to have 0 false positives or 0 false negatives when testing for a disease; it's much harder to eliminate both at the same time).
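A toy illustration of that disease-testing example (made-up functions):

```python
def test_never_positive(patient):
    # Never reports the disease: zero false positives, but every sick patient is missed.
    return False

def test_never_negative(patient):
    # Always reports the disease: zero false negatives, but every healthy patient is flagged.
    return True

# Either error rate can trivially be driven to zero on its own;
# driving both to zero at once is the actual hard problem.
```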
[1] Well, it can be hard to reliably reproduce a problem, even if you know exactly what the problem is (I know because I couldn't write e2e tests to verify some bug fixes).
I think it would be much more interesting and helpful to exhibit a case of software with a vulnerability where it’s really hard for someone to verify the claim that the vulnerability exists.
Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors.
I get the impression that you have noticed the lack of observed counterexamples, and updated that counterexamples are rare, without noticing that you would also mostly not observe counterexamples even if they were common. (Though of course this is subject to the usual qualifiers about how it's difficult to guess other people's mental processes, you have better information than I about whether you indeed updated in such a way, etc.)
That said, if I were to actively look for such counterexamples in the context of software, the obfuscated C code competition would be one natural target.
We can also get indirect bits of evidence on the matter. For instance, we can look at jury trials, and notice that they are notoriously wildly unreliable in practice. That suggests that, relative to the cognition of a median-ish human, there must exist situations in which one lawyer can point out the problem in another's logic/evidence, and the median-ish human will not be able to verify it. Now, one could argue that this is merely because median-ish humans are not very bright (a claim I'd agree with), but then it's rather a large jump to claim that e.g. you or I are so smart that analogous problems are not common for us.
For instance, we can look at jury trials, and notice that they are notoriously wildly unreliable in practice. That suggests that, relative to the cognition of a median-ish human, there must exist situations in which one lawyer can point out the problem in another's logic/evidence, and the median-ish human will not be able to verify it.
This is something of a tangent, but juries’ unreliability does not particularly suggest that conclusion to me. I immediately see three possible reasons for juries to be unreliable:
The courts may not reliably communicate to juries the criteria by which they are supposed to decide the case
The jurors may decide to ignore the official criteria and do something else instead
The jurors may know the official criteria and make a sincere attempt to follow them, but fail in some way
You’re supposing that the third reason dominates. I haven’t made a serious study of how juries work in practice, but my priors say the third reason is probably the least significant, so this is not very convincing to me.
(I also note that you’d need to claim that juries are inconsistent relative to the lawyers’ arguments, not merely inconsistent relative to the factual details of the case, and it’s not at all obvious to me that juries’ reputation for unreliability is actually controlled in that way.)
Conditional on such counterexamples existing, I would usually expect to not notice them. Even if someone displayed such a counterexample, it would presumably be quite difficult to verify that it is a counterexample. Therefore a lack of observation of such counterexamples is, at most, very weak evidence against their existence; we are forced to fall back on priors.
You can check whether there are examples where it takes an hour to notice a problem, or 10 hours, or 100 hours… You can check whether there are examples that require lots of expertise to evaluate. And so on. The question isn't whether there is some kind of magical example that is literally impossible to notice, it's whether there are cases where verification is hard relative to generation!
You can check whether you can generate examples, or whether other people believe that they can generate examples. The question is about whether a slightly superhuman AI can find examples, not whether they exist (and indeed whether they exist is more unfalsifiable, not because of the difficulty of recognizing them but because of the difficulty of finding them).
You can look for examples in domains where the ground truth is available. E.g. we can debate about the existence of bugs or vulnerabilities in software, and then ultimately settle the question by running the code and having someone demonstrate a vulnerability. If Alice claims something is a vulnerability but I can’t verify her reasoning, then she can still demonstrate that it was correct by going and attacking the system.
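As a toy illustration of settling such a dispute by demonstration (the function and paths below are made up):

```python
import os

def read_user_file(base_dir, filename):
    # Toy "vulnerable" endpoint: joins paths without checking for directory traversal.
    path = os.path.join(base_dir, filename)
    with open(path) as f:
        return f.read()

# Alice claims that filename can escape base_dir. Even if I can't follow her
# reasoning, she can settle it by demonstration: the resolved path leaves the
# intended directory entirely.
print(os.path.abspath(os.path.join("/srv/uploads", "../../etc/passwd")))
# -> /etc/passwd
```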
I’ve looked at e.g. some results from the underhanded C competition and they are relatively easy for laypeople to recognize in a short amount of time when the attack is pointed out. I have not seen examples of attacks that are hard to recognize as plausible attacks without significant expertise or time, and I am legitimately interested in them.
I’m bowing out here, you are welcome to the last word.