Another way to say it: we already have lots of evidence from other fields on whether verification is easier than generation, and that evidence shows it is, so the mapping is already mostly given to us.
Note that I’m referring to incorrectly judging relative to their own internal value system, not incorrectly judging relative to another person’s values.
I think there are many other cases where verification and generation are both extremely difficult, including ones where verification is much harder than generation. A few examples:
The Collatz conjecture is true.
The net effect of SB-1047 will be positive [given x values].
Trump will win the upcoming election.
The 10th Busy Beaver number is <number>.
Such and such a software system is not vulnerable to hacking[1].
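To make the first and fourth examples concrete: checking any single instance of the Collatz claim is mechanical, but that gives no traction on the universal statement, which is where the difficulty lives. A minimal sketch (the function name and step budget are arbitrary choices, not anything from this thread):

```python
def reaches_one(n, max_steps=10_000):
    """Follow the Collatz map from n; True if we hit 1 within the budget."""
    for _ in range(max_steps):
        if n == 1:
            return True
        n = 3 * n + 1 if n % 2 else n // 2
    return False  # inconclusive within the step budget


# Checking instances is easy and fast...
assert all(reaches_one(n) for n in range(1, 10_000))
# ...but the conjecture is the claim about *all* n, which no finite
# amount of instance-checking can settle.
```

The Busy Beaver example is analogous: verifying any claimed value of BB(10) would require ruling out every non-halting 10-state machine, and there is no general procedure for that.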
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
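For contrast, the NP pattern where the asymmetry genuinely holds looks like this: checking a proposed satisfying assignment for a CNF formula takes one linear pass, while finding one is believed to take exponential time in the worst case. A self-contained sketch (the DIMACS-style literal encoding is just a common convention, chosen here for illustration):

```python
from itertools import product

def verify(formula, assignment):
    """Linear-time check: does the assignment satisfy every clause?

    `formula` is a list of clauses; each literal is a 1-indexed variable
    number, negative for negation (the usual DIMACS-style convention).
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

def solve(formula, num_vars):
    """Brute-force generation: 2^n candidate assignments in the worst case."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if verify(formula, assignment):
            return assignment
    return None

formula = [(1, 2), (-1, 3), (-2, -3)]  # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(solve(formula, 3))                                # search: exponential in general
print(verify(formula, {1: False, 2: True, 3: False}))   # check: one pass -> True
```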
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’. Further, there are probably many cases where a misaligned AI may happen to know a fact about the world, one which the evaluator doesn’t know or hasn’t noticed, that means that A will have very large negative effects.
This is the classic blue-team / red-team dichotomy; the defender has to think of and prevent every attack; the attacker only has to come up with a single vulnerability. Or in cryptography, Schneier’s Law: ‘Anyone can invent a security system so clever that she or he can’t think of how to break it.’
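A back-of-the-envelope way to see the asymmetry (a toy independence model, not anything from the thread): if the defender catches each of n independent holes with probability p, the whole defence holds only with probability p^n, which collapses as n grows:

```python
# P(defence holds) = p ** n if each of n independent holes is
# caught with probability p.
p = 0.95
for n in (1, 10, 50, 100):
    print(f"n = {n:3d}: P(no hole left open) = {p ** n:.3f}")
# -> 0.950, 0.599, 0.077, 0.006: one attacker success is all it takes.
```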
To address your examples:
‘The Collatz conjecture is true’ can’t be a statement that is harder to verify than it is to prove, for the reason given by Vanessa Kosoy here, though you might be right that verifying the generated proof would still be hard in absolute terms:
https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general#feTSDufEqXozChSbB
The same response applies to the fourth example (the Busy Beaver number).
On election outcomes: polls on Labor Day are actually reasonably predictive of what happens in November, mostly because by that point voters have heard a lot more about the prospective candidates and are starting to form opinions.
For the SB-1047 case, the one prediction I will make right now is that the law will have essentially no effect, positive or negative, for a wide range of values, solely because it’s a rather weak AI bill after the amendments.
I usually don’t focus on the case where we try to align an AI that is already misaligned, but rather on getting the model into a basin of alignment early via data.
Re Schneier’s Law and security mindset: I’ve become more skeptical that security mindset is useful in general, for two reasons:
I think there are enough disanalogies between ML models and the systems security deals with, like the fact that you can randomly change some of a model’s parameters and in most cases still get the same or even improved performance, something that notably has no analogue in the actual security field, or even in fields that deal with highly fragile systems (a toy illustration of this appears after the links below).
There is also good reason to believe that many of the discovered security exploits that seem magical don’t actually matter in practice, because of their ridiculous preconditions, and that the computer security field is selection-biased toward saying a given exploit dooms us forever (even when it can’t):
These posts and comments are helpful pointers to my view:
https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment
https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#ogt6CZkMNZ6oReuTk
https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#MFqdexvnuuRKY6Tbx
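As a toy illustration of the first reason (a minimal sketch: a linear model on synthetic data, with noise scales chosen arbitrarily, so it gestures at rather than demonstrates the robustness claimed for large networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary labels from a random linear rule.
X = rng.normal(size=(2000, 20))
w_true = rng.normal(size=20)
y = (X @ w_true > 0).astype(float)

# Fit by least squares (adequate for this linearly separable toy).
w, *_ = np.linalg.lstsq(X, 2 * y - 1, rcond=None)

def accuracy(weights):
    return ((X @ weights > 0).astype(float) == y).mean()

print("clean accuracy:", accuracy(w))
# Randomly perturb the parameters and re-measure.
for scale in (0.01, 0.05, 0.1):
    noisy = w + rng.normal(scale=scale * np.abs(w).mean(), size=w.shape)
    print(f"noise scale {scale}: accuracy = {accuracy(noisy):.3f}")
```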
On the point that problems in NP aren’t much evidence about capabilities/alignment:
True, but I do think there is already real traction on the problem, and IMO one of the cooler results is Pretraining Language Models from Human Feedback. Note also that even a problem in NP can be really intractable in the worst case (though we don’t have a proof of that).
So there’s a strained analogy to be made here.
And on ‘Given utility function U, will action A have net-positive consequences?’:
Yeah, I do actually think that in practice this problem is in the reference class, and that we are much better at judging, critiquing, and verifying outcomes than at actually producing them, as evidenced by how many more people do the former than the latter.
Indeed, one of the traps of social reformers IRL is to think that just because verifying whether something is right or wrong is easy, generating a new social outcome (perhaps via new norms) must also be easy. It isn’t, because the verification side is much easier than the generation side.
Thanks for the thoughtful responses.
I’m talking about something a bit different, though: claiming in advance that A will have net-positive consequences vs verifying in advance that A will have net-positive consequences. I think that’s a very real problem; a theoretical misaligned AI can hand us a million lines of code and say, ‘Run this, it’ll generate a cure for cancer and definitely not do bad things’, and in many cases it would be difficult-to-impossible to confirm that.
We could, as Tegmark and Omohundro propose, insist that it provide us a legible and machine-checkable proof of safety before we run it, but then we’re back to counting on all players to behave responsibly (although I can certainly imagine legislation / treaties that would help a lot there).