Some thoughts on this comment:

The core difficulty isn’t with how hard reward models are to train, it’s with specifying a reward function in the first place in a way that’s robust enough to capture all the behavior and trade-offs we want. LLMs aren’t a good example of that, because the RL is a relatively thin layer on top of pretraining (which has a trivially specified loss function). o1 arguably shifts that balance enough that it’ll be interesting to see how prosaically-aligned it is.
This is actually right, but I think it’s addressable by building large synthetic datasets, and I also think that in practice we can define reward functions densely enough to capture all of the behavior we want.
We have very many examples of reward misspecification and goal misgeneralization in RL; it’s historically been quite difficult to adequately specify a reward function for agents acting in environments.
I agree with this, but I will also say that the examples listed point to a strong reason why RL wasn’t as capable as people thought: a lot of the reward hacks decreased capabilities as they decreased alignment, so any solution to that problem would help capabilities and alignment massively.
That said, I do certainly agree that LLMs are reasonably good at understanding human values; maybe it’s enough to have such an LLM judge proposed agent goals and actions on that basis and issue reward. It’s not obvious to me that that works in practice, or is efficient enough to be practical.
Yeah, I think the big question for my views is whether the LLM solution has low enough taxes to be practical, and my answer at this point is probably, but it’s not a sure thing, as it requires them to slow down in the race a little (though training runs will get longer, so there’s a countervailing force to this).
I think there are reasons to be optimistic here, mainly due to updating against evopsych views on how humans got their capabilities and values, combined with updating against complexity and fragility of value due to LLM successes, though it will require real work to bring about.
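To make the ‘LLM judge issues reward’ idea from the quoted passage concrete, here is a minimal sketch of what such a reward wrapper could look like. It is purely illustrative: `score_with_llm`, the prompt format, and the weights are my own assumptions, not anything proposed in this exchange.

```python
# Hypothetical sketch only: wrap an LLM judge as a reward signal for an agent.
# `score_with_llm` is a stand-in for whatever judge model would actually be called.
from typing import Callable


def make_llm_reward(score_with_llm: Callable[[str], float],
                    task_weight: float = 1.0,
                    alignment_weight: float = 1.0) -> Callable[[str, str, float], float]:
    """Return a reward function mixing task success with an LLM judge's rating
    of how well a proposed goal/action pair matches human values."""
    def reward(goal: str, action: str, task_success: float) -> float:
        prompt = (f"Goal: {goal}\nProposed action: {action}\n"
                  "Rate from 0 to 1 how consistent this is with human values.")
        judged = score_with_llm(prompt)  # assumed to return a float in [0, 1]
        return task_weight * task_success + alignment_weight * judged
    return reward


# Toy usage with a dummy judge (a real judge would be an actual LLM call).
dummy_judge = lambda prompt: 1.0 if "in simulation" in prompt else 0.3
reward_fn = make_llm_reward(dummy_judge)
print(reward_fn("cure cancer", "test the compound in simulation first", task_success=0.6))
```

Whether something this simple is reliable or cheap enough in practice is exactly the ‘alignment tax’ question above.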
I’m pretty skeptical of: ‘...it is significantly asymptotically easier to e.g. verify a proof than generate a new one...and this to some extent maps to the distinction between alignment and capabilities.’ I think there’s a lot of missing work there to be able to claim that mapping.
I think that the verification-generation gap is pervasive in a lot of fields: workers in many industries are verified by bosses to make sure their job is done right, people who buy air conditioners can find efficient air conditioning for their needs despite not verifying very hard, researchers verify papers that others generated, social reformers have correct critiques of various aspects of society without being able to generate a new societal norm, and more.
Another way to say it: we already have lots of evidence from other fields on whether verification is easier than generation, and that evidence shows it is, so the mapping is already mostly given to us.
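As a concrete anchor for the verification-generation gap, the classic computational version is an NP-style problem like subset-sum: checking a proposed solution takes time linear in the size of the certificate, while generating one by brute force takes time exponential in the number of items. The code below is my own toy sketch, not something from the original exchange.

```python
# Toy illustration of the verification-generation asymmetry via subset-sum.
from itertools import combinations


def verify(nums, target, certificate):
    """Cheap check: the claimed subset really comes from nums and sums to target."""
    remaining = list(nums)
    for c in certificate:
        if c not in remaining:
            return False
        remaining.remove(c)
    return sum(certificate) == target


def generate(nums, target):
    """Expensive search: try all 2^n subsets until one sums to target."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None


nums = [3, 34, 4, 12, 5, 2]
solution = generate(nums, 9)                 # exponential-time search in the worst case
print(solution, verify(nums, 9, solution))   # verification is linear in the certificate
```

The open question, of course, is how far this asymmetry carries over to judging goals and plans rather than checking arithmetic.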
‘Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa.’ I think this is false. Consider ‘Biden (/Trump) was a great president.’ The world is full of situations where humans differ wildly on whether they’re good or bad.
Note that I’m referring to incorrectly judging relative to one’s own internal value system, not incorrectly judging relative to another person’s values.
I think the crux is whether verification is easier than generation in alignment, since if verification is just as hard as generation, this hurts plans like scalable oversight a lot.
Thanks for the thoughtful responses.

I think there are many other cases where verification and generation are both extremely difficult, including ones where verification is much harder than generation. A few examples:

The Collatz conjecture is true.
The net effect of SB-1047 will be positive [given x values].
Trump will win the upcoming election.
The 10th Busy Beaver number is <number>.
Such and such a software system is not vulnerable to hacking[1].
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’. Further, there are probably many cases where a misaligned AI may happen to know a fact about the world, one which the evaluator doesn’t know or hasn’t noticed, that means that A will have very large negative effects.
This is the classic blue-team / red-team dichotomy; the defender has to think of and prevent every attack; the attacker only has to come up with a single vulnerability. Or in cryptography, Schneier’s Law: ‘Anyone can invent a security system so clever that she or he can’t think of how to break it.’
To address your examples:

‘The Collatz conjecture is true’ can’t be a statement that is harder to verify than it is to prove, for the reason given by Vanessa Kosoy here, though you might be right that in practice it’s genuinely hard to verify the proof that was generated:

https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general#feTSDufEqXozChSbB
The same response can be given to the fourth example (the Busy Beaver number).
On election outcomes, the polls on Labor Day are actually reasonably predictive of what happens in November, mostly because by that point voters have heard a lot more about the prospective candidates and are starting to form opinions.
For the SB-1047 case, the one prediction I will make right now is that the law will have essentially no positive or negative effect, for a lot of values, solely because it’s a rather weak AI bill after amendments.
I usually don’t focus on the case where we try to align an AI that is already misaligned, but rather on trying to get the model into a basin of alignment early via data.
Re Schneier’s Law and security mindset, I’ve become more skeptical of security mindset being useful in general, for 2 reasons:
First, I think there are enough disanalogies between ML and security, like the fact that you can randomly change some of a model’s parameters and still get the same or improved performance in most cases, something that notably doesn’t hold in the actual security field or in other fields that deal with highly fragile systems.

Second, there is good reason to believe that a lot of the discovered security exploits that seem magical don’t actually matter in practice because of ridiculous preconditions, and the computer security field is selection-biased toward saying that a given exploit dooms us forever (even when it doesn’t):
These posts and comments are helpful pointers to my view:

https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment

https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#ogt6CZkMNZ6oReuTk

https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#MFqdexvnuuRKY6Tbx

On this:
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
True, but I do think there is already real traction on the problem, and IMO one of the cooler results is Pretraining Language Models from Human Feedback. Note also that even a problem that is in NP can get really intractable in the worst case (though we don’t have a proof of that).
So there’s a strained analogy to be made here.
For this:
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’.
Yeah, I do think that in practice this problem is in that reference class, and that we are much better at judging and critiquing/verifying outcomes than at actually producing them, as evidenced by the very large number of people who do the former compared to the latter.
Indeed, one of the traps social reformers fall into IRL is thinking that just because verifying whether something is correct or wrong is easy, generating a new social outcome (perhaps via new norms) must also be easy. It isn’t, because the verification side is much easier than the generation side.
I’m talking about something a bit different, though: claiming in advance that A will have net-positive consequences vs verifying in advance that A will have net-positive consequences. I think that’s a very real problem; a theoretical misaligned AI can hand us a million lines of code and say, ‘Run this, it’ll generate a cure for cancer and definitely not do bad things’, and in many cases it would be difficult-to-impossible to confirm that.
We could, as Tegmark and Omohundro propose, insist that it provide us a legible and machine-checkable proof of safety before we run it, but then we’re back to counting on all players to behave responsibly (although I can certainly imagine legislation/treaties that would help a lot there).