This model naturally predicts things like “it’s intractably hard/fragile to get GPT-4 to help people with stuff.” Sure, the model doesn’t predict this with probability 1, but it’s definitely an obvious prediction.
Another point is that I think GPT-4 straightforwardly implies that various naive supervision techniques work pretty well. Let me explain.
From the perspective of 2019, it was plausible to me that getting GPT-4-level behavioral alignment would have been pretty hard, and might have needed something like AI safety via debate or other proposals that people had at the time. The claim here is not that we would never reach GPT-4-level alignment abilities before the end, but rather that a lot of conceptual and empirical work would be needed in order to get models to:
1. Reliably perform tasks how I intended as opposed to what I literally asked for
2. Have negligible negative side effects on the world in the course of its operation
3. Responsibly handle unexpected ethical dilemmas in a way that is human-reasonable
Well, to the surprise of my 2019-self, it turns out that naive RLHF with a cautious supervisor designing the reward model seems basically sufficient to do all of these things in a reasonably adequate way. That doesn’t mean that RLHF scales all the way to superintelligence, but it’s very significant nonetheless and interesting that it scales as far as it does.
You might think “why does this matter? We know RLHF will break down at some point” but I think that’s missing the point. Suppose right now, you learned that RLHF scales reasonably well all the way to John von Neumann-level AI. Or, even more boldly, say, you learned it scaled to 20 IQ points past John von Neumann. 100 points? Are you saying you wouldn’t update even a little bit on that knowledge?
The point at which RLHF breaks down is enormously important to overall alignment difficulty. If it breaks down at some point before the human range, that would be terrible IMO. If it breaks down at some point past the human range, that would be great. To see why, consider that if RLHF breaks down at some point past the human range, that implies that we could build aligned human-level AIs, who could then help us align slightly smarter AIs!
If you’re not updating at all on observations about when RLHF breaks down, then you probably either (1) think it doesn’t matter when RLHF breaks down, or (2) you already knew in advance exactly when it would break down. I think position 1 is just straight-up unreasonable, and I’m highly skeptical of most people who claim position 2. This basic perspective is a large part of why I’m making such a fuss about how people should update on current observations.
On the other hand, it could be considered bad news that IDA/Debate/etc. haven’t been deployed yet, or even that RLHF is (at least apparently) working as well as it is. To quote a 2017 post by Paul Christiano (later reposted in 2018 and 2019):
As in the previous sections, it’s easy to be too optimistic about exactly when a non-scalable alignment scheme will break down. It’s much easier to keep ourselves honest if we actually hold ourselves to producing scalable systems.
It seems that AI labs are not yet actually holding themselves to producing scalable systems, and it may well be better if RLHF broke down in some obvious way before we reach potentially dangerous capabilities, to force them to do that.
(I’ve pointed Paul to this thread to get his own take, but haven’t gotten a response yet.)
ETA: I should also note that there is a lot of debate about whether IDA and Debate are actually scalable or not, so some could consider even deployment of IDA or Debate (or these techniques appearing to work well) to be bad news. I’ve tended to argue on the “they are too risky” side in the past, but am conflicted because maybe they are just the best that we can realistically hope for and at least an improvement over RLHF?
I think these methods are pretty clearly not indefinitely scalable, but they might be pretty scalable. E.g., perhaps scalable to somewhat smarter than human level AI. See the ELK report for more discussion on why these methods aren’t indefinitely scalable.
A while ago, I think Paul had maybe 50% that with simple-ish tweaks IDA could be literally indefinitely scalable. (I’m not aware of an online source for this, but I’m pretty confident this or something similar is true.) IMO, this seems very predictably wrong.
TBC, I don’t think we should necessarily care very much about whether a method is indefinitely scalable.
Sometimes people do seem to think that debate or IDA could be indefinitely scalable, but this just seems pretty wrong to me (what is your debate about AlphaFold going to look like...).
I agree that if RLHF scaled all the way to von Neumann then we’d probably be fine. I agree that the point at which RLHF breaks down is enormously important to overall alignment difficulty.
I think if you had described to me in 2019 how GPT-4 was trained, I would have correctly predicted its current qualitative behavior. I would not have said that it would do 1, 2, or 3 to a greater extent than it currently does.
I’m in neither category (1) nor (2); it’s a false dichotomy.
The categories were conditioned on whether you’re “not updating at all on observations about when RLHF breaks down”. Assuming you are updating, then I think you’re not really the type of person who I’m responding to in my original comment.
But if you’re not updating, or aren’t updating significantly, then perhaps you can predict now when you expect RLHF to “break down”? Is there some specific prediction that you would feel comfortable making at this time, such that we could look back on this conversation in 2-10 years and say “huh, he really knew broadly what would happen in the future, specifically re: when alignment would start getting hard”?
(The caveat here is that I’d be kind of disappointed by an answer like “RLHF will break down at superintelligence” since, well, yeah, duh. And that would not be very specific.)
I’m not updating significantly because things have gone basically exactly as I expected.
As for when RLHF will break down, two points:
(1) I’m not sure, but I expect it to happen for highly situationally aware, highly agentic opaque systems. Our current systems like GPT-4 are opaque but not very agentic and their level of situational awareness is probably medium. (Also: This is not a special me-take. This is basically the standard take, no? I feel like this is what Risks from Learned Optimization predicts too.)
(2) When it breaks down I do not expect it to look like the failures you described—e.g. it stupidly carries out your requests to the letter and ignores their spirit, and thus makes a fool of itself and is generally thought to be a bad chatbot. Why would it fail in that way? That would be stupid. It’s not stupid.
(Related question: I’m pretty sure on r/chatgpt you can find examples of all three failures. They just don’t happen often enough, and visibly enough, to be a serious problem. Is this also your understanding? When you say these kinds of failures don’t happen, you mean they don’t happen frequently enough to make ChatGPT a bad chatbot?)
Re: Elaborating: Sure, happy to, but not sure where to begin. All of this has been explained before, e.g. in Ajeya’s Training Game report. Also Joe Carlsmith’s thing. Also the original mesaoptimizers paper, though I guess it didn’t talk about situational awareness idk. Would you like me to say more about what situational awareness is, or what agency is, or why I think both of those together are big risk factors for RLHF breaking down?
From a technical perspective I’m not certain if Direct Preference Optimization is theoretically that much different from RLHF beyond being much quicker and lower friction at what it does, but so far it seems like it has some notable performance gains over RLHF in ways that might indicate a qualitative difference in effectiveness. Running a local model with a bit of light DPO training feels more intent-aligned compared to its non-DPO brethren in a pretty meaningful way. So I’d probably be considering also how DPO scales, at this point. If there is a big theoretical difference, it’s likely in not training a separate model, and removing whatever friction or loss of potential performance that causes.
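To make the "no separate model" point concrete, here is a minimal sketch of the per-example DPO objective as I understand it from the DPO paper (Rafailov et al., 2023). The function names, the scalar log-probability inputs, and the default `beta` are my own illustrative assumptions, not anyone's actual training code. The thing to notice is that the loss only needs sequence log-probabilities from the policy being trained and from a frozen reference copy; no reward model appears anywhere.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """Per-example DPO loss from summed sequence log-probabilities.

    Inputs are log pi(y | x) for the preferred ("chosen") and dispreferred
    ("rejected") responses, under the trained policy and the frozen reference.
    """
    # How much the policy has moved each response relative to the reference.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected

    # Implicit reward margin between the two responses.
    margin = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one. In RLHF the same preference pairs would instead be used to fit a reward model first, which is then optimized against with RL, so that is roughly where any extra "friction" would come from.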
I’ve been struggling with whether to upvote or downvote this comment btw. I think the point about how it’s really important when RLHF breaks down and more attention needs to be paid to this is great. But the other point about how RLHF hasn’t broken yet and this is evidence against the standard misalignment stories is very wrong IMO. For now I’ll neither upvote nor downvote.
What did you think would happen, exactly? I’m curious to learn what your 2019-self was thinking would happen, that didn’t happen.
I think the first presentation of the argument that IDA/Debate aren’t indefinitely scalable was in Inaccessible Information, fwiw.
Re: Missing the point: How?