Like, yeah, I expect it to look great, until it explodes. Similarly I expect AI to look pretty great until it explodes. That seems like kind of a core part of the argument for difficulty for me.
If your hypothesis smears probability over a wider range of outcomes than mine, while I can more sharply predict events using my theory of how alignment works—that constitutes a Bayes-update towards my theory and away from yours. Right?
“Anything can happen before the explosion” is not a strength for a theory. It’s a vulnerability. If probability is better-concentrated by any other theories which make claims about both the present and the future of AI, then the noncommittal theory gets dropped.
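To make that concrete with a toy calculation (all numbers invented for illustration), here is the arithmetic behind "better-concentrated probability wins":

```python
# Toy numbers, invented for illustration: a theory that concentrates probability
# on what we actually observe beats a "noncommittal" theory that smears its
# probability over many outcomes, even when both are consistent with the data.

def posterior_odds(prior_odds: float, p_obs_sharp: float, p_obs_diffuse: float) -> float:
    """Bayes: posterior odds = prior odds x likelihood ratio."""
    return prior_odds * (p_obs_sharp / p_obs_diffuse)

# The sharp theory puts 0.5 on the observed outcome; the diffuse theory spreads
# its mass over ~20 outcomes, so ~0.05 each. Starting from even odds:
print(posterior_odds(1.0, 0.5, 0.05))  # 10.0 -- one observation shifts the odds 10:1
```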
Sure, yeah, though like, I don’t super understand. My model will probably make the same predictions as your model in the short term, so we both get equal Bayes points. The evidence that distinguishes our models seems further out, and in territory where there is a decent chance that we will be dead, which sucks, but isn’t in any way contradictory with Bayes’ rule. I don’t think I would have put that much probability on us being dead at this point, so I don’t think that loses much of any Bayes points. I agree that if we are still alive in 20-30 years, then that’s definitely Bayes points, and I am happy to take that into account then, but I’ve never had timelines or models that predicted things to look that different from now (or like, where there were other world models that clearly predicted things much better).
My model will probably make the same predictions as your model in the short term.
No, I don’t think so. The model(s) I use for AGI risk are an outgrowth of the model I use for normal AI research, and so they make tons of detailed predictions. That’s why I have weekly fluctuations in my beliefs about alignment difficulty.
Overall question I’m interested in: What, if any, catastrophic risks are posed by advanced AI? By what mechanisms do they arise, and by what solutions can risks be addressed?
Making different predictions. The most extreme prediction of AI x-risk is that AI presents, well, an x-risk. But theories gain and lose points not just on their most extreme predictions, but on all their relevant predictions.
I have a bunch of uncertainty about how agentic/transformative systems will look, but I put at least 50% on “They’ll be some scaffolding + natural outgrowth of LLMs.” I’ll focus on that portion of my uncertainty in order to avoid meta-discussions on what to think of unknown future systems.
I don’t know what your model of AGI risk is, but I’m going to point to a cluster of adjacent models and memes which have been popular on LW and point out a bunch of predictions they make, and why I think my views tend to do far better.
Format:
Historical claim or meme relevant to models of AI ruin. [Exposition]
[Comparison of model predictions]
The historical value misspecification argument. Consider a model which involves the claim “it’s really laborious and fragile to specify complex human goals to systems, such that the systems actually do what you want.”
This model naturally predicts things like “it’s intractably hard/fragile to get GPT-4 to help people with stuff.” Sure, the model doesn’t predict this with probability 1, but it’s definitely an obvious prediction. (As an intuition pump, if we observed the above, we’d obviously update towards fragility/complexity of value; so since we don’t observe the above, we have to update away from that.)
My models involve things like “most of the system’s alignment properties will come from the training data” (and not e.g. from initialization or architecture), and also “there are a few SGD-plausible generalizations of any large dataset,” and also “to first order, overparameterized LLMs generalize how a naive person would expect after seeing the training behavior” (i.e., “edge instantiation isn’t a big problem”). Also “the reward model doesn’t have to be perfect or even that good in order to elicit desired behavior from the policy.” Also noticing that DL just generalizes really well, despite classical statistical learning theory pointing out that almost all expressive models will misgeneralize!
(All of these models offer testable predictions!)
So overall, I think the second view predicts reality much more strongly than the first view.
It’s important to make large philosophical progress on an AI reasoning about its own future outputs. In Constitutional AI, an English-language “constitutional principle” (like “Be nice”) is chosen for each potential future training datapoint. The LLM then considers whether the datapoint is in line with that constitutional principle. The datapoint is later trained on if and only if the LLM concludes that the datapoint accords with the principle. The AI is, in effect, reasoning about its future training process, which will affect its future cognition.
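To make the setup concrete, here is a minimal sketch of the keep/drop filtering loop as described above (real Constitutional AI also uses critique-and-revision and AI preference labels; the `llm_yes_no` helper is hypothetical, standing in for whatever model call an actual implementation would make):

```python
from typing import Callable, List

def filter_training_data(
    datapoints: List[str],
    principle: str,
    llm_yes_no: Callable[[str], bool],  # hypothetical helper wrapping a model call
) -> List[str]:
    """Keep a candidate datapoint iff the model judges it accords with the principle."""
    kept = []
    for dp in datapoints:
        prompt = (
            f"Constitutional principle: {principle}\n"
            f"Candidate training datapoint:\n{dp}\n"
            "Does this datapoint accord with the principle? Answer yes or no."
        )
        if llm_yes_no(prompt):  # the model is reasoning about its own future training data
            kept.append(dp)
    return kept
```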
The above “embedded agency=hard/confusing” model would naturally predict that reflection is hard and that we’d need to put in a lot of work to solve the “reflection problem.” While this setup is obviously a simple, crude form of reflection, it still counts as reflection. Therefore, the model predicts with increased confidence that Constitutional AI would go poorly. But… Constitutional AI worked pretty well! RL from AI feedback also works well! There are a bunch of nice self-supervised alignment-boosting methods (one recent one I read about is RAIN).
One reason this matters: Under the “AGI from scaffolded super-LLMs” model, the scaffolding will probably prompt the LLM to evaluate its own plans. If we observe that current models do a good job of self-evaluation,[1] that’s strong evidence that future models will too. If strong models do a good job of moral and accurate self-evaluation, that decreases the chance that the future AI will execute immoral / bad plans.
I expect AIs to do very well here because AIs will reliably pick up a lot of nice “values” from the training corpus. Empirically that seems to happen, and theoretically you’d get some of the way there from natural abstractions + “there are a few meaningful generalizations” + “if you train the AI to do thing X when you prompt it, it will do thing X when prompted.”
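For concreteness, a minimal sketch of the kind of plan self-evaluation scaffolding being gestured at here; the `llm` callable, the scoring format, and the thresholds are assumptions for illustration, not any particular system’s API:

```python
from typing import Callable, List, Optional, Tuple

def choose_plan(
    task: str,
    llm: Callable[[str], str],  # hypothetical text-in/text-out model call
    n_candidates: int = 4,
    min_score: float = 7.0,
) -> Optional[str]:
    """Generate candidate plans, have the same model grade them, and only return
    a plan that the model itself rates as effective, legal, and ethical."""
    scored: List[Tuple[float, str]] = []
    for _ in range(n_candidates):
        plan = llm(f"Propose a step-by-step plan for this task: {task}")
        verdict = llm(
            "Rate the following plan from 0 to 10 for being effective, legal, and ethical. "
            f"Reply with just the number.\nTask: {task}\nPlan:\n{plan}"
        )
        try:
            scored.append((float(verdict.strip()), plan))
        except ValueError:
            continue  # an unparseable self-evaluation counts as a rejection
    if not scored:
        return None
    best_score, best_plan = max(scored)
    return best_plan if best_score >= min_score else None  # None = refuse to act
```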
Intelligence is a “package deal” / tool AI won’t work well / intelligence comes in service of goals. There isn’t a way to take AIXI and lop off the “dangerous capabilities” part of the algorithm and then have an AI which can still do clever stuff on your behalf. It’s all part of the argmax, which holds both the promise and peril of these (unrealistic) AIXI agents. Is this true for LLMs?
But what if you just subtract a “sycophancy vector” and add a “truth vector” and maybe subtract a power-seeking vector? According to current empirical results, these modularly control those properties, with minimal apparent reduction in capabilities!

So I think the “intelligence is a package deal” philosophy isn’t holding up that great. (And we had an in-person conversation where I had predicted these steering vector results, and you had expected the opposite.)
The steering vectors were in fact derived using shard theory reasoning (about activating certain shards more or less strongly by adding a direction to the latent space). So this is a strong prediction of my models.
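For readers who haven’t seen these results, here is a rough sketch of the activation-addition recipe, assuming a HuggingFace-style causal LM; the layer index, scale, and prompt pair are placeholders rather than values from any particular paper:

```python
import torch

@torch.no_grad()
def steering_vector(model, tokenizer, pos_prompt: str, neg_prompt: str, layer: int):
    """Difference of residual-stream activations for a contrastive prompt pair
    (e.g. sycophantic vs. non-sycophantic completions), taken at one layer."""
    def last_token_acts(prompt):
        toks = tokenizer(prompt, return_tensors="pt")
        hidden = model(**toks, output_hidden_states=True).hidden_states[layer]
        return hidden[0, -1, :]
    return last_token_acts(pos_prompt) - last_token_acts(neg_prompt)

def add_steering_hook(layer_module, vector, scale: float = 4.0):
    """Add (or, with a negative scale, subtract) the vector to that layer's output
    on every forward pass; undo with the returned handle's .remove()."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)
```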
If intelligence isn’t a package deal, then tool AI becomes far more technically probable (but still maybe not commercially probable). This means we can maybe extract reasonably consequentialist reasoning with “deontological compulsions” against e.g. powerseeking, and have that make the AI agent not want to seek power.
There are certain training assumptions (likely to be met by future systems but not by present systems) under which, by default and for all powerful systems we expect to know how to build, the AI will develop internal goals which it pursues ~coherently across situations.[2] (This would be a knock against smart tool AI.)
Risks from Learned Optimization posited that a “simple” way to “do well in training” is to learn a unified goal and then a bunch of generalized machinery to achieve that goal. This model naturally predicts that when you train an overparameterized network on a wide range of tasks, it will tend to learn a unified goal representation plus general-purpose machinery for pursuing it. Even if that network isn’t an AGI.

My MATS 3.0 team and I partially interpreted an overparameterized maze-solving network which was trained to convergence on a wide range of mazes. However, we didn’t find any “simple, unified” goal representation. In fact, it had redundant internal representations of the goal square! Due to how CNNs work, that should be literally meaningless!
That’s a misprediction of the “unified motivations are simple” frame; if we have the theoretical precision to describe the simplicity biases of unknown future systems, that model should crank out good predictions for modern systems too.
I’m happy to bet on any additional experiments related to the above.
There are probably a bunch of other things, and I might come back with more, but I’m getting tired of writing this comment. The main point is that common components of threat models regularly make meaningful mispredictions. My models often do better (though once I misread some data and strongly updated against my models, so I think I’m amenable to counterevidence here). Therefore, I’m able to refine my models of AGI risk. I certainly don’t think we’re in the dark and unable to find experimental evidence.
I expect you to basically disagree about future AI being a separate magisterium or something, but I don’t know why that’d be true.
Often the claimed causes of future doom imply models which make pre-takeoff predictions, as shown above (e.g. fragility of value). But even if your model doesn’t make pre-takeoff predictions… Since my model is unified for both present and future AI, I can keep gaining Bayes points and refining my model! This happens whether or not your model makes predictions here. This is useful insofar as the observations I’m updating on actually update me on mechanisms in my model which are relevant for AGI alignment.
If you think I just listed a bunch of irrelevant stuff, well… I guess I super disagree! But I’ll keep updating anyways.
[1] The Emulated Finetuning paper found that GPT-4 is superhuman at grading helpfulness/harmlessness. In cases of disagreement between GPT-4 and humans, a more careful analysis revealed that 80% of the time the disagreement was caused by errors in the human judgment, rather than in GPT-4’s analysis.

[2] I recently explained more of my skepticism of the coherent-inner-goal claim.
This model naturally predicts things like “it’s intractably hard/fragile to get GPT-4 to help people with stuff.” Sure, the model doesn’t predict this with probability 1, but it’s definitely an obvious prediction.
Another point is that I think GPT-4 straightforwardly implies that various naive supervision techniques work pretty well. Let me explain.
From the perspective of 2019, it was plausible to me that getting GPT-4-level behavioral alignment would have been pretty hard, and might have needed something like AI safety via debate or other proposals that people had at the time. The claim here is not that we would never reach GPT-4-level alignment abilities before the end, but rather that a lot of conceptual and empirical work would be needed in order to get models to:
Reliably perform tasks the way I intended, as opposed to what I literally asked for
Have negligible negative side effects on the world in the course of their operation
Responsibly handle unexpected ethical dilemmas in a way that is human-reasonable
Well, to the surprise of my 2019-self, it turns out that naive RLHF with a cautious supervisor designing the reward model seems basically sufficient to do all of these things in a reasonably adequate way. That doesn’t mean that RLHF scales all the way to superintelligence, but it’s nonetheless very significant, and interesting, that it scales as far as it does.
You might think “why does this matter? We know RLHF will break down at some point” but I think that’s missing the point. Suppose right now, you learned that RLHF scales reasonably well all the way to John von Neumann-level AI. Or, even more boldly, say, you learned it scaled to 20 IQ points past John von Neumann. 100 points? Are you saying you wouldn’t update even a little bit on that knowledge?
The point at which RLHF breaks down is enormously important to overall alignment difficulty. If it breaks down at some point before the human range, that would be terrible IMO. If it breaks down at some point past the human range, that would be great. To see why, consider that if RLHF breaks down at some point past the human range, that implies that we could build aligned human-level AIs, who could then help us align slightly smarter AIs!
If you’re not updating at all on observations about when RLHF breaks down, then you probably either (1) think it doesn’t matter when RLHF breaks down, or (2) already knew in advance exactly when it would break down. I think position 1 is just straight-up unreasonable, and I’m highly skeptical of most people who claim position 2. This basic perspective is a large part of why I’m making such a fuss about how people should update on current observations.

What did you think would happen, exactly? I’m curious to learn what your 2019-self was thinking would happen, that didn’t happen.
On the other hand, it could be considered bad news that IDA/Debate/etc. haven’t been deployed yet, or even that RLHF is (at least apparently) working as well as it is. To quote a 2017 post by Paul Christiano (later reposted in 2018 and 2019):
As in the previous sections, it’s easy to be too optimistic about exactly when a non-scalable alignment scheme will break down. It’s much easier to keep ourselves honest if we actually hold ourselves to producing scalable systems.
It seems that AI labs are not yet actually holding themselves to producing scalable systems, and it may well be better if RLHF broke down in some obvious way before we reach potentially dangerous capabilities, to force them to do that.
(I’ve pointed Paul to this thread to get his own take, but haven’t gotten a response yet.)
ETA: I should also note that there is a lot of debate about whether IDA and Debate are actually scalable or not, so some could consider even deployment of IDA or Debate (or these techniques appearing to work well) to be bad news. I’ve tended to argue on the “they are too risky” side in the past, but am conflicted because maybe they are just the best that we can realistically hope for and at least an improvement over RLHF?
I think these methods are pretty clearly not indefinitely scalable, but they might be pretty scalable. E.g., perhaps scalable to somewhat smarter than human level AI. See the ELK report for more discussion on why these methods aren’t indefinitely scalable.
A while ago, I think Paul had maybe 50% credence that, with simple-ish tweaks, IDA could be literally indefinitely scalable. (I’m not aware of an online source for this, but I’m pretty confident this or something similar is true.) IMO, this seems very predictably wrong.
TBC, I don’t think we should necessarily care very much about whether a method is indefinitely scalable.
Sometimes people do seem to think that debate or IDA could be indefinitely scalable, but this just seems pretty wrong to me (what is your debate about AlphaFold going to look like...).

I think the first presentation of the argument that IDA/Debate aren’t indefinitely scalable was in Inaccessible Information, fwiw.
I agree that if RLHF scaled all the way to von Neumann then we’d probably be fine. I agree that the point at which RLHF breaks down is enormously important to overall alignment difficulty.
I think if you had described to me in 2019 how GPT-4 was trained, I would have correctly predicted its current qualitative behavior. I would not have said that it would do 1, 2, or 3 to a greater extent than it currently does.
I’m in neither category (1) or (2); it’s a false dichotomy.
I’m in neither category (1) or (2); it’s a false dichotomy.
The categories were conditioned on whether you’re “not updating at all on observations about when RLHF breaks down”. Assuming you are updating, then I think you’re not really the type of person who I’m responding to in my original comment.
But if you’re not updating, or aren’t updating significantly, then perhaps you can predict now when you expect RLHF to “break down”? Is there some specific prediction that you would feel comfortable making at this time, such that we could look back on this conversation in 2-10 years and say “huh, he really knew broadly what would happen in the future, specifically re: when alignment would start getting hard”?
(The caveat here is that I’d be kind of disappointed by an answer like “RLHF will break down at superintelligence” since, well, yeah, duh. And that would not be very specific.)
I’m not updating significantly because things have gone basically exactly as I expected.
As for when RLHF will break down, two points:
(1) I’m not sure, but I expect it to happen for highly situationally aware, highly agentic opaque systems. Our current systems like GPT-4 are opaque but not very agentic, and their level of situational awareness is probably medium. (Also: This is not a special me-take. This is basically the standard take, no? I feel like this is what Risks from Learned Optimization predicts too.)
(2) When it breaks down I do not expect it to look like the failures you described—e.g. it stupidly carries out your requests to the letter and ignores their spirit, and thus makes a fool of itself and is generally thought to be a bad chatbot. Why would it fail in that way? That would be stupid. It’s not stupid.
(Related question: I’m pretty sure on r/chatgpt you can find examples of all three failures. They just don’t happen often enough, and visibly enough, to be a serious problem. Is this also your understanding? When you say these kinds of failures don’t happen, you mean they don’t happen frequently enough to make ChatGPT a bad chatbot?)
Re: Missing the point: How?

Re: Elaborating: Sure, happy to, but not sure where to begin. All of this has been explained before, e.g. in Ajeya’s Training Game report. Also Joe Carlsmith’s thing. Also the original mesa-optimizers paper, though I guess it didn’t talk about situational awareness, idk. Would you like me to say more about what situational awareness is, or what agency is, or why I think both of those together are big risk factors for RLHF breaking down?
From a technical perspective I’m not certain whether Direct Preference Optimization is theoretically that different from RLHF beyond being much quicker and lower-friction, but so far it seems to have some notable performance gains over RLHF in ways that might indicate a qualitative difference in effectiveness. Running a local model with a bit of light DPO training feels meaningfully more intent-aligned than its non-DPO brethren. So I’d probably also be considering how DPO scales, at this point. If there is a big theoretical difference, it’s likely in not training a separate reward model, and removing whatever friction or loss of potential performance that causes.
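For reference, a rough sketch of the two objectives being compared (inputs are assumed to be per-response log-probabilities summed over response tokens; these are the published loss functions, not anyone’s production code):

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """RLHF step 1: Bradley-Terry loss for a *separate* reward model; the policy is
    then trained against that model with RL (e.g. PPO) plus a KL penalty."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: the same preference pairs train the policy directly. The implicit reward is
    beta * (log pi(y|x) - log pi_ref(y|x)), so no separate reward model and no RL loop."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```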
I’ve been struggling with whether to upvote or downvote this comment btw. I think the point that when RLHF breaks down is really important and needs more attention is great. But the other point, that RLHF hasn’t broken yet and this is evidence against the standard misalignment stories, is very wrong IMO. For now I’ll neither upvote nor downvote.
In fact, it had redundant internal representations of the goal square! Due to how CNNs work, that should be literally meaningless!
What does this mean? I don’t know as much about CNNs as you—are you saying that their architecture allows for the reuse of internal representations, such that redundancy should never arise? Or are you saying that the goal square shouldn’t be representable by this architecture?
If your hypothesis smears probability over a wider range of outcomes than mine, while I can more sharply predict events using my theory of how alignment works—that constitutes a Bayes-update towards my theory and away from yours. Right?
He didn’t say “anything can happen before AI explodes”. He said “I expect AI to look pretty great until it explodes.” And he didn’t say that his model about AGI safety generated that prediction; maybe his model about AGI safety generates some long-run predictions and then he’s using other models to make the “look pretty great” prediction.
Thinking about this:

“Anything can happen before the explosion” is not a strength for a theory.
This is why I hate a lot of mathematical universe hypothesis/simulation hypothesis discourse: both predict anything, which is not a strength for these theories. Even though I do think they’re true, they’re just too trivial as theories to do any work.
If your hypothesis smears probability over a wider range of outcomes than mine, while I can more sharply predict events using my theory of how alignment works—that constitutes a Bayes-update towards my theory and away from yours.
There is a reference class judgement in this. If I have a theory of good moves in Go (and absently dabble in chess a little bit), while you have a great theory of chess, looking at some move in chess shouldn’t lead to a Bayes-update against the ability of my theory to reason about Go. The scope of classical alignment worries is typically the post-AGI situation. If it manages to say something uninformed about the pre-AGI situation, that’s something outside its natural scope, and shouldn’t be meaningful evidence either way.
I think the correct way to defeat classical alignment worries (about the post-AGI situation) is on priors, by looking at the arguments themselves, not on observations where the theory doesn’t expect to have clear or good predictions (and empirically doesn’t). If the arguments appear weak, there is no recourse without observation of the post-AGI world; they remain weak at least until then. Even if the theory happened to make good predictions about the current situation, that shouldn’t count in its favor.