Opinion: disagreements about OpenAI’s strategy are substantially empirical.
I think that some of the main reasons why people in the alignment community might disagree with OpenAI’s strategy come down to disagreements about empirical facts. In particular, compared to people in the alignment community, OpenAI leadership tend to assign more probability to slow takeoff, are more optimistic about the possibility of solving alignment (especially via empirical methods that rely on capabilities), and are more concerned about bad actors developing and misusing AGI. I would expect OpenAI leadership to change their minds on these questions given clear enough evidence to the contrary.
See, this is exactly the problem. Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late. That is the fundamental reason why alignment is harder than other scientific fields. Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, “getting what you measure” in slow takeoff: however you frame it, the issue is the same. Things look fine early on, and go wrong later.
And as far as I can tell, OpenAI as an org just totally ignores that whole class of issues/arguments, and charges ahead assuming that if they don’t see a problem then there isn’t a problem (and meanwhile does things which actively select for hiding problems, e.g. RLHF).
To clarify, by “empirical” I meant “relating to differences in predictions” as opposed to “relating to differences in values” (perhaps “epistemic” would have been better). I did not mean to distinguish between experimental versus conceptual evidence. I would expect OpenAI leadership to put more weight on experimental evidence than you, but to be responsive to evidence of all kinds. I think that OpenAI leadership are aware of most of the arguments you cite, but came to different conclusions than you did after considering them.
[First of all, many thanks for writing the post; it seems both useful and the kind of thing that’ll predictably attract criticism]
I’m not quite sure what you mean to imply here (please correct me if my impression is inaccurate—I’m describing how-it-looks-to-me, and I may well be wrong):
I would expect OpenAI leadership to put more weight on experimental evidence than you...
Specifically, John’s model (and mine) has:
X = [Class of high-stakes problems on which we’ll get experimental evidence before it’s too late]
Y = [Class of high-stakes problems on which we’ll get no experimental evidence before it’s too late]
Unless we expect Y to be empty, when we’re talking about Y-problems the weighting is irrelevant: we get no experimental evidence.
Weighting of evidence is an issue when dealing with a fixed problem. It seems here as if it’s being used to select the problem: we’re going to focus on X-problems because we put a lot of weight on experimental evidence. (obviously silly, so I don’t imagine anyone consciously thinks like this—but out-of-distribution intuitions may be at work)
What kind of evidence do you imagine would lead OpenAI leadership to change their minds/approach? Do you / your-model-of-leadership believe that there exist Y-problems?
I don’t think I understand your question about Y-problems, since it seems to depend entirely on how specific something can be and still count as a “problem”. Obviously there is already experimental evidence that informs predictions about existential risk from AI in general, but we will not get experimental evidence of the exact situation that occurs before it happens. My claim was more of a vague impression about how OpenAI leadership and John tend to respond to different kinds of evidence in general, and I do not hold it strongly.
To rephrase, it seems to me that in some sense all evidence is experimental. What changes is the degree of generalisation/abstraction required to apply it to a particular problem.
Once we make the distinction between experimental and non-experimental evidence, we allow for problems on which we only get the “non-experimental” kind—i.e. the kind requiring sufficient generalisation/abstraction that we’d no longer tend to think of it as experimental.
So the question on Y-problems becomes something like:
Given some characterisation of [experimental evidence] (e.g. whatever you meant that OpenAI leadership would tend to put more weight on than John)...
...do you believe there are high-stakes problems for which we’ll get no decision-relevant [experimental evidence] before it’s too late?
Alignment as a field is hard precisely because we do not expect to see empirical evidence before it is too late.
I don’t think this is the core reason that alignment is hard—even if we had access to a bunch of evidence about AGI misbehavior now, I think it’d still be hard to convert that into a solution for alignment. Nor do I believe we’ll see no empirical evidence of power-seeking behavior before it’s too late (and I think opinions amongst alignment researchers are pretty divided on this question).
I don’t think this is the core reason that alignment is hard—even if we had access to a bunch of evidence about AGI misbehavior now, I think it’d still be hard to convert that into a solution for alignment.
If I imagine that we magically had a boxing setup which let us experiment with powerful AGI alignment without dying, I do agree it would still be hard to solve alignment. But it wouldn’t be harder than the core problems of any other field of science/engineering. It wouldn’t be unusually hard, by the standards of technical research.
Of course, “empirical evidence of power-seeking behavior” is a lot weaker than a magical box. With only that level of empirical evidence, most of the “no empirical feedback” problem would still be present. More on that next.
Nor do I believe we’ll see no empirical evidence of power-seeking behavior before it’s too late (and I think opinions amongst alignment researchers are pretty divided on this question).
The key “lack of empirical feedback” property in Goodhart, deceptive alignment, hard left turn, get what you measure, etc, is this: for any given AI, it will look fine early on (e.g. in training or when optimization power is low) and then things will fall apart later on. If we are lucky enough to be in a very-slow-takeoff world, then an empirically-minded person might still notice that their AIs keep falling apart in deployment, and conclude that alignment is a problem. I don’t put very high probability on that (partly because of the very-slow-takeoff assumption and partly because scenarios like getting what we measure don’t necessarily look like a problem with the AI), but I buy it as a basically-plausible story.
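To make that shape concrete, here is a minimal toy sketch (purely illustrative: the objective, the proxy, and every number below are made up, and none of it corresponds to any real training setup). A greedy search climbs a proxy that tracks the true objective while optimization pressure is low, and keeps climbing after the true objective has turned downward:

```python
import random

# Purely illustrative toy: a made-up "true" objective and a made-up proxy,
# not any real training setup. The only point is the qualitative shape:
# the proxy looks fine under weak optimization and comes apart under
# strong optimization.

random.seed(0)

def true_value(x):
    # What we actually care about: improves up to x = 1, then degrades.
    return x - 0.5 * x ** 2

def proxy_value(x):
    # What we can measure: tracks the true objective for small x,
    # but keeps rewarding ever-larger x.
    return x + 0.05 * random.gauss(0, 1)

def optimize_proxy(steps, step_size=0.5):
    # Greedy hill-climbing on the proxy alone.
    x = 0.0
    for _ in range(steps):
        candidate = x + step_size
        if proxy_value(candidate) > proxy_value(x):
            x = candidate
    return x

for steps in (1, 2, 4, 8, 16):
    x = optimize_proxy(steps)
    print(f"steps={steps:2d}  proxy={proxy_value(x):6.2f}  true={true_value(x):6.2f}")
# With few steps, proxy and true value rise together ("looks fine early on");
# with many steps, the proxy keeps climbing while the true value collapses
# ("goes wrong later").
```

The specifics don’t matter at all; the toy is just the simplest way I know to exhibit the shape where early feedback is reassuring and the failure only shows up once the proxy has been pushed hard.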
But that doesn’t really change the problem that much, for multiple reasons (any one of which is sufficient):
Even if we put only a low probability on not getting a warning shot, we probably don’t want to pursue a strategy in which humanity goes extinct if we don’t get a fire alarm. Thinking we’ll probably get a warning shot makes sense to me; relying on a warning shot while plowing ahead building AGI is idiotic. The downside is far too large for the risk to make sense unless we are unrealistically confident that there will definitely be a warning shot.
Training AI in ways which will obviously incentivize it to hide problems (e.g. RLHF), and therefore make a warning shot less likely, is similarly foolish even if we think we’ll probably get a warning shot.
The failure mode I actually think is most likely: we do get a warning shot, and then people try to train away the problems until the problems cease to be visible. And that fails because of Goodhart, deception, hard left turn, getting what you measure, etc.
Psychologizing a bit: I suspect both RLHF and reliance on warning shots are symptoms of a more general cognitive pattern where people just don’t believe in anything they can’t see, and “iterate until we can’t see any problem” is very much the sort of strategy I expect such people to use. (I believe it’s also the strategy suggested by OP’s phrase “empirical methods that rely on capabilities”.) It’s not just about “not getting empirical evidence” in terms of a warning shot, it’s about not getting empirical evidence about alignment of any given powerful AGI until it’s too late, and that problem interacts very poorly with a mindset where people iterate a lot and don’t believe in problems they can’t see.
That’s the sort of problem which doesn’t apply in most scientific/engineering fields. In most fields, “iterate until we can’t see any problem” is a totally reasonable strategy. Alignment as a field is unusually hard because we can’t use that strategy; the core failure modes we’re worried about all involve problems which aren’t visible on the AI at hand until it’s too late.
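Here’s a similarly artificial sketch of why “iterate until we can’t see any problem” isn’t reassuring (again, made-up numbers, not a model of anyone’s actual process). Each candidate system has some flaws, our evals only catch each flaw with some probability, and the loop fixes whatever it can see:

```python
import random

# Another made-up toy, not a model of any real lab's process: "iterate
# until we can't see any problem" when some problems are invisible.
# Each candidate system starts with some flaws; each flaw is independently
# visible to our evals with probability DETECT. The loop removes whatever
# it can see and declares success once nothing visible remains.

random.seed(0)
N_RUNS = 100_000
MAX_FLAWS = 6   # flaws per candidate, drawn uniformly from 0..MAX_FLAWS (made up)
DETECT = 0.7    # chance that any single flaw is visible to our evals (made up)

def iterate_until_clean():
    # True = visible flaw, False = hidden flaw.
    flaws = [random.random() < DETECT for _ in range(random.randint(0, MAX_FLAWS))]
    while any(flaws):                        # "we can still see a problem"
        flaws = [f for f in flaws if not f]  # fix visible flaws; hidden ones remain
    return len(flaws)                        # whatever is left is invisible

results = [iterate_until_clean() for _ in range(N_RUNS)]
frac_hidden = sum(r > 0 for r in results) / N_RUNS
print("runs ending with no visible problems: 100% (by construction)")
print(f"of those, runs still carrying at least one hidden flaw: {frac_hidden:.2%}")
```

By construction the procedure always ends with a system that looks clean, yet a large fraction of the “clean” systems still carry flaws: precisely the flaws that were selected to be the invisible kind.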
Huh, I thought you agreed with statements like “if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier”.
My model is that John is talking about “evidence on whether an AI alignment solution is sufficient”, and you understood him to say “evidence on whether the AI Alignment problem is real/difficult”. My guess is you both agree on the former, but I am not confident.
Huh, I thought you agreed with statements like “if we had many shots at AI Alignment and could get reliable empirical feedback on whether an AI Alignment solution is working, AI Alignment would be much easier”.
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn’t help you very much if you have no idea how rockets work).
I don’t really know what “reliable empirical feedback” means in this context—if you have sufficiently reliable feedback mechanisms, then you’ve solved most of the alignment problem. But, out of the things John listed:
Goodhart problems in outer alignment, deception in inner alignment, phase change in hard takeoff, “getting what you measure” in slow takeoff
I expect that we’ll observe a bunch of empirical examples of each of these things happening (except for the hard takeoff phase change), and not know how to fix them.
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn’t help you very much if you have no idea how rockets work).
I do really feel like it would have been extremely hard to build rockets if we had to get it right on the very first try.
I think that for rockets, the fact that it is so costly to experiment explains the majority of the difficulty of rocket engineering. I agree you also have very little chance of building a successful space rocket without a good understanding of Newtonian mechanics and some aspects of relativity, but I don’t know, if I could just launch a rocket every day without bad consequences, I am pretty sure I wouldn’t really need a deep understanding of either of those, or would easily figure out the relevant bits as I kept experimenting.
The reason rocket science relies so much on solid theoretical models is that we have to get things right in only a few shots. I don’t think you really needed any particularly good theory to build trains, for example. Just a lot of attempts and tinkering.
At a sufficiently high level of abstraction, I agree that “cost of experimenting” could be seen as the core difficulty. But at a very high level of abstraction, many other things could also be seen as the core difficulty, like “our inability to coordinate as a civilization” or “the power of intelligence” or “a lack of interpretability”, etc. Given this, John’s comment seemed like mainly rhetorical flourishing rather than a contentful claim about the structure of the difficult parts of the alignment problem.
Also, I think that “on our first try” thing isn’t a great framing, because there are always precursors (e.g. we landed a man on the moon “on our first try” but also had plenty of tries at something kinda similar). Then the question is how similar, and how relevant, the precursors are—something where I expect our differing attitudes about the value of empiricism to be the key crux.
Well you could probably build a rocket that looks like it works, anyways. Could you build one you would want to try to travel to the moon in? (Are you imagining you get to fly in these rockets? Or just launch and watch from ground? I was imagining the 2nd...)
I agree that having many shots is helpful, but lacking them is not the core difficulty (just as having many shots to launch a rocket doesn’t help you very much if you have no idea how rockets work).
I basically buy that argument, though I do still think lack of shots is the main factor which makes alignment harder than most other technical fields in their preparadigmatic stage.
“Harder” can have two meanings: “the program (of design, and the proof) is longer” and “the program is less likely to be generated in the real world”. These meanings are correlated, but not identical.