Eliezer adds: “For this reason, please note explicitly if you’re saying things that you heard from a MIRI person at a gathering, or the like.”
My ~2-hour reaction to the challenge:[1]
(I) I have a general point of confusion regarding the post: To the extent that this is an officially endorsed plan, who endorses the plan?
Reason for confusion / observations: If someone told me they are in charge of an organization that plans to build AGI, and this is their plan, I would immediately object that the arguments ignore the part where progress on their “alignment plan” makes a significant contribution to capabilities research. Therefore, in the worlds where the proposed strategy fails, they are making things actively worse, not better. So their plan is perhaps not unarguably harmful, but certainly irresponsible.[2] For this reason, I find it unlikely that the post is endorsed as a strategy by OpenAI’s leadership.
(III)[3] My assumption: To make sense of the text, I will from now on assume that the post is endorsed by OpenAI’s alignment team only, and that the team is in a position where they cannot affect the actions of OpenAI’s capabilities team in any way. (Perhaps except to the extent that their proposals would only incur a near-negligible alignment tax.) They are simply determined to make the best use of the research that would happen anyway. (I don’t have any inside knowledge of OpenAI. This assumption seems plausible to me, and very sad.)
(IV) A general comment that I would otherwise need to repeat for essentially every point I make is the following: OpenAI should set up a system that will (1) let them notice if their assumptions turn out to be mistaken and (2) force them to course-correct if that happens. In several places, the post explicitly states, or at least implies, critical assumptions about the nature of AI, AI alignment, or other topics. However, it does not include any ways of noticing if these assumptions turn out not to hold. To act responsibly, OpenAI should (at the minimum): (A) Make these assumptions explicit. (B) Make these hypotheses falsifiable by publicizing predictions, or other criteria they could use to check the assumptions. (C) Set up a system for actually checking (B), and course-correcting if the assumptions turn out false.
Assumptions implied by OpenAI’s plans, with my reactions:
(V) Easy alignment / warning shots for misaligned AGI:
“Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: [...]” My biggest objection to the whole plan already concerns the second sentence of the post: relying on a trial-and-error approach. I assume OpenAI believes either: (1) The proposed alignment plan is so unlikely to fail that we don’t need to worry about the worlds where it does fail. Or (2) In the worlds where the plan fails, we will have clear warning shots. (I personally believe this is suicidal. I don’t expect people to automatically agree, but with everything at stake, they should be open to signs of being wrong.)
(VI) “AGI alignment” isn’t “AGI complete”:
This is already acknowledged in the post: “It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems.” However, it isn’t exactly clear what precise assumptions are being made here. Moreover, there is no vision for how to monitor whether the assumptions hold or not. Do we keep iterating on AI capabilities, each time hoping that “this time, it will be powerful enough to help with alignment”?
(VII) Related assumption: No lethal discontinuities:
The whole post suggests the workflow “new version V of AI-capabilities ==> capabilities people start working on V+1 & (simultaneously) alignment people use V for alignment research ==> alignment(V) gets used on V, or informs V+1”. (Like with GPT-3.) This requires the assumption that either you can hold off research on V+1 until alignment(V) is ready, or the assumption that deployed V will not kill you before you solve alignment(V). Which of the assumptions is being made here? I currently don’t see evidence for “ability to hold off on capabilities research”. What are the organizational procedures allowing this?
(VIII) [Point intentionally removed. I endorse the sentiment that treating these types of lists as complete is suicidal. In line with this, I initially wrote 7 points and then randomly deleted one. This is, obviously, in addition to all the points that I failed to come up with at all, or that I didn’t mention because I didn’t have enough original thoughts on them and it would seem too much like parroting MIRI. And in addition to the points that nobody came up with yet...]
(IX) Regarding “outer alignment ≠ alignment”: Other people solving the remaining issues. Or having warning shots & the ability to hold off capabilities research until OpenAI solves them:
It is good to at least acknowledge that there might be other parts of AI alignment than just “figuring out learning from human feedback (& human-feedback augmentation)”. However, even if this ingredient is necessary, the plan assumes that if it turns out not-sufficient, you will (a) notice and (b) have enough time to fix the issue.
(X) Ability to differentially use capabilities progress towards alignment progress:
The plan involves training AI assistants to help with alignment research. This seems to assume that either (i) the AI assistants will only be able to help with alignment research, or (ii) they will be general, but OpenAI can keep their use restricted to alignment research only, or (iii) they will be general and generally used, but somehow we will have enough time to do the alignment research anyway. Personally, I think all three of these assumptions are false --- (i) because it seems unlikely they won’t also be usable on capabilities research, (ii) based on track record so far, and (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
(XI) Creating an aligned AI is sufficient for getting AI to go well:
The plan doesn’t say anything about what to do with the hypothetical aligned AGI. Is the assumption that OpenAI can just release the seems-safe-so-far AGI through their API, $1 for 10,000 tokens, and we will all live happily ever after? Or is the plan to, uhm, offer it to all governments of the world for assistance in decision-making? Or something else inside the Overton window? If so, what exactly, and what is the theory of change for it? I think there could be many moral & responsible plans outside of the Overton window, just because public discourse these days tends to be tricky. Having a specific strategy like that seems fine and reasonable. But I am afraid there is simultaneously (a) the desire to stick to the Overton window strategies, (b) no theory of change for how this prevents misaligned AGI by other actors, or other failure modes, and (c) no “explicit assumptions & detection system & course-correction-procedure” for “nothing will go wrong if we just do (b)”.
General complaint: The plan is not a plan at all! It’s just a meta-plan.
(XII) Ultimately, I would paraphrase the plan-as-stated as: “We don’t know how to solve alignment. It seems hard. Let’s first build an AI to make us smarter, and then try again.” I think OpenAI should clarify whether this is literally true, or whether there is some idea of what the object-level AI alignment plan looks like, and if so, what it is.
(XIII) For example, the post mentions that “robustness and interpretability research [is important for the plan]”. However, this is not at all apparent from the plan. (This is acknowledged in the post, but that doesn’t make it any less of an issue!) This means that the plan is not detailed enough.
As an analogy, suppose you have a mathematical theorem that makes an assumption X. And then you look at the proof, and you can’t see the step that would fail if X was untrue. This doesn’t say anything good about your proof.
As far as I know, I came up with points (I), (III), and (XII) myself and I don’t remember reading those points before. On the other hand, (IV), (IX), and (XI) are (afaik) pretty much direct ripoffs of MIRI arguments. The status of the remaining 7 points is unclear. (I read most of MIRI’s publicly available content, and attended some MIRI-affiliated events pre-covid. And I think all of my alignment thinking is heavily MIRI-inspired. So the remaining points are probably inspired by something I read. Perhaps I would be able to derive 2-3 out of 7 if MIRI disappeared 6 years ago?)
(II) For example, consider the following claim: “We believe the best way to learn as much as possible about how to make AI-assisted evaluation work in practice is to build AI assistants.” My reaction: Yes, technically speaking this is true. But likewise (please excuse the jarring analogy), the best way to learn as much as possible about how to treat radiation exposure is to drop a nuclear bomb somewhere and then study the affected population. And yeees, if people are going to be dropping nuclear bombs, you might as well study the results. But wouldn’t it be even better if you personally didn’t plan to drop bombs on people? Maybe you could even try coordinating with other bomb-possessing people on not dropping them on people :-).
Apologies for the inconsistent numbering. I had to give footnote [2] number (II) to get to the nice round total of 13 points :-).
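> (iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.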
It could be the case that the current state of the world doesn’t put us on track to solve Alignment in time, but using AI assistants to increase the rate of Alignment : Capabilities work by some amount is sufficient.
The use of AI assistants for alignment : capabilities doesn’t have to track with the current rate of Alignment : Capabilities work. For instance, if the AI labs with the biggest lead are safety conscious, I expect the ratio of alignment : capabilities research they produce to be much higher (compared to now) right before AGI. See here.
Hm, I think you are right—as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment—will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:
To get a more realistic assumption, perhaps we would want to talk about (speedup) “how much are AI assistants able to speed up alignment vs capability” and (proliferation prevention) “how much can OpenAI prevent them from proliferating to capabilities research”.[1] And then the corresponding more realistic version of the claims would be that:
either (i’) AI assistants will fundamentally be able to speed up alignment much more than capabilities
or (ii’) the potential speedup ratios will be comparable, but OpenAI will be able to significantly restrict the proliferation of AI assistants for capabilities research
or (iii’) both the potential speedup ratios and the adoption rates of AI assistants for capabilities research will be comparable, but somehow we will have enough time to solve alignment anyway.
Comments:
Regarding (iii’): It seems that in the worlds where (iii’) holds, you could just as well solve alignment without developing AI assistants.
Regarding (i’): Personally I don’t buy this assumption. But you could argue for it on the grounds that perhaps alignment is just impossible to solve for unassisted humans. (Otherwise arguing for (i’) seems rather hard to me.)
Regarding (ii’): As before, this seems implausible based on the track record :-).
This implicitly assumes that if OpenAI develops the AI assistants technology and restricts proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.
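Makes sense. FWIW, based on Jan’s comments I think the main/only thing the OpenAI alignment team is aiming for here is (i), differentially speeding up alignment research. It doesn’t seem like Jan believes in this plan; personally I don’t believe in this plan.
> 4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we’ll see limited alignment research acceleration before ML research automation happens.
I don’t know how to link to the specific comment, but here somewhere. Also:
> We can focus on tasks differentially useful to alignment research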
Your pessimism about (iii) still seems a bit off to me. I agree that if you were coordinating well between all the actors then yeah, you could just hold off on AI assistants. But the actual decision the OpenAI alignment team is facing could be more like “use LLMs to help with alignment research or get left behind when ML research gets automated”. If facing such choices I might produce a plan like theirs, but notably I would be much more pessimistic about it. When the universe limits you to one option, you shouldn’t expect it to be particularly good. The option “everybody agrees to not build AI assistants and we can do alignment research first” is maybe not on the table, or at least it probably doesn’t feel like it is to the alignment team at OpenAI.
Oh, I think I agree—if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them.
But suppose they only work kind-of-poorly—and using them for alignment requires making progress on them (which will also be useful for capabilities), and you will not be able to keep those results internal. And that you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)
(And to be clear: I also strongly endorse writing up the alignment plan. Big thanks and kudos for that! The critical comments shouldn’t be viewed as negative judgement on the people involved :-).)