I very much agree with your top-level claim: analyzing different alignment targets well before we use them is a really good idea.
But I don’t think those are the right alignment targets to analyze. I think none of those are very likely to actually be deployed as alignment targets for the first real AGIs. I think that Instruction-following AGI is easier and more likely than value aligned AGI, or, roughly equivalently (and better-framed for the agent foundations crowd), that Corrigibility as Singular Target is far superior to anything else. I think it’s so superior that anyone sitting down and thinking about the topic, for instance just before launching something they viscerally believe might actually be able to learn and self-improve, will likely see it the same way.
On top of that logic, the people actually building the stuff would rather have it aligned to their goals than everyone’s.
I do think that it’s important to analyse alignment targets like these. Given the severe problems that all of these alignment targets suffer from, I certainly hope that you are right about them being unlikely. I certainly hope that nothing along the lines of a Group AI will ever be successfully implemented. But I do not think that it is safe to assume this. The successful implementation of an instruction-following AI would not remove the possibility that an AI Sovereign will be implemented later. The CEV Arbital page actually assumes that the path to a Group AI goes through an initial limited AI (referred to as a Task AI). In other words: the classical proposed path to an AI that implements the CEV of Humanity actually starts with an initial AI that is not an AI Sovereign (and such an AI could for example be the type of instruction-following AI that you mention). In yet other words: your proposed AI is not an alternative to a Group AI. Its successful implementation does not prevent the later implementation of a Group AI. Your proposed AI is in fact one step in the classical (and still fairly popular) proposed path to a Group AI.
I actually have two previous posts that were devoted to making the case for analysing the types of alignment targets that the present post is focusing on. The present post is instead focusing on doing such analysis. This previous post outlined a comprehensive argument in favour of analysing these types of alignment targets. Another previous post specifically focused on illustrating that Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure. See also this comment where I discuss the difference between proposing solutions on the one hand, and pointing out problems on the other hand.
Charbel-Raphaël responded to my post by arguing that no Sovereign AI should ever be created. My reply pointed out that this is mostly irrelevant to the question at hand. The only relevant question is whether or not a Sovereign AI might be successfully implemented eventually. If that is the case, then one can reduce the probability of some very bad outcomes by doing the type of Alignment Target Analysis that my previous two posts were arguing for (and that the present post is an example of). The second half of this reply (later in the same thread) includes a description of an additional scenario where an initial limited AI is followed by a Sovereign AI (and this Sovereign AI is implemented without significant time spent on analysing the specific proposal, due to Internal Time Pressure).
Regarding Corrigibility as a Singular Target:
I don’t think that one can rely on this idea to prevent the outcome where a dangerous Sovereign AI proposal is successfully implemented at some later time (for example after an initial AI has been used to buy time). One issue is the difficulty of defining critical concepts such as Explanation and Understanding. I previously discussed this with Max Harms here, and with Nathan Helm-Burger here. Both of those comments are discussing attempts to make an AI pursue Corrigibility as a Singular Target (which should not be confused with my post on Corrigibility, which discussed a different type of Corrigibility).
Regarding what the designers might want:
The people actually building the stuff might not be the ones deciding what should be built. For example: if a messy coalition of governments enforces a global AI pause, then this coalition might be able to decide what will eventually be built. If a coalition is capable of successfully enforcing a global AI pause, then I don’t think that we can rule out the possibility that they will be able to enforce a decision to build a specific type of AI Sovereign (they could for example do this as a second step, after first managing to gain effective control over an initial instruction-following AI). If that is the case, then the proposal to build something along the lines of a Group AI might very well be one of the politically feasible options (this was previously discussed in this post and in this comment).
I agree with essentially all of this. See my posts
If we solve alignment, do we die anyway? on AGI nonproliferation and government involvement
and
Intent alignment as a stepping-stone to value alignment on eventually building sovereign ASI using intent-aligned (IF or Harms-corrigible) AGI to help with alignment. Wentworth recently pointed out that idiot sycophantic AGI combined with idiotic/time-pressured humans might easily screw up that collaboration, and I’m afraid I agree. I hope we do it slowly and carefully, but not slowly enough to fall into the attractor of a vicious human getting the reins and keeping them forever.
The only thing I don’t agree with (AFAICT on a brief look—I’m rushed myself right now so LMK what else I’m missing if you like) is that we might have a pause. I see that as so unlikely as to not be worth time thinking about. I have yet to see any coherent argument for how we get one in time. If you know of such an argument, I’d love to see it!