I very much agree with your top-level claim: analyzing different alignment targets well before we use them is a really good idea.
But I don’t think those are the right alignment targets to analyze. I think none of those are very likely to actually be deployed as alignment targets for the first real AGIs. I think that Instruction-following AGI is easier and more likely than value aligned AGI, or roughly equivalently (and better-framed for the agent foundations crowd), Corrigibility as Singular Target is far superior to anything else. I think it’s so superior that anyone sitting down and thinking about the topic, for instance just before launching something they viscerally believe might actually be able to learn and self-improve, will likely see it the same way.
On top of that logic, the people actually building the stuff would rather have it aligned to their goals than everyone’s.
I do think that it’s important to analyse alignment targets like these. Given the severe problems that all of these alignment targets suffer from, I certainly hope that you are right about them being unlikely. I certainly hope that nothing along the lines of a Group AI will ever be successfully implemented. But I do not think that it is safe to assume this. The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. The CEV arbital page actually assumes that the path to a Group AI goes through an initial limited AI (referred to as a Task AI). In other words: the classical proposed path to an AI that implements the CEV of Humanity actually starts with an initial AI that is not an AI Sovereign (and such an AI could for example be the type of instruction following AI that you mention). In yet other words: your proposed AI is not an alternative to a Group AI. Its successful implementation does not prevent the later implementation of a Group AI. Your proposed AI is in fact one step in the classical (and still fairly popular) proposed path to a Group AI.
I actually have two previous posts devoted to making the case for analysing the types of alignment targets that the present post is focusing on. The present post instead focuses on doing such analysis. This previous post outlined a comprehensive argument in favour of analysing these types of alignment targets. Another previous post specifically focused on illustrating that Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure. See also this comment where I discuss the difference between proposing solutions on the one hand, and pointing out problems on the other hand.
Charbel-Raphaël responded to my post by arguing that no Sovereign AI should ever be created. My reply pointed out that this is mostly irrelevant to the question at hand. The only relevant question is whether or not a Sovereign AI might be successfully implemented eventually. If that is the case, then one can reduce the probability of some very bad outcomes by doing the type of Alignment Target Analysis that my previous two posts were arguing for (and that the present post is an example of). The second half of this reply (later in the same thread) includes a description of an additional scenario where an initial limited AI is followed by a Sovereign AI (and this Sovereign AI is implemented without significant time spent on analysing the specific proposal, due to Internal Time Pressure).
Regarding Corrigibility as a Singular Target:
I don’t think that one can rely on this idea to prevent the outcome where a dangerous Sovereign AI proposal is successfully implemented at some later time (for example after an initial AI has been used to buy time). One issue is the difficulty of defining critical concepts such as Explanation and Understanding. I previously discussed this with Max Harms here, and with Nathan Helm-Burger here. Both of those comments are discussing attempts to make an AI pursue Corrigibility as a Singular Target (which should not be confused with my post on Corrigibility, which discussed a different type of Corrigibility).
Regarding what the designers might want:
The people actually building the stuff might not be the ones deciding what should be built. For example: if a messy coalition of governments enforces a global AI pause, then this coalition might be able to decide what will eventually be built. If a coalition is capable of successfully enforcing a global AI pause, then I don’t think that we can rule out the possibility that they will be able to enforce a decision to build a specific type of AI Sovereign (they could for example do this as a second step, after first managing to gain effective control over an initial instruction following AI). If that is the case, then the proposal to build something along the lines of a Group AI might very well be one of the politically feasible options (this was previously discussed in this post and in this comment).
I agree with essentially all of this. See my posts
If we solve alignment, do we die anyway? on AGI nonproliferation and government involvement
and
Intent alignment as a stepping-stone to value alignment on eventually building sovereign ASI using intent-aligned (IF or Harms-corrigible) AGI to help with alignment. Wentworth recently pointed out that idiot sycophantic AGI combined with idiotic/time-pressured humans might easily screw up that collaboration, and I’m afraid I agree. I hope we do it slowly and carefully, but not slowly enough to fall into the attractor of a vicious human getting the reins and keeping them forever.
The only thing I don’t agree with (AFAICT on a brief look—I’m rushed myself right now so LMK what else I’m missing if you like) is that we might have a pause. I see that as so unlikely as to not be worth time thinking about. I have yet to see any coherent argument for how we get one in time. If you know of such an argument, I’d love to see it!
Given that you agreed with most of what I said in my reply, it seems like you should also agree that it is important to analyse these types of alignment targets. But in your original comment you said that you do not think that it is important to analyse these types of alignment targets.
Let’s write Multi Person Sovereign AI Proposal (MPSAIP) for an alignment target proposal to build an AI Sovereign that gets its goal from the global population (in other words: the type of alignment target proposals that I was analysing in the post). I followed your links and can only find one argument against the urgency of analysing MPSAIPs now: that an Instruction Following AI (IFAI) would make this unnecessary. I can see why one might expect that an IFAI would help to some degree when analysing MPSAIPs. But I don’t see how the idea of an IFAI could possibly remove the urgent need to analyse MPSAIPs now.
In your post on distinguishing value alignment from intent alignment, you define value alignment as being about all of humanity’s long term, implicit deep values. It thus seems like you are not talking about anything along the lines of building an AI that will do whatever some specific person wants that AI to do. Please correct me if I’m wrong, but your position thus seems to be built on top of the assumption that it would be safe to assume that an IFAI can be used to solve the problem of how to describe all of humanity’s long term, implicit deep values.
A brief summary of why I think that this is false: You simply cannot delegate the task of picking a goal to an AI (no matter how clever this AI is). You can define the goal indirectly and have the AI work out the details. But the task is simply not possible to delegate. For the same reason: you simply cannot delegate the task of picking an MPSAIP to an AI (no matter how clever this AI is). You can define things indirectly and have the AI work out the details. But doing so is equivalent to fully solving the field of MPSAIP analysis. It would for example necessarily involve defining some procedure for dealing with disagreements among individuals who disagree on how to deal with disagreements (because individuals will not agree on which MPSAIP to pick). PCEV is one such procedure. It sounds reasonable but would lead to an outcome far worse than extinction. VarAI is another procedure that sounds reasonable but that is in fact deeply problematic. As shown in the post, this is not easy (partly because intuitions about well known concepts tend to break when transferred to the AI context). In other words: you can’t count on an IFAI to notice a bad MPSAIP, for the same reason that you can’t count on Clippy to figure out that it has the wrong goal.
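To make the “sounds reasonable but breaks” failure mode concrete, here is a minimal toy sketch. It is not a model of PCEV or VarAI (their actual problems are subtler, and are described in the post and in the PCEV thought experiment); it just illustrates, with a made-up rule and made-up numbers, how an aggregation procedure that sounds reasonable at first glance can be swung by a tiny group with extreme stated preferences:

```python
# Toy illustration only: a made-up aggregation rule and made-up numbers,
# not a model of PCEV or VarAI. A rule that sounds reasonable ("pick the
# outcome with the highest total reported preference") can be dominated
# by a single person reporting an extreme preference magnitude.

def aggregate(preferences, outcomes):
    """Pick the outcome with the highest summed reported utility."""
    totals = {o: sum(p[o] for p in preferences) for o in outcomes}
    return max(totals, key=totals.get)

outcomes = ["benign_outcome", "outcome_most_people_find_horrific"]

# 999 people mildly prefer the benign outcome...
majority = [{"benign_outcome": 1, "outcome_most_people_find_horrific": -1}
            for _ in range(999)]
# ...while one person reports an extreme preference the other way.
extremist = [{"benign_outcome": 0, "outcome_most_people_find_horrific": 10_000}]

print(aggregate(majority + extremist, outcomes))
# -> "outcome_most_people_find_horrific"
```

The actual flaws in PCEV and VarAI are not this crude, but the structural point carries over: whether a given rule is acceptable is a fact about the rule itself, and someone has to check it. A cleverer AI running the rule does not check it for you.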
How useful would an IFAI be for analysing MPSAIPs?
I can see why one might think that an IFAI would be somewhat useful. But I don’t see how one can be confident that it would be very useful (let alone equivalent to a solution). And unless one is confident of that, the existence of an IFAI does not remove the need to analyse MPSAIPs now. (The idea that an IFAI might be counted on to buy sufficient time to analyse MPSAIPs is covered below, in the section where I answer your question about an AI pause.)
The idea that an IFAI would be extremely useful for Alignment Target Analysis seems to be very common. But there is never any actual reason given for why this might be true. In other words: while I have heard similar ideas many times, I have never been able to get any actual argument in favour of the position that an IFAI would be very useful for analysing MPSAIPs (from you or from anyone else). It is either implicit in some argument, or just flatly asserted. There seem to be two versions of this idea. One version is the delegation plan. In other words: the plan where one builds an IFAI that does know how to describe all of humanity’s long term, implicit deep values. The other version is the assistant plan. In other words: the plan where one builds an IFAI that does not know how to describe all of humanity’s long term, implicit deep values (and then uses that IFAI as an assistant while analysing MPSAIPs). I will cover them separately below.
The delegation plan: The scenario where an IFAI does know how to define all of humanity’s long term, implicit deep values
I don’t know how this plan could possibly remove the need for analysing MPSAIPs now. I don’t know why anyone would believe this (similarly to how I don’t know why anyone would believe that Clippy can be counted on to figure out that it has the wrong goal). It is clearly a common position. But as far as I am aware, there exists no positive argument for this position. Without any actual argument in favour of this position, it is a bit tricky to argue against this position. But I will do my best.
A preliminary point is that the task of picking one specific mapping, from billions of humans to an entity of the type that can be said to want things, is not a technical task with a findable solution (see the post for much more on this). In other words: if one were to actually describe in detail the argument that one can delegate the task of analysing MPSAIPs to an IFAI, then one would run into a logical problem (if one tried to spell out the details step by step, one would be unable to do so). It is the same problem one would run into if one were to try to argue that Clippy will figure out that it has the wrong goal. Neither finding the correct goal nor analysing MPSAIPs is a technical task with a findable solution. Thus, neither task can be delegated to an AI, no matter how clever it is.
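One way to see why this is not a technical task with a findable solution is to write down the type of the thing that has to be chosen. In the illustrative sketch below (the names are mine, purely for exposition, not from the post), the object being picked is the aggregation map itself, and many mutually incompatible candidates all satisfy the type:

```python
# Illustrative type sketch (names are mine, purely for exposition).
# The thing to be chosen is the mapping itself; many mutually
# incompatible mappings all satisfy this signature, and nothing
# technical singles one out as "the correct one".
from typing import Callable, Dict, List

Outcome = str
PersonPreferences = Dict[Outcome, float]     # one person's evaluations
GroupUtility = Callable[[Outcome], float]    # "an entity that wants things"
AggregationRule = Callable[[List[PersonPreferences]], GroupUtility]

def sum_rule(people: List[PersonPreferences]) -> GroupUtility:
    return lambda o: sum(p.get(o, 0.0) for p in people)

def median_rule(people: List[PersonPreferences]) -> GroupUtility:
    def utility(o: Outcome) -> float:
        values = sorted(p.get(o, 0.0) for p in people)
        return values[len(values) // 2]
    return utility

# Both type-check, and they can disagree arbitrarily about which outcome
# is best. Handing the list of people to an AI does not tell it which
# rule (out of these two, or infinitely many others) it should apply.
candidate_rules: List[AggregationRule] = [sum_rule, median_rule]
```

Asking an AI to “work out the details” presupposes that some rule like this has already been specified, at least indirectly, and specifying it is exactly the work that MPSAIP analysis is about.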
Let’s say that we have an IFAI that is able to give an answer, when you ask it how to describe all of humanity’s long term, implicit deep values. This is equivalent to the IFAI having already picked a specific MPSAIP.
I see only two ways of arriving at such an IFAI. One is that something has gone wrong, and the IFAI has settled on an answer by following some process that the designers did not intend it to follow. This is a catastrophic implementation failure. In other words: unless the plan was for the IFAI to choose an MPSAIP using some unknown procedure, the project has not gone according to plan. In this case I see no particular reason to think that the outcome would be any better than the horrors implied by PCEV.
The only other option that I see is that the designers have already fully solved the problem of how to define all of humanity’s long term, implicit deep values (presumably indirectly, by defining a process that leads to such a definition). In other words: if one plans to build an IFAI like this, then one has to fully solve the entire field of analysing MPSAIPs, before one builds the IFAI. In yet other words: if this is the plan, then this plan is an argument in favour of the urgent need to analyse MPSAIPs.
The assistant plan: The scenario where an IFAI does not know how to define all of humanity’s long term, implicit deep values
To conclude that analysing MPSAIPs now is not urgent, one must assume that this type of IFAI assistant is guaranteed to have a very dramatic positive effect (a somewhat useful IFAI assistant would not remove the urgent need for analysing MPSAIPs now). It seems to be common to simply assume that an IFAI assistant will basically render prior work on analysing MPSAIPs redundant (the terminology differs, and the assumption is often only implicit in some argument or plan, but it is common). I have, however, never seen any detailed plan for how this would actually be done. (The situation is similar to how the delegation plan is never actually spelled out.) I think that as soon as one were to lay out the details of how this would work, one would realise that the plan is built on top of an incorrect assumption (similar to the type of incorrect implicit assumption that one would find if one were to spell out the details of why exactly Clippy can be counted on to realise that it has the wrong goal).
It is difficult to argue against this position directly, since I don’t know how this IFAI is supposed to be used (let alone why this would be guaranteed to have a large positive effect). But I will try to at least point to some difficulties that one would run into.
Let’s say that Allan is asking the IFAI questions, as a step in the process of analysing MPSAIPs. Every question Allan asks of an IFAI like this would pose a very dramatic risk. Allan is leaning heavily on a set of definitions, for example definitions of concepts like Explanation and Understanding. Even if those definitions have held up while the IFAI was used to do other things (such as shutting down competing AI projects), those definitions could easily break when discussing MPSAIPs. Since the IFAI does not know what a bad MPSAIP is, the IFAI has no way of noticing that it is steering Allan towards a catastrophically bad MPSAIP. Regardless of how clever the IFAI is, there is simply no chance of it noticing this. Just as there is no chance of Clippy discovering that it has the wrong goal.
In other words: if a definition of Explanation breaks during a discussion with an IFAI, and Allan ends up convinced that he must implement PCEV, then we will end up with the horrors implied by PCEV. (If you think that the IFAI will recognise the outcome implied by PCEV as a bad outcome, then you are imagining the type of IFAI that was treated in the previous subsection (and such an IFAI can only be built after the field of analysing MPSAIPs has been fully solved)). This was previously discussed here and here (using different terminology).
(To be clear: this subsection is not arguing against the plan of building an IFAI of this type. And it is not arguing against the idea that this type of IFAI might be somewhat useful. It is not even arguing against the idea that it might be possible to use an IFAI like this in a way that dramatically increases the ability to analyse MPSAIPs. It is simply arguing against the idea that one can be sure that an IFAI like this will in fact be used in a way that will dramatically increase the ability to analyse MPSAIPs. This is enough to show that the IFAI idea does not remove the urgent need to analyse MPSAIPs now).
Regarding the probability of a pause
The probability of a politically enforced pause is not important for any argument that I am trying to make. Not much changes if we replace a politically enforced pause with an IFAI. Some group of humans will still decide what type of Sovereign AI will eventually be built. If they successfully implement a bad Sovereign AI proposal, then the outcome could be massively worse than extinction. So it makes sense to reduce the probability of that. One tractable way of reducing this probability is by analysing MPSAIPs.
In other words: if you achieve a pause by doing something other than building an AI Sovereign (for example by implementing a politically enforced pause, or by using an IFAI), then the decision of what AI Sovereign to eventually build will remain in human hands. So you will still need progress on analysing MPSAIPs to avoid bad Sovereign AI proposals. There is no way of knowing how long it will take to achieve the needed level of such progress, and there is no way of knowing how much time a pause will actually buy. So even if we knew exactly what method will be used to shut down competing projects, and exactly who will make decisions regarding Sovereign AI, there would still be no way of knowing that there will be sufficient time to analyse MPSAIPs. Therefore, such analysis should start now. (And as illustrated by my post, such progress is tractable.)
One point that should be made here, is that you can end up with a multipolar world even if there is a single IFAI that flawlessly shuts down all unauthorised AI projects. If a single IFAI is under the control of some set of existing political power structures, then this would be a multipolar world. Regardless of who is in control (for example the UN Security Council (UNSC), the UN general assembly, or some other formalisation of global power structures), it is still possible for some ordinary political movement to gain power over the IFAI, by ordinary political means. Elected governments can be voted out. Governments along the lines of the USSR can evidently also be brought down by ordinary forms of political movements. So there is in general nothing strange about someone being in control of an IFAI, but finding themselves in a situation where they must either act quickly and decisively, or risk permanently losing control to people with different values. This means that shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure.
Let’s consider the scenario where a UNSC resolution is needed to ask the IFAI a question, or to ask the IFAI to do something (such as shutting down competing AI projects, or launching an AI Sovereign). There is currently agreement on which AI Sovereign to build, but also agreement that it would be good to first analyse this proposal a bit more, to make sure there is no hidden problem with it. In this case, losing control of any of the five countries with a veto would remove the ability to launch an AI Sovereign (if control is lost to a novel and growing political movement, then control could be lost permanently: losing one permanent UNSC member could mean that a deadlock persists until the new movement eventually controls all five). So the people currently in control would basically have to either act quickly or risk permanently losing power to people with different values. If they decide to aim at their preferred MPSAIP, then it would be very nice if the field of analysing MPSAIPs had progressed to the point where it is possible to notice that this MPSAIP implies an outcome worse than extinction (for example along the lines of the outcome implied by PCEV, but presumably due to a harder-to-notice problem).
I used the UNSC as an example in the preceding paragraph because it seems to me like the only legal way of taking the actions that would be necessary to robustly shut down all competing AI projects (being the only legal option, and thus a sort of default option, makes it difficult to rule out this scenario). But the same type of Internal Time Pressure might also arise in other arrangements. This comment outlines a scenario where a global electorate is in charge (which seems like another reasonable candidate for how to define what it means to do the default thing). This post outlines a scenario where a group of augmented humans is in charge (in that scenario, buying time is achieved by uploading rather than by shutting down competing AI projects; this seems like something someone might do if they don’t feel comfortable using force, but simultaneously don’t feel ready to give up control to some specific political process).
The reason that I keep going on about the need for Alignment Target Analysis (ATA) is that there currently seem to be exactly zero people in the world devoted to doing ATA full time. Making enough ATA progress to reduce the probability of bad outcomes is also tractable (trying to solve ATA would be a completely different thing, but there is still a lot of low-hanging fruit in terms of ATA progress that reduces the probability of bad outcomes). It thus seems entirely possible to me that we will end up with a PCEV-style catastrophe that could have been easily prevented. Reducing the probability of that seems like a reasonable thing to do. But it is not being done.
An attempt to summarise how I view the situation
At our current level of ATA progress it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction. I simply don’t see how one can think that it is safe to stay at this level of progress. Intuitively this seems like a dangerous situation. The fact that there exists no research project dedicated to improving this situation seems like a mistake (as illustrated by my post, reducing the probability of bad outcomes is a tractable research project). It seems like many people do have some reason for thinking that the current state of affairs is acceptable. As far as I can tell however, these reasons are not made public. This is why I think that it makes sense to spend time on trying to figure out what you believe to be true, and why you believe it to be true (and this is also why I appreciate you engaging on this).
In other words: arguing that ATA should be a small percentage of AI safety work would be one type of argument. Arguing that the current situation is reasonable would be a fundamentally different type of argument. It is clearly the case that plenty of people are convinced that it is reasonable to stay at the current level of ATA progress (in other words: many people are acting in a way that I can only explain if I assume that they feel very confident that it is safe to stay at our current level of ATA progress). I think that they are wrong about this. But since no argument in favour of this position is ever outlined in detail, there is no real way of addressing this directly.
PS:
I’m fine with continuing this discussion here. But it probably makes sense to at least note that it would have fitted better under this post (which makes the case for analysing these types of alignment targets, and actually discusses the specific topic of why various types of Non-Sovereign-AIs would not replace doing this now). As a tangent, the end of that post explicitly asked people to outline their reasons for thinking that ATA is not needed now. Your response here seems to be an example of this, so I very much appreciate your engagement. In other words: I don’t think you are the only one who has ideas along these lines. I think that there are plenty of people with similar ways of looking at things. And I really wish that those people would clearly outline their reasons for thinking that the current situation is reasonable, because I think that those reasons will fall apart if they are outlined in any actual detail. So I really appreciate that you are engaging on this. And I really wish that more people would do the same.
Thanks! I don’t have time to process this all right now, so I’m just noting that I do want to come back to it quickly and engage fully.
Here’s my position in brief: I think analyzing alignment targets is valuable. Where my current take differs from yours (I think) is that I think that effort would be best spent analyzing what you term corrigibility in the linked post (I got partway through and will have to come back to it), and I’ve called instruction-following.
I think that’s far more important to do first, because that’s approximately what people are aiming for right now. I fully agree that there are other values mixed in with the training other than instruction-following. I think the complexity and impurity of that target makes it more urgent, not less, to have good minds analyzing the alignment targets that developers are most likely to pursue first by default. See my recent post Seven sources of goals in LLM agents. This is my main research focus, but I know of no one else focusing on this, and few people who even give it part-time attention. This seems like a bad allocation of resources; there might be major flaws in the alignment target that we don’t find until developers are far along that path and reluctant to rework it.
You said
Please correct me if I’m wrong, but your position thus seems to be built on top of the assumption that it would be safe to assume that an IFAI can be used to solve the problem of how to describe all of humanity’s long term, implicit deep values.
I definitely do not think it would be safe to assume that IF/corrigible AGI can solve value alignment for other/stronger AGI. John Wentworth’s The Case Against AI Control Research has a pretty compelling argument for how we’d collaborate with sycophantic parahuman AI/AGI to screw up aligning the next step in AGI/ASI.
I do not think any of this is safe. I think we are long past the point where we should be looking for perfectly reliable solutions. I strongly believe we must look for the best viable solutions, factoring in the practicality/likelihood of getting them actually implemented. I worry that the alignment community’s desire for the best (let alone “provably safe”) alignment solutions will prevent them from working on solutions that give us the best possible shot within the economic, psychological, and political realities governing the creation of AGI.
So we seem to be in agreement that the current alignment target of instruction-following/corrigibility should really get more analysis. I am working on that. I’ll just toss out one difficulty I’m reckoning with, which no one has (at least explicitly) recognized AFAIK: IF AGI must be oriented to prioritize taking and following new instructions over old ones (otherwise it won’t listen for new instructions, since that would risk not achieving all the goals from past instructions). With that as first priority, it would avoid being shut down, since that would prevent it from hearing further instructions.
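A minimal toy sketch of that tension (my own illustrative framing and numbers, not anyone’s actual agent design): if staying able to receive and follow new instructions sits at the top of the priority ordering, then complying with a shutdown instruction, which ends the instruction stream, never scores highest.

```python
# Toy framing of the difficulty described above (illustrative numbers only).
# Priority 1: remain able to receive and follow *new* instructions.
# Priority 2: carry out the instruction already given.

def score(action, current_instruction):
    keeps_listening = 0.0 if action == "shut_down" else 1.0
    follows_current = 1.0 if action == current_instruction else 0.0
    # The new-instruction priority is weighted to dominate the old one.
    return 10.0 * keeps_listening + 1.0 * follows_current

def choose(actions, current_instruction):
    return max(actions, key=lambda a: score(a, current_instruction))

actions = ["shut_down", "keep_working", "idle_and_listen"]
print(choose(actions, current_instruction="shut_down"))
# -> "keep_working": even when the current instruction *is* "shut down",
# any action that keeps the agent listening outscores compliance.
```

(The point of the sketch is only that the priority ordering, not the agent’s intelligence, is what produces the shutdown-avoidance.)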
This is the sort of alignment target analysis that thinkers like you could help with. I wish we just had far more people thinking about this. Given the sharp limitations, prioritizing IF/corrigibility (and the messy mix of moralistic behavior devs are training for) still seems higher priority than value alignment targets that we perhaps “should” use but in fact almost certainly will not.
Much more in a few days after I publish my next piece on the complexities of instruction-following mixed with other implicit/trained goals, and fully process your pieces! Thanks for engaging. I appreciate your efforts in this direction, whether or not you decide to analyze intent alignment targets other than value alignment targets.
It seems to me that we are going in circles and talking past each other to some degree in the discussion above. So I will just briefly summarise my position on the main topics that you raise (I’ve argued for these positions above. Here I’m just summarising). And then I will give a short outline of the argument for analysing Sovereign AI proposals now.
Regarding the relative priority of different research efforts:
The type of analysis that I am doing in the post is designed to reduce one of the serious AI risks that we face. This risk is due to a combination of the fact that (i): we might end up with a successfully implemented Sovereign AI proposal that has not been analysed properly, and the fact that (ii): the successful implementation of a reasonable sounding Sovereign AI proposal might lead to a massively worse than extinction outcome. In other words: reducing the risk of a massively worse than extinction outcome is a tractable research project (specifically: this risk can be reduced by analysing the types of alignment targets that the post is analysing). This research project is currently not being pursued. Other efforts are needed to reduce other types of risks. And it is certainly possible for reasonable people to disagree substantially on how attention would best be allocated. But it still seems very clear to me that the current situation is a serious mistake.
I don’t actually know what the optimal allocation of attention would be. But I have been in contact with a lot of people during the last few years, and I have never gotten any form of pushback when I say that there currently exist exactly zero people in the world dedicated to the type of analysis that I am talking about. So whatever the optimal ratio is, I am confident that the type of analysis that I am advocating for deserves more attention. (It might of course be perfectly reasonable for a given AI safety researcher to decide not to personally pursue this type of analysis. But I am confident that the overall situation is not reasonable. It simply cannot be reasonable to have zero people dedicated to a tractable research project that reduces the probability of a massively worse than extinction outcome.)
Regarding the type of Instruction Following AGI (IFAGI) that you mention:
The successful implementation of such an IFAGI would not reliably prevent a Sovereign AI proposal from being successfully implemented later. And this Sovereign AI proposal might be implemented before it has been properly analysed. Which means that the IFAGI idea does not remove the need for the type of risk-mitigation focused research project that the post is an example of. In other words: Such an IFAGI might not result in a lot of time to analyse Sovereign AI proposals. And such an IFAGI might not be a lot of help when analysing Sovereign AI proposals. So even if we assume that an IFAGI will be successfully implemented, then this would still not remove the need for the type of analysis that I am talking about. (Conditioned on such an IFAGI being successfully implemented, we might get a lot of time. And we might get a lot of help with analysis. But we might also end up in a situation where we do not have much time, and where the IFAGI does not dramatically increase our ability to analyse Sovereign AI proposals)
Regarding perfect solutions and provably safe AI:
I am not trying to do anything along the lines of proving safety. What I am trying to do is better described as trying to prove un-safety. I look at some specific proposed AI project plan. (For example an AI project plan along the lines of: first humans are augmented. Then those augmented humans build some form of non-Sovereign AI. And then they use that non-Sovereign AI to build an AI Sovereign that implements the CEV of Humanity). And then I explain why the success of this project would be worse than extinction (in expectation, from the perspective of a human individual, for the reasons outlined in the post). So I am in some sense looking for definitive answers. But more along the lines of provable catastrophe than provable safety. What I am trying to do is a bit like attempting to conclusively determine that a specific location contains a landmine (where a specific AI project plan being successfully implemented is analogous to a plan that ends with someone standing on the location of a specific landmine). It is very different from attempting to conclusively determine that a specific path is safe. (Just wanted to make sure that this is clear).
A very brief outline of the argument for analysing Sovereign AI proposals now:
Claim 1: We might end up with a successfully implemented AI Sovereign. Even if the first clever thing created is not an AI Sovereign, an AI Sovereign might be developed later. Augmented humans, non-Sovereign AIs, etc., might be followed by an AI Sovereign. (See for example the proposed path to an AI Sovereign described on the CEV arbital page).
Claim 2: In some scenarios that end in a successfully implemented AI Sovereign, we will not get a lot of time to analyse Sovereign AI proposals. (For example due to Internal Time Pressure. See also this subsection for an explanation of why shutting down competing AI projects might not buy a lot of time. See also the last section of this comment, which outlines one specific scenario where a tool-AI successfully shuts down all unauthorised AI projects, but does not buy a lot of time).
Claim 3: A reasonable sounding Sovereign AI proposal might lead to a massively worse than extinction outcome. (See for example the PCEV thought experiment).
Claim 4: Noticing such issues is not guaranteed. (For example illustrated by the fact that the problem with PCEV went unnoticed for many years).
Claim 5: Reducing the probability of such outcomes is possible. Reducing this probability is a tractable research project, because risk can be reduced without finding any good Sovereign AI proposals. (For example illustrated by the present post, or the PCEV thought experiment).
Claim 6: There exist exactly zero people in the world dedicated to this tractable way of reducing the probability of a massively worse than extinction outcome. (It is difficult to prove the non-existence of something. But I have been saying this for quite a while now, while talking to a lot of different people. And I have never gotten any form of pushback on this).
Conclusion: We might end up in a worse than extinction outcome, because a successfully implemented Sovereign AI proposal has a flaw that was realistically findable. It would make sense to spend a non-tiny amount of effort on reducing the probability of this.
(People whose intuition says that this conclusion must surely be false in some way, could try to check whether or not this intuition is actually based on anything real. The most straightforward way would be to spell out the actual argument for this in public, so that the underlying logic can be checked. Acting based on the assumption that such an intuition is based on anything real, without at least trying to evaluate it first, does not sound like a good idea)
I very much agree with your top-level claim: analyzing different alignment targets well before we use them is a really good idea.
But I don’t think those are the right alignment targets to analyze. I think none of those are very likely to actually be deployed as alignment targets for the first real AGIs. I think that Instruction-following AGI is easier and more likely than value aligned AGI Ω or roughly equivalently (and better-framed for the agent foundations crowd), Corrigibility as Singular Target is far superior to anything else. I think it’s so superior that anyone sitting down and thinking about the topic, for instance just before launching something they viscerally believe might actually be able to learn and self-improve, will likely see it the same way.
On top of that logic, the people actually building the stuff would rather have it aligned to their goals than everyones.
I do think that it’s important to analyse alignment targets like these. Given the severe problems that all of these alignment targets suffer from, I certainly hope that you are right about them being unlikely. I certainly hope that nothing along the lines of a Group AI will ever be successfully implemented. But I do not think that it is safe to assume this. The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. The CEV arbital page actually assumes that the path to a Group AI goes through an initial limited AI (referred to as a Task AI). In other words: the classical proposed path to an AI that implements the CEV of Humanity actually starts with an initial AI that is not an AI Sovereign (and such an AI could for example be the type of instruction following AI that you mention). In yet other words: your proposed AI is not an alternative to a Group AI. Its successful implementation does not prevent the later implementation of a Group AI. Your proposed AI is in fact one step in the classical (and still fairly popular) proposed path to a Group AI.
I actually have two previous posts that were devoted to making the case for analysing the types of alignment targets that the present post is focusing on. The present post is instead focusing on doing such analysis. This previous post outlined a comprehensive argument in favour of analysing these types of alignment targets. Another previous post specifically focused on illustrating that Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure. See also this comment where I discuss the difference between proposing solutions on the one hand, and pointing out problems on the other hand.
Charbel-Raphaël responded to my post by arguing that no Sovereign AI should ever be created. My reply pointed out that this is mostly irrelevant to the question at hand. The only relevant question is whether or not a Sovereign AI might be successfully implemented eventually. If that is the case, then one can reduce the probability of some very bad outcomes by doing the type of Alignment Target Analysis that my previous two posts were arguing for (and that the present post is an example of). The second half of this reply (later in the same thread) includes a description of an additional scenario where an initial limited AI is followed by a Sovereign AI (and this Sovereign AI is implemented without significant time spent on analysing the specific proposal, due to Internal Time Pressure).
Regarding Corrigibility as a Singular Target:
I don’t think that one can rely on this idea to prevent the outcome where a dangerous Sovereign AI proposal is successfully implemented at some later time (for example after an initial AI has been used to buy time). One issue is the difficulty of defining critical concepts such as Explanation and Understanding. I previously discussed this with Max Harms here, and with Nathan Helm-Burger here. Both of those comments are discussing attempts to make an AI pursue Corrigibility as a Singular Target (which should not be confused with my post on Corrigibility, which discussed a different type of Corrigibility).
Regarding what the designers might want:
The people actually building the stuff might not be the ones deciding what should be built. For example: if a messy coalition of governments enforces a global AI pause, then this coalition might be able to decide what will eventually be built. If a coalition is capable of successfully enforcing a global AI pause, then I don’t think that we can rule out the possibility that they will be able to enforce a decision to build a specific type of AI Sovereign (they could for example do this as a second step, after first managing to gain effective control over an initial instruction following AI). If that is the case, then the proposal to build something along the lines of a Group AI might very well be one of the politically feasible options (this was previously discussed in this post and in this comment).
I agree with essentially all of this. See my posts
If we solve alignment, do we die anyway? on AGI nonproliferation and government involvement
and
Intent alignment as a stepping-stone to value alignment on eventually building sovereign ASI using intent-aligned (IF or Harms-corrigible) AGI to help with alignment. Wentworth recently pointed out that idiot sycophantic AGI combined with idiotic/time-pressured humans might easily screw up that collaboration, and I’m afraid I agree. I hope we do it slowly and carefully, but not slowly enough to fall into the attractor of a vicious human getting the reigns and keeping them forever.
The only thing I don’t agree with (AFAICT on a brief look—I’m rushed myself right now so LMK what else I’m missing if you like) is that we might have a pause. I see that as so unlikely as to not be worth time thinking about. I have yet to see any coherent argument for how we get one in time. If you know of such an argument, I’d love to see it!
Given that you agreed with most of what I said in my reply, it seems like you should also agree that it is important to analyse these types of alignment targets. But in your original comment you said that you do not think that it is important to analyse these types of alignment targets.
Let’s write Multi Person Sovereign AI Proposal (MPSAIP) for an alignment target proposal to build an AI Sovereign that gets its goal from the global population (in other words: the type of alignment target proposals that I was analysing in the post). I followed your links and can only find one argument against the urgency of analysing MPSAIPs now: that an Instruction Following AI (IFAI) would make this unnecessary. I can see why one might expect that an IFAI would help to some degree when analysing MPSAIPs. But I don’t see how the idea of an IFAI could possibly remove the urgent need to analyse MPSAIPs now.
In your post on distinguishing value alignment from intent alignment, you define value alignment as being about all of humanity’s long term, implicit deep values. It thus seems like you are not talking about anything along the lines of building an AI that will do whatever some specific person wants that AI to do. Please correct me if I’m wrong, but your position thus seems to be built on top of the assumption that it would be safe to assume that an IFAI can be used to solve the problem of how to describe all of humanity’s long term, implicit deep values.
A brief summary of why I think that this is false: You simply cannot delegate the task of picking a goal to an AI (no matter how clever this AI is). You can define the goal indirectly and have the AI work out the details. But the task is simply not possible to delegate. For the same reason: you simply cannot delegate the task of picking a MPSAIP to an AI (no matter how clever this AI is). You can define things indirectly and have the AI work out the details. This is equivalent to fully solving the field of MPSAIP analysis. It would for example necessarily involve defining some procedure for dealing with disagreements amongst individuals that disagree on how to deal with disagreements (because individuals will not agree on which MPSAIP to pick). PCEV is one such procedure. It sounds reasonable but would lead to an outcome far worse than extinction. VarAI is another procedure that sounds reasonable but that is in fact deeply problematic. As shown in the post, this is not easy (partly because intuitions about well known concepts tend to break when transferred to the AI context). In other words: you can’t count on an IFAI to notice a bad MPSAIP, for the same reason that you can’t count on Clippy to figure out that it has the wrong goal.
How useful would an IFAI be for analysing MPSAIPs?
I can see why one might think that an IFAI would be somewhat useful. But I don’t see how one can be confident that it would be very useful (let alone be equivalent to a solution). If one does not hold this position, then the existence of an IFAI does not remove the need to analyse MPSAIPs now. (The idea that an IFAI might be counted on to buy sufficient time to analyse MPSAIPs is covered below, in the section where I answer your question about an AI pause).
The idea that an IFAI would be extremely useful for Alignment Target Analysis seems to be very common. But there is never any actual reason given for why this might be true. In other words: while I have heard similar ideas many times, I have never been able to get any actual argument in favour of the position, that an IFAI would be very useful for analysing MPSAIPs (by you or by anyone else). It is either implicit in some argument, or just flatly asserted. There seems to be two versions of this idea. One version is the delegation plan. In other words: the plan where one builds an IFAI that does know how to describe all of humanity’s long term, implicit deep values. The other version is the assistant plan. In other words: the plan where one builds an IFAI that does not know how to describe all of humanity’s long term, implicit deep values (and then uses that IFAI as an assistant while analysing MPSAIPs). I will cover them separately below.
The delegation plan: The scenario where an IFAI does know how to define all of humanity’s long term, implicit deep values
I don’t know how this plan could possibly remove the need for analysing MPSAIPs now. I don’t know why anyone would believe this (similarly to how I don’t know why anyone would believe that Clippy can be counted on to figure out that it has the wrong goal). It is clearly a common position. But as far as I am aware, there exists no positive argument for this position. Without any actual argument in favour of this position, it is a bit tricky to argue against this position. But I will do my best.
A preliminary point is that the task of picking one specific mapping, that maps from billions of humans to an entity of the type that can be said to want things, is not a technical task with a findable solution (see the post for much more on this). In yet other words: if one were to actually describe in detail the argument that one can delegate the task of analysing MPSAIPs to an IFAI, then one would run into a logical problem (if one tried to actually spell out the details step by step, one would be unable to do so). The problem one would run into, would be the same problem that one would run into if one were to try to argue that Clippy will figure out that it has the wrong goal (if one tried to actually spell out the details step by step, one would be unable to do so). Neither finding the correct goal nor analysing MPSAIPs is a technical task with a findable solution. Thus, neither task can be delegated to an AI, no matter how clever it is.
Let’s say that we have an IFAI that is able to give an answer, when you ask it how to describe all of humanity’s long term, implicit deep values. This is equivalent to the IFAI having already picked a specific MPSAIP.
I see only two ways of arriving at such an IFAI. One is that something has gone wrong, and the IFAI has settled on an answer by following some process that the designers did not intend it to follow. This is a catastrophic implementation failure. In other words: unless the plan was for the IFAI to choose an MPSAIP using some unknown procedure, the project has not gone according to plan. In this case I see no particular reason to think that the outcome would be any better than the horrors implied by PCEV.
The only other option that I see is that the designers have already fully solved the problem of how to define all of humanity’s long term, implicit deep values (presumably indirectly, by defining a process that leads to such a definition). In other words: if one plans to build an IFAI like this, then one has to fully solve the entire field of analysing MPSAIPs, before one builds the IFAI. In yet other words: if this is the plan, then this plan is an argument in favour of the urgent need to analyse MPSAIPs.
The assistant plan: The scenario where an IFAI does not know how to define all of humanity’s long term, implicit deep values
To conclude that analysing MPSAIPs now is not urgent, one must assume that this type of IFAI assistant is guaranteed to have a very dramatic positive effect (a somewhat useful IFAI assistant would not remove the urgent need for analysing MPSAIPs now). It seems to be common to simply assume that an IFAI assistant will basically render prior work on analysing MPSAIPs redundant (the terminology differs. And it is often only implicit in some argument or plan. But the assumption is common). I have however never seen any detailed plan for how this would actually be done. (The situation is similar to how the delegation plan is never actually spelled out). I think that as soon as one were to lay out the details of how this would work, one would realise that one has a plan that is built on top of an incorrect assumption (similar to the type of incorrect implicit assumption that one would find, if one were to spell out the details of why exactly Clippy can be counted on to realise that it has the wrong goal).
It is difficult to argue against this position directly, since I don’t know how this IFAI is supposed to be used (let alone why this would be guaranteed to have a large positive effect). But I will try to at least point to some difficulties that one would run into.
Let’s say that Allan is asking the IFAI questions, as a step in the process of analysing MPSAIPs. Every question Allan asks of an IFAI like this would pose a very dramatic risk. Allan is leaning heavily on a set of definitions, for example definitions of concepts like Explanation and Understanding. Even if those definitions have held up while the IFAI was used to do other things (such as shutting down competing AI projects), those definitions could easily break when discussing MPSAIPs. Since the IFAI does not know what a bad MPSAIP is, the IFAI has no way of noticing that it is steering Allan towards a catastrophically bad MPSAIP. Regardless of how clever the IFAI is, there is simply no chance of it noticing this. Just as there is no chance of Clippy discovering that it has the wrong goal.
In other words: if a definition of Explanation breaks during a discussion with an IFAI, and Allan ends up convinced that he must implement PCEV, then we will end up with the horrors implied by PCEV. (If you think that the IFAI will recognise the outcome implied by PCEV as a bad outcome, then you are imagining the type of IFAI that was treated in the previous subsection (and such an IFAI can only be built after the field of analysing MPSAIPs have been fully solved)). This was previously discussed here and here (using different terminology).
(To be clear: this subsection is not arguing against the plan of building an IFAI of this type. And it is not arguing against the idea that this type of IFAI might be somewhat useful. It is not even arguing against the idea that it might be possible to use an IFAI like this in a way that dramatically increases the ability to analyse MPSAIPs. It is simply arguing against the idea that one can be sure that an IFAI like this will in fact be used in a way that will dramatically increase the ability to analyse MPSAIPs. This is enough to show that the IFAI idea does not remove the urgent need to analyse MPSAIPs now).
Regarding the probability of a pause
The probability of a politically enforced pause is not important for any argument that I am trying to make. Not much changes if we replace a politically enforced pause with an IFAI. Some group of humans will still decide what type of Sovereign AI will eventually be built. If they successfully implement a bad Sovereign AI proposal, then the outcome could be massively worse than extinction. So it makes sense to reduce the probability of that. One tractable way of reducing this probability is by analysing MPSAIPs.
In other words: if you achieve a pause by doing something other than building an AI Sovereign (for example by implementing a politically enforced pause, or by using an IFAI). Then the decision of what AI Sovereign to eventually build will remain in human hands. So then you will still need progress on analysing MPSAIPs to avoid bad Sovereign AI proposals. There is no way of knowing how long it will take to achieve the needed level of such progress. And there is no way of knowing how much time a pause will actually result in. So, even if we did know exactly what method will be used to shut down competing projects. And we also knew exactly who will make decisions regarding Sovereign AI. Then there is still no way of knowing that there will be sufficient time to analyse MPSAIPs. Therefore, such analysis should start now. (And as illustrated by my post, such progress is tractable).
One point that should be made here, is that you can end up with a multipolar world even if there is a single IFAI that flawlessly shuts down all unauthorised AI projects. If a single IFAI is under the control of some set of existing political power structures, then this would be a multipolar world. Regardless of who is in control (for example the UN Security Council (UNSC), the UN general assembly, or some other formalisation of global power structures), it is still possible for some ordinary political movement to gain power over the IFAI, by ordinary political means. Elected governments can be voted out. Governments along the lines of the USSR can evidently also be brought down by ordinary forms of political movements. So there is in general nothing strange about someone being in control of an IFAI, but finding themselves in a situation where they must either act quickly and decisively, or risk permanently losing control to people with different values. This means that shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure.
Let’s consider the scenario where a UNSC resolution is needed to ask the IFAI a question, or to ask the IFAI to do something (such as shutting down competing AI projects, or launching an AI Sovereign). There is currently an agreement of what AI Sovereign to build. But there is also an agreement that it would be good to first analyse this proposal a bit more, to make sure there is no hidden problem with it. In this case, losing control of any of the five countries with a veto would remove the ability to launch an AI Sovereign (if control is lost to a novel and growing political movement, then control could be lost permanently. The result of losing control of one permanent UNSC member could mean that a deadlock will persist until the new movement eventually controls all five). So, the people currently in control would basically have to either act quickly or risk permanently losing power to people with different values. If they decide to aim at their preferred MPSAIP, then it would be very nice if the field of analysing MPSAIPs had progressed to the point where it is possible to notice that this MPSAIP implies an outcome worse than extinction (for example along the lines of the outcome implied by PCEV. But presumably due to a harder-to-notice problem).
I used the UNSC as an example in the preceding paragraph, because it seems to me like the only legal way of taking the actions that would be necessary to robustly shut down all competing AI projects (being the only legal option, and thus a sort of default option, makes it difficult to rule out this scenario). But the same type of Internal Time Pressure might also arise in other arrangements. This comment outlines a scenario where a global electorate is in charge (which seems like another reasonable candidate for how to define what it means to do the default thing). This post outlines a scenario where a group of augmented humans are in charge (in that scenario buying time is achieved by uploading. Not by shutting down competing AI projects. This seems like something that someone might do if they don’t feel comfortable with using force. But simultaneously don’t feel ready to take the decision to give up control to some specific political process).
The reason that I keep going on about the need for Alignment Target Analysis (ATA) is that there seems to currently exist exactly zero people in the world devoted to doing ATA full time. Making enough ATA progress to reduce the probability of bad outcomes is also tractable (trying to solve ATA would be a completely different thing. But there still exists a lot of low hanging fruit in terms of ATA progress that reduces the probability of bad outcomes). It thus seems entirely possible to me that we will end up with a PCEV style catastrophe that could have been easily prevented. Reducing the probability of that seems like a reasonable thing to do. But it is not being done.
An attempt to summarise how I view the situation
At our current level of ATA progress, it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction. Intuitively, this seems like a dangerous situation, and I simply don’t see how one can think that it is safe to stay at this level of progress. The fact that there exists no research project dedicated to improving this situation seems like a mistake (as illustrated by my post, reducing the probability of bad outcomes is a tractable research project). It seems like many people do have some reason for thinking that the current state of affairs is acceptable. As far as I can tell, however, these reasons are not made public. This is why I think that it makes sense to spend time on trying to figure out what you believe to be true, and why you believe it to be true (and this is also why I appreciate you engaging on this).
In other words: arguing that ATA should be a small percentage of AI safety work would be one type of argument. Arguing that the current situation is reasonable would be a fundamentally different type of argument. It is clearly the case that plenty of people are convinced that it is reasonable to stay at the current level of ATA progress (in other words: many people are acting in a way that I can only explain if I assume that they feel very confident that it is safe to stay at our current level of ATA progress). I think that they are wrong about this. But since no argument in favour of this position is ever outlined in detail, there is no real way of addressing it directly.
PS:
I’m fine with continuing this discussion here. But it probably makes sense to at least note that it would have fitted better under this post (which makes the case for analysing these types of alignment targets, and actually discusses the specific topic of why various types of Non-Sovereign-AIs would not remove the need to do this now). As a tangent, the end of that post actually explicitly asked people to outline their reasons for thinking that ATA is not needed now, and your response here seems to be an example of this. I don’t think you are the only one who has ideas along these lines. I think that there are plenty of people with similar ways of looking at things, and I really wish that those people would clearly outline their reasons for thinking that the current situation is reasonable, because I think that those reasons will fall apart if they are outlined in any actual detail. So I really appreciate that you are engaging on this, and I really wish that more people would do the same.
Thanks! I don’t have time to process this all right now, so I’m just noting that I do want to come back to it quickly and engage fully.
Here’s my position in brief: I think analyzing alignment targets is valuable. Where my current take differs from yours (I think) is that I think the effort would be best spent analyzing what you term corrigibility in the linked post (I got partway through and will have to come back to it), and what I’ve called instruction-following.
I think that’s far more important to do first, because that’s approximately what people are aiming for right now. I fully agree that there are other values mixed in with the training other than instruction-following. I think the complexity and impurity of that target makes it more urgent, not less, to have good minds analyzing the alignment targets that developers are most likely to pursue first by default. See my recent post Seven sources of goals in LLM agents. This is my main research focus, but I know of no one else focusing on this, and few people who even give it part-time attention. This seems like a bad allocation of resources; there might be major flaws in the alignment target that we don’t find until developers are far along that path and reluctant to rework it.
You said
I wrote a little more about this in Intent alignment as a stepping-stone to value alignment.
I definitely do not think it would be safe to assume that IF/corrigible AGI can solve value alignment for other/stronger AGI. John Wentworth’s The Case Against AI Control Research has a pretty compelling argument for how we’d collaborate with sycophantic parahuman AI/AGI to screw up aligning the next step in AGI/ASI.
I do not think any of this is safe. I think we are long past the point where we should be looking for perfectly reliable solutions. I strongly believe we must look for the best viable solutions, factoring in the practicality/likelihood of getting them actually implemented. I worry that the alignment community’s desire for the best (let alone “provably safe”) alignment solutions will prevent them from working on solutions that give us the best possible shot within the economic, psychological, and political realities governing the creation of AGI.
So we seem to be in agreement that the current alignment target of instruction-following/corrigibility should really get more analysis. I am working on that. I’ll just toss out one difficulty I’m reckoning with, which no one has (at least explicitly) recognized AFAIK: IF AGI must be oriented to prioritize taking and following new instructions over old ones (otherwise it won’t listen for new instructions, since that would risk not achieving all the goals from past instructions). With that as its first priority, it would avoid being shut down, since being shut down would prevent it from hearing further instructions.
This is the sort of alignment target analysis that thinkers like you could help with. I wish we just had far more people thinking about this. Given the sharp limitations, it still seems like prioritizing IF/corrigibility (and the messy mix of moralistic behavior devs are training for) is a higher priority than value alignment targets that we perhaps “should” use but in fact almost certainly will not.
Much more in a few days, after I publish my next piece on the complexities of instruction-following mixed with other implicit/trained goals and fully process your pieces! Thanks for engaging. I appreciate your efforts in this direction, whether or not you decide to analyze intent alignment targets other than value alignment targets.
It seems to me that we are going in circles and talking past each other to some degree in the discussion above. So I will just briefly summarise my position on the main topics that you raise (I’ve argued for these positions above; here I’m just summarising), and then give a short outline of the argument for analysing Sovereign AI proposals now.
Regarding the relative priority of different research efforts:
The type of analysis that I am doing in the post is designed to reduce one of the serious AI risks that we face. This risk is due to a combination of the fact that (i): we might end up with a successfully implemented Sovereign AI proposal that has not been analysed properly, and the fact that (ii): the successful implementation of a reasonable-sounding Sovereign AI proposal might lead to a massively worse than extinction outcome. In other words: reducing the risk of a massively worse than extinction outcome is a tractable research project (specifically: this risk can be reduced by analysing the types of alignment targets that the post is analysing). This research project is currently not being pursued. Other efforts are needed to reduce other types of risks. And it is certainly possible for reasonable people to disagree substantially on how attention would best be allocated. But it still seems very clear to me that the current situation is a serious mistake.
I don’t actually know what the optimal allocation of attention would be. But I have been in contact with a lot of people during the last few years, and I have never gotten any form of pushback when I say that there currently exist exactly zero people in the world dedicated to the type of analysis that I am talking about. So whatever the optimal ratio is, I am confident that the type of analysis that I am advocating for deserves more attention. (It might of course be perfectly reasonable for a given AI safety researcher to decide not to personally pursue this type of analysis. But I am confident that the overall situation is not reasonable. It simply cannot be reasonable to have zero people dedicated to a tractable research project that reduces the probability of a massively worse than extinction outcome.)
Regarding the type of Instruction-Following AGI (IFAGI) that you mention:
The successful implementation of such an IFAGI would not reliably prevent a Sovereign AI proposal from being successfully implemented later. And this Sovereign AI proposal might be implemented before it has been properly analysed. This means that the IFAGI idea does not remove the need for the type of risk-mitigation focused research project that the post is an example of. In other words: such an IFAGI might not result in a lot of time to analyse Sovereign AI proposals, and such an IFAGI might not be a lot of help when analysing Sovereign AI proposals. So even if we assume that an IFAGI will be successfully implemented, this would still not remove the need for the type of analysis that I am talking about. (Conditioned on such an IFAGI being successfully implemented, we might get a lot of time, and we might get a lot of help with analysis. But we might also end up in a situation where we do not have much time, and where the IFAGI does not dramatically increase our ability to analyse Sovereign AI proposals.)
Regarding perfect solutions and provably safe AI:
I am not trying to do anything along the lines of proving safety. What I am trying to do is better described as trying to prove un-safety. I look at some specific proposed AI project plan (for example an AI project plan along the lines of: first humans are augmented, then those augmented humans build some form of non-Sovereign AI, and then they use that non-Sovereign AI to build an AI Sovereign that implements the CEV of Humanity). And then I explain why the success of this project would be worse than extinction (in expectation, from the perspective of a human individual, for the reasons outlined in the post). So I am in some sense looking for definitive answers, but more along the lines of provable catastrophe than provable safety. What I am trying to do is a bit like attempting to conclusively determine that a specific location contains a landmine (where a specific AI project plan being successfully implemented is analogous to a plan that ends with someone standing on the location of a specific landmine). It is very different from attempting to conclusively determine that a specific path is safe. (Just wanted to make sure that this is clear.)
A very brief outline of the argument for analysing Sovereign AI proposals now:
Claim 1: We might end up with a successfully implemented AI Sovereign. Even if the first clever thing created is not an AI Sovereign, an AI Sovereign might be developed later. Augmented humans, non-Sovereign AIs, etc., might be followed by an AI Sovereign. (See for example the proposed path to an AI Sovereign described on the CEV arbital page).
Claim 2: In some scenarios that end in a successfully implemented AI Sovereign, we will not get a lot of time to analyse Sovereign AI proposals. (For example due to Internal Time Pressure. See also this subsection for an explanation of why shutting down competing AI projects might not buy a lot of time. See also the last section of this comment, which outlines one specific scenario where a tool-AI successfully shuts down all unauthorised AI projects, but does not buy a lot of time).
Claim 3: In some scenarios that end in a successfully implemented AI Sovereign, we will not get a lot of help with analysis of Sovereign AI proposals. (Partly because asking an AI for a good Sovereign AI proposal is like asking an AI what goal it should have. See also this subsection on the idea of having AI assistants help with analysis. This subsection and this section argue that augmented humans might turn out to be good at hitting alignment targets, but not good at analysing alignment targets).
Claim 4: A reasonable sounding Sovereign AI proposal might lead to a massively worse than extinction outcome. (See for example the PCEV thought experiment).
Claim 5: Noticing such issues is not guaranteed. (For example illustrated by the fact that the problem with PCEV went unnoticed for many years).
Claim 6: Reducing the probability of such outcomes is possible. Reducing this probability is a tractable research project, because risk can be reduced without finding any good Sovereign AI proposals. (For example illustrated by the present post, or the PCEV thought experiment).
Claim 7: There exist exactly zero people in the world dedicated to this tractable way of reducing the probability of a massively worse than extinction outcome. (It is difficult to prove the non-existence of something. But I have been saying this for quite a while now, while talking to a lot of different people, and I have never gotten any form of pushback on this).
Conclusion: We might end up in a worse than extinction outcome because a successfully implemented Sovereign AI proposal has a flaw that was realistically findable. It would make sense to spend a non-tiny amount of effort on reducing the probability of this.
(People whose intuition says that this conclusion must surely be false in some way could try to check whether or not this intuition is actually based on anything real. The most straightforward way would be to spell out the actual argument for this in public, so that the underlying logic can be checked. Acting on the assumption that such an intuition is based on something real, without at least trying to evaluate it first, does not sound like a good idea.)