The case for more Alignment Target Analysis (ATA)
Summary
We don’t have good proposals for alignment targets: The most recently published version of Coherent Extrapolated Volition (CEV), a fairly prominent alignment target, is Parliamentarian CEV (PCEV). PCEV gives a lot of extra influence to anyone who intrinsically values hurting other individuals (search the CEV Arbital page for “ADDED 2023” to find Yudkowsky’s description of the issue). This feature went unnoticed for many years and would make a successfully implemented PCEV very dangerous.
Bad alignment target proposals are dangerous: There is no particular reason to think that discovery of this problem was inevitable. It went undetected for many years. There are also plausible paths along which PCEV (or a proposal with a similar issue) might have ended up being implemented. In other words: PCEV posed a serious risk. That risk has probably been mostly removed by the Arbital update. (It seems unlikely that someone would implement a proposed alignment target without at least reading the basic texts describing the proposal.) PCEV is however not the only dangerous alignment target, and risks from scenarios where someone successfully hits some other bad alignment target remain.
Alignment Target Analysis (ATA) can reduce these risks. We will argue that more ATA is needed and urgent. ATA can informally be described as analyzing and critiquing Sovereign AI proposals, for example along the lines of CEV. By Sovereign AI we mean a clever and powerful AI that will act autonomously in the world (as opposed to tool AIs or a pivotal act AI of the type that follows human orders and that can be used to shut down competing AI projects). ATA asks what would happen if a Sovereign AI project were to succeed at aligning their AI to a given alignment target.
ATA is urgent. The majority of this post will focus on arguing that ATA cannot be deferred. A potential Pivotal Act AI (PAAI) might fail to buy enough calendar time for ATA since it seems plausible that a PAAI wouldn’t be able to sufficiently reduce internal time pressure. Augmented humans and AI assistants might fail to buy enough subjective time for ATA. Augmented humans that are good at hitting alignment targets will not necessarily be good at analyzing alignment targets. Creating sufficiently helpful AI assistants might be hard or impossible without already accidentally locking in an alignment target to some extent.
ATA is extremely neglected. The field of ATA is at a very early stage, and currently there does not exist any research project dedicated to ATA. The present post argues that this lack of progress is dangerous and that this neglect is a serious mistake.
A note on authorship
This text largely builds on previous posts by Thomas Cederborg. Chi contributed mostly by trying to make the text more coherent. While the post is largely written from the “we” perspective, Chi hasn’t thought deeply about many of the points in this post yet, isn’t sure what she would endorse on reflection, and disagrees with some of the details. Her main motivation was to make Thomas’ ideas more accessible to people.
Alignment Target Analysis is important
This post is concerned with proposals for what to align a powerful and autonomous AI to, for example proposals along the lines of Coherent Extrapolated Volition (CEV). By powerful and autonomous we mean AIs that are not directly controlled by a human or a group of humans, as opposed to the types of proposed AI systems that some group of humans might use for limited tasks, such as shutting down competing AI projects. We will refer to this type of AI as Sovereign AI throughout this post. Many people both consider it possible that such an AI will exist at some point, and further think that it matters what goal such an AI would have. This post is addressed to such an audience. (The imagined reader does not necessarily think that creating a Sovereign AI is a good idea, just that it is possible, and that if it happens, it matters what goal such an AI has.)
A natural conclusion from this is that we need Alignment Target Analysis (ATA) at some point. A straightforward way of doing ATA is to take a proposal for something we should align an AI to (for example: the CEV of a particular set of people) and then ask what would happen if someone were to successfully hit this alignment target. We think this kind of work is very important. Let’s illustrate this with an example.
The most recently published version of CEV is based on extrapolated delegates negotiating in a Parliament. Let’s refer to this version of CEV as Parliamentarian CEV (PCEV). It turns out that the proposed negotiation rules of the Parliament give a very large advantage to individuals who intrinsically value hurting other individuals. People who want to inflict serious harm get a bigger advantage than people who want to inflict less serious harm. The largest possible advantage goes to any group that wants PCEV to hurt everyone else as much as possible. This feature makes PCEV very dangerous. However, it went unnoticed for many years, despite PCEV being a fairly prominent proposal. This illustrates three things:
First, it shows that noticing problems with proposed alignment targets is difficult.
Second, it shows that successfully implementing a bad alignment target can result in a very bad outcome.
Third, it shows that reducing the probability of such scenarios is feasible (the fact that this feature has been noticed makes it a lot less likely that PCEV will end up getting implemented).
This example shows that getting the alignment target right is extremely important and that even reasonable seeming targets can be catastrophically bad. The flaws in PCEV’s negotiation rules are also not unique to PCEV. An AI proposal from 2023 uses similar rules and hence suffers from related problems. The reason that more ATA is needed is that finding the right target is surprisingly difficult, noticing flaws is surprisingly difficult, and targets that look reasonable might lead to catastrophic outcomes.
The present post argues that making progress on ATA is urgent. As shown, the risks associated with scenarios where someone successfully hits a bad alignment target are serious. Our main thesis is that there might not be time to do ATA later. If one considers it possible that a Sovereign AI might be built, then the position that doing ATA now is not needed must rest on some form of positive argument. One class of such arguments is based on an assertion that ATA has already been solved. We already argued that this is not the case.
Another class of arguments is based on an assertion that all realistic futures fall into one of two possible categories, (i): scenarios with misaligned AI (in which case ATA is irrelevant), or (ii): scenarios where there will be plenty of time to do ATA later and so we should defer it to future, potentially enhanced humans and their AI assistants. The present post will be focused on countering arguments along these lines. We will further argue that these risks can be reduced by doing ATA. The conclusion is that it is important that ATA work starts now. However, there does not appear to exist any research project dedicated to ATA. This seems like a mistake to us.
Alignment Target Analysis is urgent
Let’s start by briefly looking at one common class of AI safety plans that does not feature ATA until a much later point. It goes something like this: Let’s make AI controllable, i.e. obedient, helpful, not deceptive, ideally without long-term goals of its own, just there to follow our instructions. We don’t align those AIs to anything more ambitious or object-level. Once we succeed at that, we can use those AIs to help us figure out how to build a more powerful AI sovereign safely and with the right kind of values. We’ll be much smarter with the help of those controllable AI systems, so we’ll also be in a better position to think about what to align sovereign AIs to. We can also use these controllable AI systems to buy more time for safety research, including ATA, perhaps by doing a pivotal act (in other words: use some form of instruction-following AI to take actions that shut down competing AI projects). So, we don’t have to worry about more ambitious alignment targets yet. In summary: We don’t actually need to worry about anything right now other than getting to the point where we have controllable AI systems that are strong enough to either speed up our thinking or slow down AI progress or, ideally, both.
One issue with such proposals is that it seems very difficult to us to make a controllable AI system that is able to substantially assist you with ATA, or that can buy you a lot of time, without already implicitly having substantially chosen an alignment target, i.e. without accidental lock-in. If this is true, ATA is time-sensitive because it needs to happen before and alongside the development of controllable AI systems.
Why we don’t think the idea of a Pivotal Act AI (PAAI) obsoletes doing ATA now
Now, some argue that we can defer ATA by building a Pivotal Act AI (PAAI) that can stop all competing AI projects and hence buy us unlimited time. There are two issues with this: First, PAAI proposals need to balance buying time and avoiding accidental lock-in. The more effective an AI is at implementing a pivotal act of a type that reliably prevents bad outcomes, the higher the risk you have already locked something in.
For an extreme example, if your pivotal act is to have your AI autonomously shut down all “bad AI projects”, then you have almost certainly already locked in some values. A similar issue also makes it difficult for an AI assistant to find a good alignment target without many decisions having already been made (more below). If a system reliably shuts down all bad AIs, then the system will necessarily be built on top of some set of assumptions regarding what counts as a bad AI. This would mean that many decisions regarding the eventual alignment target have already been made (which in turn means that ATA would have to happen before any such AI is built). And if the AI does not reliably shut down all bad AI projects, then decisions will be in the hands of humans that might make mistakes.
Second, and more importantly, we haven’t yet seen enough evidence that a good pivotal act is actually feasible and that people will pursue it. In particular, current discussions of pivotal act AI seem to neglect internal time pressure. For example, we might end up in a situation where early AI is in the hands of a messy coalition of governments that are normally adversaries. Such a coalition is unlikely to pursue a unified, optimized strategy. Some members of the coalition will probably be under internal political pressure to train and deploy the next generation of AIs. Even rational, well informed, and well intentioned governments might decide to take a calculated risk and act decisively before the coalition collapses.
If using the PAAI requires consensus, then the coalition might decide to take a decisive action before an election in one of the countries involved. Even if everyone involved is aware that this is risky, the alternative, ending up in a situation where the PAAI can no longer be used to prevent competing AI projects, might be seen as even riskier. An obvious such action would be to launch a Sovereign AI, aiming at whatever happens to be the state of the art alignment target at the time (in other words: build an autonomous AI with whatever goal is the current state of the art proposal at the time). Hence, even if we assume that the PAAI in question could be used to give them infinite time, it is not certain that a messy coalition would use it in this way, due to internal conflicts.
Besides issues related to reasonable people trying to do the right thing by taking calculated risks, another issue is that the leaders of some countries might prefer that all important decisions are made before their term of office expires (for example by giving the go ahead to a Sovereign AI project that is aiming at their favorite alignment target).
An alternative to a coalition of powerful countries would be to have the PAAI be under the control of a global electorate. In this case, a large but shrinking majority might decide to act before existing trends turn their values into a minority position. Political positions changing in fairly predictable ways is an old phenomenon. Having a PAAI that can stop outside actors from advancing unauthorized AI projects wouldn’t change that.
In addition, if we are really unlucky, corrigibility of weak systems can make things worse. Consider the case where a corrigibility method (or whatever method you use to control your AIs) turns out to work for an AI that is used to shut down competing AI projects, but does not work for sovereign AIs. If the people in charge have such a partially functional corrigibility technique, they might take the calculated risk of launching a sovereign AI that they hope is also corrigible (thinking that this is likely, because the method worked on a non-sovereign AI). Thus, if the state of the art alignment target has a flaw, then discovering this flaw is urgent. See also this post.
To summarize: Even if someone can successfully prevent outside actors from making AI progress, i.e. if we assume the existence of a PAAI that could, in principle, be used to give humanity infinite time for reflection, that doesn’t guarantee a good outcome. Some group of humans would still be in control (since it is not possible to build a PAAI that prevents them from aiming at a bad alignment target without locking in important decisions). That group might still find themselves in a time crunch due to internal power struggles and other dynamics. In this case, the humans might decide to take a calculated risk and aim at the best alignment target they know of (which at the current level of ATA progress would be exceptionally dangerous).
However, this group of humans might be open to clear explanations of why their favorite alignment target contains a flaw that would lead to a catastrophic outcome. An argument of the form “the alignment target that you are advocating for would have led to this specific horrific outcome, for these specific reasons” might be enough to make part of a shrinking majority hesitate, even if they would strongly prefer that all important decisions are finalized before they lose power. First, however, the field of ATA would need to advance to the point where it is possible to notice the problem in question.
Why we don’t think human augmentation and AI assistance obsolete doing ATA now
Some people might argue that we can defer ATA to the future not because we will have virtually unlimited calendar time but because we will have augmented humans or good AI assistants that will allow us to do ATA much more effectively in the future. This might not buy us much time in calendar months but a lot of time in subjective months to work on ATA.
Why we don’t think the idea of augmenting humans obsoletes doing ATA now
If one is able to somehow create smarter augmented humans, then it is possible that everything works out even without any non-augmented human ever making any ATA progress at all. In order to conclude that this idea obsoletes doing ATA now, however, one needs to make a lot of assumptions. It is not sufficient to assume that a project will succeed in creating augmented humans that are both very smart and also well intentioned.
For example, the augmented humans might be very good at figuring out how to hit a specified alignment target while not being very good at ATA, since these are two different types of skills. One issue is that making people better at hitting alignment targets might simply be much easier than making them better at ATA. A distinct issue is that (regardless of relative difficulty levels) the first project that succeeds at creating augments that are good at hitting alignment targets might not have spent a lot of effort to ensure that these augments are also good at ATA. In other words: augmented humans might not be good at ATA, simply because the first successful project never even bothered to try to select for this.
It is important to note that ATA can still help prepare us for scenarios with augmented humans that aren’t better than non-augmented humans at ATA, even if it does not result in any good alignment target. To be useful, ATA only needs to find the flaw in a given alignment target (before the augmented humans respond to some time crunch by taking the calculated risk of launching a Sovereign AI aimed at this alignment target). If the flaw is found in time, then the augmented humans would have no choice other than to keep trying different augmentation methods, until this process results in some mind that is able to make genuine progress on ATA (because they do not have access to any nice-seeming alignment targets).
Accidental value lock-in vs. competence tension for AI assistants
When it comes to deferring to future AI assistants, we have additional issues to consider: We want a relatively weak controllable AI assistant that can help a lot with ATA. And we don’t want this AI to effectively lock in a set of choices. However, there is a problem. The more an AI system is able to help you with ATA, the greater the risk that you have already locked in some values accidentally.
Consider an AI that is just trying to help us achieve “what we want to achieve”. Once we give it larger and larger tasks, the AI has to do a lot of interpretation to understand what that means. For an AI to be able to help us achieve “what we want to achieve”, and prevent us from deviating from this, it must have a definition of what that means. Finding a good definition of “what we want to achieve” likely requires value judgments that we don’t want to hand over to AIs. If the system has a definition of “what we want to achieve”, then some choices are effectively already made.
To illustrate: For “help us achieve what we want to achieve” to mean something, one must specify how to deal with disagreements among individuals who disagree on how to deal with disagreements. Without specifying this, one cannot refer to “we”. There are many different ways of dealing with such disagreements, and they imply importantly different outcomes. One example of how one can deal with such disagreements is the negotiation rules of PCEV, mentioned above. In other words: if an AI does not know what is meant by “what we want to achieve”, then it will have difficulties helping us solve ATA. But if it does know what “what we want to achieve” means, then important choices have already been made. And if the choice had been made to use the PCEV way of dealing with disagreements, then we would have locked in everything that is implied by this choice. This includes locking in the fact that individuals who intrinsically value hurting other individuals will have far more power over the AI than individuals who do not have such values.
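To make this concrete, here is a minimal toy sketch. The three aggregation rules, the group sizes, and the satisfaction scores below are made up for illustration; they are not the rules of PCEV or of any other actual proposal. The point is only that the same individual preferences, combined under different disagreement-resolution rules, select importantly different outcomes.

```python
# Toy illustration (hypothetical numbers, not any actual proposal):
# the same individual preferences, combined under different
# disagreement-resolution rules, pick different outcomes.

# Each group: (number of people, satisfaction score 0-10 for each option).
groups = {"A": (60, {"x": 10, "y": 6, "z": 5}),
          "B": (30, {"x": 0,  "y": 9, "z": 5}),
          "C": (10, {"x": 0,  "y": 2, "z": 9})}

options = ["x", "y", "z"]

def total_satisfaction(opt):
    # Sum-style rule: maximize population-weighted total satisfaction.
    return sum(size * scores[opt] for size, scores in groups.values())

def worst_off_group(opt):
    # Maximin-style rule: maximize the satisfaction of the worst-off group.
    return min(scores[opt] for _, scores in groups.values())

def majority_choice():
    # Simple majority rule: the largest group's top option wins outright.
    size, scores = max(groups.values(), key=lambda g: g[0])
    return max(scores, key=scores.get)

print(max(options, key=total_satisfaction))  # 'y' (totals: x=600, y=650, z=540)
print(max(options, key=worst_off_group))     # 'z' (worst-off: x=0, y=2, z=5)
print(majority_choice())                     # 'x' (group A is 60% and ranks x first)
```

Whichever such rule an assistant’s designers bake in, explicitly or implicitly, a choice of this kind has already been made on everyone’s behalf.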
If we instead consider scenarios with less powerful AI systems that just don’t pose any lock-in risk by default, then those systems might not be able to provide substantial help with ATA: current AIs seem better at tasks that have lots of data, short horizons, and little conceptual content, none of which seems to apply to ATA.
None of this is meant to suggest that AI assistants cannot help with ATA. It is entirely possible that some form of carefully constructed AI assistant will speed up ATA progress to some degree, without locking in any choices (one could for example build an assistant that has a unique perspective but is not smarter than humans. Such an AI might provide meaningful help with conceptual work, without its definitions automatically dominating the outcome). But even if this does happen, it is unlikely to speed you up enough to obsolete current work.
Spuriously opinionated AI assistants
AI systems might also just genuinely be somewhat opinionated for reasons that are not related to anyone making a carefully considered tradeoff. If the AI is opinionated in an unintended way and its opinions matter for what humans choose to do, we run the risk of already having accidentally chosen some alignment target by the time we have designed the helpful, controllable AI assistant. We just don’t know what alignment target we have chosen.
If we look at current AI systems, this scenario seems fairly plausible. Current AIs aren’t actually trained purely for “what the user wants” but instead are directly trained to comply with certain moral ideas. It seems very plausible that these moral ideas (alongside whatever random default views the AI has about, say, epistemics) will make quite a difference for ATA. It also seems plausible that current AIs are already quite influential on people’s attitudes and will increasingly become so. This problem exists even if careful efforts are directed towards avoiding it.
Will we actually have purely corrigible AI assistants?
There exists a third issue, separate from both issues mentioned above: even if some people do plan to take great care when building AI assistants, there is no guarantee that such people will be the first ones to succeed. It does not seem to us that everyone is in fact paying careful attention to what kinds of values and personalities we are currently training into our AIs. As a fourth separate issue, despite all the talk about corrigibility and intent alignment, it doesn’t seem obvious at all that most current AI safety efforts differentially push towards worlds where AIs are obedient, controllable, etc., as opposed to having specific contentful properties.
Relationship between ATA and other disciplines
There are many disciplines that seem relevant to ATA, such as voting theory, moral uncertainty, axiology, bargaining, political science, and moral philosophy. Studying solutions in these fields is an important part of ATA work. But it is necessary to remember that the lessons learned by studying these different contexts might not be valid in the AI context. Since concepts can behave in new ways in the AI context, studying these other fields cannot replace ATA. This implies that in order to build up good intuitions about how various concepts will behave in the AI context, it will be necessary to actually explore these concepts in the AI context. In other words: it will be necessary to do ATA. This is another reason for thinking that the current lack of any serious research effort dedicated to ATA is problematic.
Let’s illustrate the problem of transferring proposals from different contexts to AI with PCEV as an example. The problem that PCEV suffers from as an alignment target is not an issue in the original proposal. The original proposal made by Bostrom is a mapping from a set of weighted ethical theories and a situation to a set of actions (which an individual can use to find a set of actions that can be given the label “morally permissible”). It is unlikely that a given person will put credence in a set of ethical theories that specifically refer to each other, and specifically demand that other theories be hurt as much as possible. In other words: ethical theories that want to hurt other theories do get a negotiation advantage in the original proposal, but this advantage is not a problem in the original context.
In a population of billions however, some individuals will want to hurt other individuals. So here the negotiation advantage is a very big problem. One can describe this as the concept behaving very differently when it is transferred to the AI context. There is nothing particularly unusual about this. It is fairly common for ideas to stop working when they are used in a completely novel context. But it is still worth making this explicit, and important to keep this in mind when thinking about alignment target proposals that were originally designed for a different context. Because there are many aspects of the AI context that are quite unusual.
To illustrate this with another example, consider a concept from ordinary politics transferred to the AI context. Let’s write Condorcet AI (CAI) for any AI that picks outcomes using a rule that conforms to the Condorcet Criterion or Garrabrant’s Lottery Condorcet Criterion. If a barely caring 51% solid majority (who agree about everything) would sort of prefer that a 49% minority be hurt as much as possible, then any CAI will hurt the 49% minority as much as it can. (It follows directly from the two linked definitions that a 51% solid majority always gets their highest ranked option implemented without compromise.) Ordinary politics does have issues with minorities being oppressed. But in ordinary politics there does not exist any entity that can suddenly start massively oppressing a 49% minority without any risk or cost. And without extrapolation, solid majorities are a lot less important as a concept. Therefore, ordinary politics does not really contain anything corresponding to the above scenario. In other words: the Condorcet Criterion behaves differently when it is transferred to the AI context.
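As a quick sanity check of the solid-majority claim, here is a toy computation. It covers only the standard deterministic Condorcet Criterion (Garrabrant’s Lottery Condorcet Criterion is not modeled), and the ballots and option names are invented for illustration:

```python
# Toy illustration: a 51-voter "solid majority" that agrees on a full ranking
# always gets its top option, since that option wins every pairwise contest.

ballots = ([["harm_minority", "compromise", "leave_alone"]] * 51 +
           [["leave_alone", "compromise", "harm_minority"]] * 49)

options = ["harm_minority", "compromise", "leave_alone"]

def prefers(a, b, ranking):
    # True if this ballot ranks option a above option b.
    return ranking.index(a) < ranking.index(b)

def condorcet_winner(options, ballots):
    # Returns the option that beats every other option in pairwise
    # majority contests, or None if no such option exists.
    for cand in options:
        if all(sum(prefers(cand, other, r) for r in ballots) > len(ballots) / 2
               for other in options if other != cand):
            return cand
    return None

print(condorcet_winner(options, ballots))
# -> 'harm_minority' (it beats both alternatives 51 to 49, so any rule
#    satisfying the Condorcet Criterion must select it)
```

The mechanism is simply that once a 51% bloc agrees on a complete ranking, its top option wins every pairwise contest, so every Condorcet-compliant rule must select it; no compromise is forced by the remaining 49%.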
Alignment Target Analysis is tractable
We’ve argued that alignment to the wrong alignment target can be both catastrophic and non-obvious. We also argued that people might need to, want to, or simply will align their AIs to a specific target relatively soon, without sufficient help from AI assistants or the ability to stall for time. This makes ATA time-sensitive and important. It is also tractable. One way that such research could move forward would be to iterate through alignment targets the usual scientific way: Propose them. Wait until someone finds a critical flaw. Propose an adjustment. Wait until someone finds a critical flaw in the new state of the art. Repeat. Hopefully, this will help us identify necessary features of good alignment targets. While it seems really hard to tell whether an alignment target is good, this helps us to at least tell when an alignment target is bad. And noticing that a bad alignment target is in fact bad reduces the danger of it being implemented.
A more ambitious branch of ATA could try to find a good alignment target instead of purely analyzing existing proposals. Coming up with a good alignment target and showing that it is good seems much, much harder than finding flaws in existing proposals. However, the example with PCEV showed that it is possible to reduce these dangers without finding any good alignment target. In other words: an ATA project does not have to attempt to find a solution to be valuable, because it can still reduce the probability of worst-case outcomes.
It is also true in general that looking ahead, and seeing what is waiting for us down the road, might be useful in hard-to-predict ways. It all depends on what one finds. Perhaps, to the extent that waiting for humanity to become more capable before committing to an alignment target, or stalling for time, is possible (just not guaranteed), ATA can help motivate doing so. It’s possible that, after some amount of ATA, we will conclude that humans, as we currently exist, should never try to align an AI to an alignment target we came up with. In such a scenario we might have no choice but to hope that enhanced humans will be able to handle this (even though there is no guarantee that enhancing the ability to hit an alignment target will reliably enhance the ability to analyze alignment targets).
Limitations of the present post, and possible ways forwards
There is a limit to how much can be achieved by arguing against a wide range of unstated arguments, implicit in the non-existence of any current ATA research project. Many people both consider it possible that a powerful autonomous AI will exist at some point, and also think that it matters what goal such an AI would have. So the common implicit position that ATA now is not needed must rest on positive argument(s). These arguments will be different for different people, and it is difficult to counter all possible arguments in a single post. Each such argument is best treated separately (for example along the lines of these three posts, which each deal with a specific class of arguments).
The status quo is that not much ATA is being done, so we made a positive case for it. However, the situation to us looks as follows: we will need an alignment target eventually, and alignment targets that intuitively sound good might be extremely bad, maybe worse-than-extinction bad, in ways that aren’t obvious.
This seems like a very bad and dangerous situation to be in. To argue that we should stay in this situation, without at least making a serious effort to improve things, requires a positive argument. In our opinion, the current discourse arguing for focusing exclusively on corrigibility, intent alignment, human augmentation, and buying time (because they help with ATA in the long run) does not succeed at providing such an argument. Concluding that some specific idea should be pursued does not imply that the idea in question obsoletes doing ATA now. But the position that doing ATA now is not needed is sort of implicit in the current lack of any research project dedicated to ATA. ATA seems extremely neglected with, as far as we can tell, zero people working on it full time.
We conclude this post by urging people who feel confident that doing ATA now is not needed to make an explicit case for this position. The fact that there currently does not exist any research project dedicated to ATA indicates that there exist plenty of people who consider this state of affairs to be reasonable (probably for a variety of different reasons). Hopefully the present text will lead to the various arguments in favor of this position that people find convincing being made explicit and public. A natural next step would then be to engage with those arguments individually.
Acknowledgements
We would like to thank Max Dalton, Oscar Delaney, Rose Hadshar, John Halstead, William MacAskill, Fin Moorhouse, Alejandro Ortega, Johanna Salu, Carl Shulman, Bruce Tsai, and Lizka Vaintrob, for helpful comments on an earlier draft of this post. This does not imply endorsement.
It seems very plausible to me that alignment targets in practice will evolve out of things like the OpenAI Model Spec. If anyone has suggestions for how to improve that, please DM me.
I interpret your comment as a prediction regarding where new alignment target proposals will come from. Is this correct?
I also have a couple of questions about the linked text:
How do you define the difference between explaining something and trying to change someone’s mind? Consider the case where Bob is asking a factual question. An objectively correct straightforward answer would radically change Bob’s entire system of morality, in ways that the AI can predict. A slightly obfuscated answer would result in far less dramatic changes. But those changes would be in a completely different direction (compared to the straightforward answer). Refusing to answer, while being honest about the reason for refusal, would send Bob into a tailspin. How certain are you that you can find a definition of Acceptable Forms of Explanation that holds up in a large number of messy situations along these lines? See also this.
And if you cannot define such things in a solid way, how do you plan to define “benefit humanity”? PCEV was an effort to define “benefit humanity”. And PCEV has been found to suffer from at least one difficult-to-notice problem. How certain are you that you can find a definition of “benefit humanity” that does not suffer from some difficult-to-notice problem?
PS:
Speculation regarding where novel alignment target proposals are likely to come from are very welcome. It is a prediction of things that will probably be fairly observable fairly soon. And it is directly relevant to my work. So I am always happy to hear this type of speculation.
Hi!
I agree it’s neglected, but there is in fact at least one research project dedicated to at least designing alignment targets: the part of the formal alignment agenda dedicated to formal outer alignment, which is the design of math problems to which solutions would be world-saving. Our notable attempts at this are QACI and ESP (there was also some work on a QACI2, but it predates (and in my opinion is superseded by) ESP).
Those try to implement CEV in math. They only work for doing CEV of a single person or small group, but that’s fine: just do CEV of {a single person or small group} which values all of humanity/moral-patients/whatever getting their values satisfied instead of just that group’s values. If you want humanity’s values to be satisfied, then “satisfying humanity’s values” is not opposite to “satisfy your own values”, it’s merely the outcome of “satisfy your own values”.
I think I see your point. Attempting to design a good alignment target could lead to developing intuitions that would be useful for ATA. A project trying to design an alignment target might result in people learning skills that allow them to notice flaws in alignment targets proposed by others. Such projects can therefore contribute to the type of risk mitigation that I think is lacking. I think that this is true. But I do not think that such projects can be a substitute for an ATA project with a risk mitigation focus.
Regarding Orthogonal:
It is difficult for me to estimate how much effort Orthogonal spends on different types of work. But it seems to me that your published results are mostly about methods for hitting alignment targets. This also seems to me to be the case for your research goals. If you are successful, it seems to me that your methods could be used to hit almost any alignment target (subject to constraints related to finding individuals that want to hit specific alignment targets).
I appreciate you engaging on this, and I would be very interested in hearing more about how the work done by Orthogonal could contribute to the type of risk mitigation effort discussed in the post. I would, for example, be very happy to have a voice chat with you about this.
[Note: I believe that this proposal is primarily talking about Value-Alignment Target Analysis (VATA), not Intent-Alignment Target Analysis. I think this distinction is important, so I will emphasize it.]
I’m a believer in the wisdom of aiming for intent-aligned corrigible agents (CAST) instead of aiming directly for a value-aligned agent or value-aligned sovereign AI. I am in agreement with you that current proposals around Value-Alignment targets seem likely disastrous.
Questions:
Will we be able to successfully develop corrigible agents in time to be strategically relevant? Or will more risk tolerant, less thoughtful people race to powerful AI first and cause disaster?
What could we humble researchers do about this?
Would focusing on Value-Alignment Target Analysis help?
Will current leading labs correctly adopt and implement the recommendations made by Alignment Target Analysis?
If we do have corrigible agents, and also an international coalition to prevent rogue AGI and self-replicating weapons, and do mutual inspections to prevent any member from secretly defecting and racing toward more powerful AI....
How long do we expect that wary peace to hold?
Will we succeed at pausing for a year or more like ten years? My current guess is that something like 10 years may be feasible.
Thoughts
My current expectation is that we should plan on what to do in the next 2-3 critical years, and prioritize what seems most likely to get to a stable AI-progress-delay situation.
I feel unconvinced that launching a powerful value-locking-in sovereign AI that the launchers hope is Value Aligned with them is something that the key decision makers will be tempted by. My mental model of the likely actors in such a scenario (e.g. US Federal Government & Armed Forces) is that they would be extremely reluctant to irreversibly hand over power to an AI, even if they felt confident that the AI was aligned to their values. I also don’t model them as being interested in accepting a recommendation for a Value Alignment Target which seemed to them to compromise any of their values in favor of other people’s values. I model the Armed Forces (not just US, but as a general pattern everywhere) as tending to be quite confident in the correctness of their own values and the validity of coercing others to obey their values (selection effect strong here).
I do agree that I don’t trust the idea of attempting to rapidly accelerate Value-Alignment Target Research with AI assistants. It seems like the sort of research I’d want to be done carefully, with lots of input from a diverse set of people from many cultures around the world. That doesn’t seem like the sort of problem amenable to manyfold acceleration by AI R&D.
In conclusion:
I do support the idea of you working on Value-Alignment Target Analysis Research, and attempting to recruit people to work on it with you who aren’t currently working on AI safety. I am not yet convinced that it makes sense to shift current AI safety/alignment researchers (including myself) from what they are currently doing to VATA.
However, I’m open to discussion. If you change my mind on some set of the points above, it would likely change my point of view.
Also, I think you are mistaken in saying that nobody is working on this. I think it’s more the case that people working on this don’t want to say that this is what they are working on. Admitting that you are aiming for a Sovereign AI is obviously not a very politically popular statement. People might rightly worry that you’d aim such an AI toward the Value-Alignment Target you personally believe in (which may in turn be biased towards your personal values). If someone didn’t like the proposals you were putting forth because they felt that their own values were underrepresented, they could be tempted to enter into conflict with you or try to beat your Sovereign AI team to the punch by rushing their own Sovereign AI team (or sabotaging yours). So probably there are people out there thinking about this but keeping silent about it.
I do think that there is more general work being done on related ideas of measuring the values of large heterogeneous groups of people, or trying to understand the nature of values and ethics. I don’t think you should write this work off too quickly, since I think a lot of it pertains quite directly to VATA.
Some recent examples:
(note that I’ve read most of these, and some of them express contradictory views, and I don’t feel like any of them perfectly represents my own views. I just want to share some of what I’ve been reading on the topic.)
https://www.lesswrong.com/posts/As7bjEAbNpidKx6LR/valence-series-1-introduction
https://www.lesswrong.com/posts/YgaPhcrkqnLrTzQPG/we-don-t-know-our-own-values-but-reward-bridges-the-is-ought
https://arxiv.org/abs/2404.10636
https://arxiv.org/html/2405.17345v1
https://medium.com/nerd-for-tech/openais-groundbreaking-research-into-moral-alignment-for-llms-7c4e0e5ffe97
https://futureoflife.org/ai/align-artificial-intelligence-with-human-values/
https://arxiv.org/abs/2311.17017
https://arxiv.org/abs/2309.00779
https://www.pnas.org/doi/10.1073/pnas.2213709120
https://link.springer.com/article/10.1007/s43681-022-00188-y
The proposed research project would indeed be focused on a certain type of alignment target: for example proposals along the lines of PCEV, but not proposals along the lines of a tool-AI. Referring to this as Value-Alignment Target Analysis (VATA) would also be a possible notation. I will adopt this notation for the rest of this comment.
The proposed VATA research project would be aiming for risk mitigation. It would not be aiming for an answer:
There is a big difference between proposing an alignment target on the one hand, and pointing out problems with alignment targets on the other. For example: it is entirely possible to reduce risks from a dangerous alignment target without having any idea how one might find a good alignment target. One can actually reduce risks without having any idea what it even means for an alignment target to be a good alignment target.
The feature of PCEV mentioned in the post is an example of this. The threat posed by PCEV has presumably been mostly removed. This did not require anything along the lines of an answer. The analysis of Condorcet AI (CAI) is similar. The analysis simply describes a feature shared by all CAI proposals (the feature that a barely caring solid majority can do whatever they want with everyone else). Pointing this out presumably reduces the probability that a CAI will be launched by designers that never considered this feature. All claims made in the post about a VATA research project being tractable refer to this type of risk mitigation being tractable. There is definitely no claim that a VATA research project can (i): find a good alignment target, (ii): somehow verify that this alignment target does not have any hidden flaws, and (iii): convince whoever is in charge to launch this target.
One can also go a bit beyond analysis of individual proposals, even if one does not have any idea how to find an answer. One can mitigate risk by describing necessary features (for example along the lines of this necessary Membrane formalism feature). This reduces risks from all proposals that clearly do not have such a necessary feature.
(and just to be extra clear: the post is not arguing that launching a Sovereign AI is a good idea. The post is assuming an audience that agrees that it is possible that a Sovereign AI might be launched. And then the post is arguing that if this does happen, then there is a risk that such a Sovereign AI project will be aiming at a bad value alignment target. The post then further argues that this particular risk can be reduced by doing VATA)
Regarding people being skeptical of Value Alignment Target proposals:
If someone ends up with the capability to launch a Sovereign AI, then I certainly hope that they will be skeptical of proposed Value Alignment Targets. Such skepticism can avert catastrophe even if the proposed alignment target has a flaw that no one has noticed.
The issue is that a situation might arise where (i): someone has the ability to launch a Sovereign AI, (ii): there exists a Sovereign AI proposal that no one can find any flaws with, and (iii): there is a time crunch.
Regarding the possibility that there exists people trying to find an answer without telling anyone:
I’m not sure how to estimate the probability of this. From a risk mitigation standpoint, this is certainly not the optimal way of doing things (if a proposed alignment target has a flaw, then it will be a lot easier to notice that flaw if the proposal is not kept secret). I really don’t think that this is a reasonable way of doing things. But I think that you have a point. If Bob is about to launch an AI Sovereign with some critical flaw that would lead to some horrific outcome, then a secretly working Steve might be able to notice this flaw. And if Bob is just about to launch his AI, and speaking up is the only way for Steve to prevent Bob from causing a catastrophe, then Steve will presumably speak up. In other words: the existence of people like secretly working Steve would indeed offer some level of protection. It would mean that the lack of people with relevant intuitions is not as bad as it appears (and when allocating resources, this possibility would indeed point to less resources for VATA). But I think that what is really needed is at least some people doing VATA with a clear risk mitigation focus, and discussing their findings with each other. This does not appear to exist.
Regarding other risks, and the issue that findings might be ignored:
A VATA research project would not help with misalignment. In other words: even if the field of VATA was somehow completely solved tomorrow, AI could still lead to extinction. So the proposed research project is definitely not dealing with all risks. The point of the post is that the field of VATA is basically empty. I don’t know of anyone that is doing VATA full time with a clear risk mitigation focus. And I don’t know if you personally should switch to focusing on VATA. It would not surprise me at all if some other project is a better use of your time. It just seems like there should exist some form of VATA research project with a clear risk mitigation focus.
It is also possible that a VATA finding will be completely ignored (by leading labs, or by governments, or by someone else). It is possible that a Sovereign AI will be launched, leading to catastrophe, even though it has a known flaw (because the people launching it are simply refusing to listen). But finding a flaw at least means that it is possible to avert catastrophe.
PS:
Thanks for the links! I will look into this. (I think that there are many fields of research that are relevant to VATA. It’s just that one has to be careful. A concept can behave very differently when it is transferred to the AI context)
Adding more resources:
[Clearer Thinking with Spencer Greenberg] Aligning society with our deepest values and sources of meaning (with Joe Edelman) #clearerThinkingWithSpencerGreenberg https://podcastaddict.com/clearer-thinking-with-spencer-greenberg/episode/175816099
https://podcastaddict.com/joe-carlsmith-audio/episode/157573278
Relevant quote from Zvi: https://www.lesswrong.com/posts/FeqY7NWcFMn8haWCR/ai-83-the-mask-comes-off
thanks Chi!!
I believe we should not create a Sovereign AI. Developing a goal-directed agent of this kind will always be too dangerous. Instead, we should aim for a scenario similar to CERN, where powerful AI systems are used for research in secure labs, but not deployed in the economy.
I don’t want AIs to takeover.
Let’s reason from the assumption that you are completely right. Specifically, let’s assume that every possible Sovereign AI Project (SAIP) would make things worse in expectation. And let’s assume that there exists a feasible Better Long Term Solution (BLTS).
In this scenario ATA would still only be a useful tool for reducing the probability of one subset of SAIPs (even if all SAIPs are bad, some designers might be unresponsive to arguments, some flaws might not be realistically findable, etc). But it seems to me that ATA would be one complementary tool for reducing the overall probability of SAIP. And this tool would not be easy to replace with other methods. ATA could convince the designers of a specific SAIP that their particular project should be abandoned. If ATA results in the description of necessary features, then it might even help a (member of a) design team see that it would be bad if a secret project were to successfully hit a completely novel, unpublished alignment target (for example along the lines of this necessary Membrane formalism feature).
ATA would also be a project where people can collaborate despite almost opposite viewpoints on the desirability of SAIP. Consider Bob, who mostly just wants to get some SAIP implemented as fast as possible. But Bob still recognizes the unlikely possibility of dangerous alignment targets with hidden flaws (but he does not think that this risk is anywhere near large enough to justify waiting to launch a SAIP). You and Bob clearly have very different viewpoints regarding how the world should deal with AI. But there is actually nothing preventing you and Bob from cooperating on a risk reduction focused ATA project.
This type of diversity of perspectives might actually be very productive for such a project. You are not trying to build a bridge on a deadline. You are not trying to win an election. You do not have to be on the same page to get things done. You are trying to make novel conceptual progress, looking for a flaw of an unknown type.
Basically: reducing the probability of outcomes along the lines of the outcome implied by PCEV is useful according to a wide range of viewpoints regarding how the world should deal with AI. (there is nothing unusual about this general state of affairs. Consider for example Dave and Gregg who are on opposite sides of a vicious political trench war over the issue of pandemic lockdowns. There is nothing on the object level that prevents them from collaborating on a vaccine research effort. So this feature is certainly not unique. But I still wanted to highlight the fact that a risk mitigation focused ATA project does have this feature)
Fair enough.
I think my main problem with this proposal is that under the current paradigm of AIs (GPTs, foundation models), I don’t see how you want to implement ATA, and this isn’t really a priority?
Your comment makes me think that I might have been unclear regarding what I mean with ATA. The text below is an attempt to clarify.
Summary
Not all paths to powerful autonomous AI go through methods from the current paradigm. It seems difficult to rule out the possibility that a Sovereign AI will eventually be successfully aligned to some specific alignment target. At current levels of progress on ATA this would be very dangerous (because understanding an alignment target properly is difficult, and a seemingly-nice proposal can imply a very bad outcome). It is difficult to predict how long it would take to reach the level of understanding needed to prevent scenarios where a project successfully hits a bad alignment target. And there might not be a lot of time to do ATA later (for example because a tool-AI shuts down all unauthorised AI projects but does not buy a lot of time, due to internal time pressure). So a research effort should start now.
Therefore ATA is one of the current priorities. There are definitely very serious risks that ATA cannot help with (for example misaligned tool-AI projects resulting in extinction). There are also other important current priorities (such as preventing misuse). But ATA is one of the things that should be worked on now.
The next section outlines a few scenarios designed to clarify how I use the term ATA. The section after that outlines a scenario designed to show why I think that ATA work should start now.
What I mean by Alignment Target Analysis (ATA)
The basic idea with ATA is to try to figure out what would happen if a given AI project were to successfully align an autonomously acting AI Sovereign to a given alignment target. The way I use the term, there are very severe risks that cannot be reduced in any way, by any level of ATA progress (including some very serious misalignment and misuse risks). But there are also risks that can and should be reduced by doing ATA now. There might not be a lot of time to do ATA later, and it is not clear how long it will take to advance to the level of understanding that will be needed. So ATA should be happening now. But let’s start by clarifying the term ATA, by outlining a couple of dangerous AI projects where ATA would have nothing to say.
Consider Bill, who plans to use methods from the current paradigm to build a tool-AI. Bill plans to use this tool AI to shut down competing AI projects and then decide what to do next. ATA has nothing at all to say about this situation. Let’s say that Bill’s project plan would lead to a powerful misaligned AI that would cause extinction. No level of ATA progress would reduce this risk.
Consider Bob who also wants to build a tool-AI. But Bob’s AI would work. If the project would go ahead, then Bob would gain a lot of power. And Bob would use that power to do some very bad things. ATA has nothing to say about this project and ATA cannot help reduce this risk.
Now let’s introduce an unusual ATA scenario, just barely within the limits of what ATA can be used for (the next section will give an example of the types of scenarios that make me think that ATA should be done now; this scenario is meant to clarify what I mean by ATA). Consider Dave who wants to use methods from the current paradigm to implement PCEV. If the project plan moves forwards, then the actual result would be a powerful misaligned AI: Dave’s Misaligned AI (DMAI). DMAI would not care at all what Dave is trying to do, and would cause extinction (for reasons that are unrelated to what Dave was aiming at). One way to reduce the extinction risk from DMAI would be to tell Dave that his plan would lead to DMAI. But it would also be valid to let Dave know that if his project were to successfully hit the alignment target that he is aiming for, then the outcome would be massively worse than extinction.
Dave assumes that he might succeed. So, when arguing against Dave’s project, it is entirely reasonable to argue from the assumption that Dave’s project will lead to PCEV. Pointing out that success would be extremely bad is a valid argument against Dave’s plan, even if success is not actually possible.
You can argue against Dave’s project by pointing out that the project will in fact fail. Or by pointing out that success would be very bad. Both of these strategies can be used to reduce the risk of extinction. And both strategies are cooperative (if Dave is a well meaning and reasonable person, then he would thank you for pointing out either of these aspects of his plan). While both strategies can prevent extinction in a fully cooperative way, they are also different in important ways. It might be the case that only one of these arguments is realistically findable in time. It might for example be the case that Dave is only willing to publish one part of his plan (meaning that there might not be sufficient public information to construct an argument about the other part of the plan). And even if valid arguments of both types are constructed in time, it might still be the case that Dave will only accept one of these arguments. (similar considerations are also relevant for less cooperative situations. For example if one is trying to convince a government to shut down Dave’s project. Or if one is trying to convince an electorate to vote no on a referendum that Dave needs to win in order to get permission to move forwards)
The audience in question (Dave, bureaucrats, voters, etc) are only considering the plan because they believe that it might result in PCEV. Therefore it is entirely valid to reason from the assumption that Dave’s plan will result in PCEV (when one is arguing against the plan). There is no logical reason why such an argument would interfere with attempts to argue that Dave’s plan would in fact result in DMAI.
Now let’s use an analogy from the 2004 CEV document to clarify what role I see an ATA project playing. In this analogy, building an AI Sovereign is analogous to taking power in a political revolution. So (in the analogy) Dave proposes a political revolution. One way a revolution can end in disaster is that the revolution leads to a destructive civil war that the revolutionaries loose (analogous to DMAI causing extinction). Another way a revolution can end in disaster is that ISIS takes power after the government is overthrown (analogous to the outcome implied by PCEV).
It is entirely valid to say to Dave: “if you actually do manage to overthrow the government, then ISIS will seize power” (assuming that this conditional is true). One can do this regardless of whether or not one thinks that Dave has any real chance of overthrowing the government. (Which in turn means that one can actually say this to Dave, without spending a lot of time trying to determine the probability that the revolution will in fact overthrow the government. Which in turn means that people with wildly different views on how difficult it is to overthrow the government can cooperate while formulating such an argument)
(this argument can be made separately from an argument along the lines of: “our far larger neighbour has a huge army and would never allow the government of our country to be overthrown. Your revolution will fail even if every single soldier in our country joins you instantly. Entirely separately: the army of our country is in fact fiercely loyal to the government and you don’t have enough weapons to defeat it. In addition to these two points: you are clearly bad at strategic thinking and would be outmanoeuvred in a civil war by any semi-competent opponent”. This line of argument can also prevent a hopeless civil war. The two arguments can be made separately and there is no logical reason for them to interfere with each other)
Analysing revolutionary movements in terms of what success would mean can only help in some scenarios. It requires a non-vague description of what should happen after the government falls. In general: this type of analysis cannot reduce the probability of lost civil wars in cases where the post revolutionary strategy is either (i): too vaguely described to analyse, or (ii): actually sound (meaning that the only problem with the revolution in question is that it has no chance of success). Conversely however: arguments based on revolutions failing to overthrow the government cannot prevent revolutions that would actually end with ISIS in charge (analogous to AI projects that would successfully hit a bad alignment target). Scenarios that end in a bad alignment target getting successfully hit are the main reason that I think that ATA should happen now (in the analogy, the main point would be to reduce the probability of ISIS gaining power). Now let’s leave the revolution analogy and outline one such scenario.
A tool-AI capable of shutting down all unauthorised AI projects might not buy a lot of time
It is difficult to predict who might end up controlling a tool-AI. But one obvious compromise would be to put it under the control of some group of voters (for example a global electorate). Let’s say that the tool-AI is designed such that one needs a two-thirds majority in a referendum to be allowed to launch a Sovereign AI. There exists a Sovereign AI proposal that a large majority thinks sounds nice. A small minority would however prefer a different proposal.
In order to prevent inadvertent manipulation risks, the tool-AI was designed to only discuss topics that are absolutely necessary for the process of shutting down unauthorised AI projects. Someone figures out how to make the tool-AI explain how to implement Sovereign AI proposals (and the Explanation / Manipulation related definitions happen to hold for such discussions). But no one figures out how to get it to discuss any topic along the lines of ATA. The original plan was to take an extended period of time to work on ATA before implementing a Sovereign AI.
Both alignment targets use the same method for extrapolating people and for resolving disagreements. The difference is in terms of who is part of the initial group. The two proposals have different rules with respect to things like: animals, people in cryo, foetuses, artificial minds, etc. It doesn’t actually matter which proposal gets implemented: the aggregation method leads to the same horrific outcome in both cases (due to an issue along the lines of the one that PCEV suffers from, but more subtle and difficult to notice). (All proposed alignment targets along the lines of “build an AI Sovereign that would do whatever some specific individual wants it to do” are rejected out of hand by almost everyone.)
In order to avoid making the present post political, let’s say that political debates center around what to do with ecosystems. One side cares about nature and wants to protect ecosystems. The other side wants to prevent animal suffering (even if the cost of such prevention is the total destruction of every ecosystem on earth). It is widely assumed that including animals in the original group will lead to an outcome where animal suffering is prevented at the expense of ecosystems. (In order to make the following scenario more intuitive, readers who have an opinion regarding what should be done with ecosystems can imagine that the majority shares this opinion.)
The majority has enough support to launch their Sovereign AI. But the minority is rapidly and steadily gaining followers due to ordinary political dynamics (sometimes attitudes on a given issue change steadily in a predictable direction). So the ability to get the preferred alignment target implemented can disappear permanently at any moment (the exact number of people that would actually vote yes in a referendum is difficult to estimate, but it is clearly shrinking rapidly). In this case the majority might act before they lose the ability to act. Part of the majority would, however, hesitate if the flaw with the aggregation method is noticed in time.
After the tool-AI was implemented, a large number of people started to work on ATA. There are also AI assistants that contribute to conceptual progress (they are tolerated by the tool-AI because they are not smarter than humans, and they are useful because they contribute a set of unique non-human perspectives). However, it turns out that ATA progress works sort of like math progress. It can be sped up significantly by lots of people working on it in parallel. But the main determinant of progress is how long people have been working on it. In other words: it turns out that there is a limit to how much the underlying conceptual progress can be sped up by throwing large numbers of people at ATA. So the question of whether or not the issue with the Sovereign AI proposal is noticed in time is to a large degree determined by how long a serious ATA research project has been going on at the time that the tool-AI is launched (in other words: doing ATA now reduces the risk of a bad alignment target ending up getting successfully hit in this scenario).
(the idea is not that this exact scenario will play out as described. The point of this section was to give a detailed description of one specific scenario. For example: the world will presumably not actually be engulfed by debates about the Prime Directive from Star Trek. And a tool-AI controlled by a messy coalition of governments might lead to a time crunch due to dynamics that are more related to Realpolitik than any form of ideology. This specific scenario is just one example of a large set of similar scenarios)
PS:
On a common sense level I simply don’t see how one can think that it is safe to stay at our current level of ATA progress (where it is clearly not possible to reliably tell a good alignment target from an alignment target that implies an outcome massively worse than extinction). The fact that there exists no research project dedicated to improving this situation seems like a mistake. Intuitively this seems like a dangerous situation. At the very least it seems like some form of positive argument would be needed before concluding that this is safe. And it seems like such an argument should be published so that it can be checked for flaws before one starts acting based on the assumption that the current situation is safe. Please don’t hesitate to contact me with theories / questions / thoughts / observations / etc regarding what people actually believe about this.