This comment is trying to clarify what the post is about, and by extension which claims are being made. Clarifying terminology is an important part of this. Both the post and my research agenda are focused on the dangers of successfully hitting a bad alignment target. This is one specific subset of the existential threats that humanity faces from powerful AI. Let’s distinguish the danger being focused on from other types of dangers by looking at a thought experiment with an alignment target that is very obviously bad. A well intentioned designer named Bill, who shares your values, sincerely believes that it is a good idea to start an AI project aiming for a Suffering Maximising AI (SMAI). Bill’s project might result in an AI that wants to do something along the lines of building lots of tiny pictures of sad faces, due to a failure to get the AI to want to maximise a reasonable definition of Suffering. This type of scenario is not what I am focused on. No claim made in the post refers to any scenario along those lines. In this case, Bill’s project is described as having failed. In other words: the aimed-for alignment target was not hit. These types of dangers are thus out of scope of both the post and my research agenda. They are a very different type of danger compared to the dangers that come from scenarios where Bill succeeds, and actually hits the SMAI alignment target. Success would result in an AI that wants to maximise something that can reasonably be described as suffering (and little pictures of sad faces is not a reasonable way of defining suffering). It is also possible that Bill’s project results in an AI that wants to build a lot of molecular Squiggles, due to an issue that has nothing to do with any definition of Suffering (this is also a failure, and it is also out of scope).
Dangers along the lines of Squiggles and pictures of sad faces can be dealt with by improving alignment techniques, or by explaining to Bill that he does not currently know how to hit an alignment target. All strategies along those lines are out of scope, simply because nothing along those lines can ever be used to mitigate the dangers that come from successfully hitting a bad alignment target. Nothing along those lines can help in scenarios where it would be bad for the project to succeed (which is the case for Bill’s SMAI project). Showing Bill that he does not know how to hit an alignment target stops working when Bill learns how to hit an alignment target. And helping Bill improve his alignment techniques dramatically increases the dangers that come from successfully hitting a bad alignment target. The best way of dealing with such dangers is by showing Bill that it would be bad for his project to succeed. In other words: to show Bill that his favoured alignment target is a bad alignment target. My research effort, which the post is a part of, is focused on analysing alignment targets. The reason that such analysis is needed is that not all bad alignment targets are as obviously bad as SMAI. The post can be used to illustrate the fact that alignment target analysis can be genuinely unintuitive.
In other words: the purpose of the proposed research effort is not to help Bill successfully implement SMAI. Or to help Bill find a good description of SMAI. Or to convince Bill to aim for a different version of SMAI. Instead, the purpose of the proposed alignment target analysis research effort is to help Bill see that he is aiming for a bad alignment target, so that he will choose to abandon his SMAI project.
One could add a safety measure to Bill’s SMAI project that might give Bill a chance to try again in case of a bad outcome. This can reduce many different types of dangers, including dangers from successfully hitting a bad alignment target (less formally: one can add a do-over button). But since no such safety measure is guaranteed to work, no such measure can make it ok for Bill to start his SMAI project. And, crucially, no such safety measure can be relevant to the question of whether or not SMAI is a bad alignment target. Thus, analysing such safety measures does not count as analysing alignment targets. Since they are not relevant to the proposed research effort, analysing such safety measures is also out of scope. It would not be particularly surprising if analysing such safety measures turns out to be more important than analysing alignment targets. But this does not change the fact that proposed safety measures are irrelevant when analysing alignment targets. In other words: they are a separate, and complementary, method of reducing the type of danger that I am focused on. (less formally: a do-over button might save everyone from an outcome massively worse than extinction. So such measures are definitely not useless in general. They are just useless for the specific task of analysing alignment targets. And since they can never be guaranteed to work, they can never replace alignment target analysis. A do-over button can also be added to many different proposed alignment targets, which means that such buttons are not very helpful when one is comparing two alignment targets)
I don’t use the term alignment target for proposals along the lines of a Pivotal Act AI (PAAI), or any other AI whose function is similar to the typical functions of proposed treaties (such as buying time). So, CEV is an alignment target, but no proposal along the lines of a PAAI is referred to as an alignment target. To get a bit more concrete regarding what counts as an alignment target in my terminology, let’s say that a set of designers have an AI proposal called AIX. If AIX is the type of AI that implies a decision regarding which goal to give to a successor AI, then AIX counts as an alignment target. The act of building AIX implies a choice of alignment target. So, to build AIX, one must be able to analyse alignment targets.
If, on the other hand, the designers do not plan for AIX to play a role in the decision process regarding what goal to give to a successor AI, then AIX does not count as an alignment target. In this latter case, building AIX might be possible without being able to analyse alignment targets (and building AIX at a point in time when one does not know how to analyse alignment targets might be a good idea). But the need to analyse alignment targets would remain, even if this latter type of project is successful.
If AIX is an alignment target, then adding something along the lines of a last judge off switch to AIX does not change the fact that AIX is an alignment target. An AI project might add such a feature specifically because the designers know that the aimed-for alignment target might be bad. In this case, the project is still described as aiming for a bad alignment target. Such designers are not certain about the alignment target in question, and they have taken precautions in case it turns out to be a bad alignment target. So, it might not be possible to discover anything that would be very surprising to the designers. But it is still possible to discover that the aimed-for alignment target is bad. This might still help a group of not-very-surprised designers reach the not-very-surprising conclusion that they should abandon the project (since this type of add-on might fail, no such add-on can ever make it reasonable to start a project that is aiming for a bad alignment target. In other words: it was wrong of these designers to start this project. But in this scenario, the designers started the project while thinking that the alignment target might be good. So, there is nothing in this scenario that strongly indicates that it will be impossible to reason with them, in case one discovers that their alignment target is bad).
A PAAI project might be one part of a larger project plan that at a later stage calls for some other AI project, which will be trying to hit an agreed-upon alignment target. This is similar to how a treaty might be part of such a larger project plan. This larger project counts as aiming for an alignment target. But the alignment target is solely determined by the details of the second AI project. The details of any treaty / PAAI / etc are, in my terminology, irrelevant to the question: what alignment target is the larger project aiming for?
Dangers related to successfully hitting a bad alignment target are a different class of dangers compared to dangers related to a project that fails to hit a given alignment target. Preventing dangers related to successfully hitting a bad alignment target requires a dedicated research effort, because some important measures that can reduce this risk do not reduce other AI related risks (and are thus unlikely to be found without a dedicated effort). Specifically, one way to reduce this specific risk is to analyse alignment targets with a view to finding some set of features that are necessary for safety. I propose to take the perspective of one individual who is not given any special treatment. Then one can look for a set of necessary features that the alignment target of a given project must have, for it to be rational for such an individual to support this AI project. This should be done while making as few assumptions as possible regarding this individual. Such a set would help someone who is trying to construct a new alignment target. Since these features are all of the necessary-but-not-sufficient type, the fact that a project has them all is not enough to convince anyone to support the project. But they can help an individual (for whom none of the assumptions made are false) decide to oppose a project that is aiming for an alignment target that lacks one of the features in such a set. The post describes one such necessary feature: that each individual is given meaningful influence regarding the adoption of those preferences that refer to her. In many cases, it will be unclear whether or not some proposal has this feature. But in some cases, it will be clear that a given proposal does not have this feature. And since it is a necessary but not sufficient feature, it is these clear negatives that are the most useful. It is, for example, clear that no version of CEV can have this feature. This feature is incompatible with the core concept of building an AI that is in any way describable as: implementing the Coherent Extrapolated Volition of Humanity. In other words: in order to build an AI that has this feature, one has to choose some other alignment target. This seems to be a somewhat controversial feature. Possibly because of this incompatibility.
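To make the necessary-but-not-sufficient logic above a bit more concrete, here is a minimal sketch (the function name, the feature string, and the verdict labels are my own hypothetical illustrations, not anything from the post): a clear negative on any necessary feature supports opposing a project, while passing every known necessary feature never supports endorsement on its own.

```python
# Minimal illustrative sketch (hypothetical names; not from the post).
# Necessary-but-not-sufficient features: only clear negatives are decisive.

NECESSARY_FEATURES = [
    "each individual has meaningful influence over the adoption of preferences that refer to her",
    # ... further features that future alignment target analysis might establish as necessary
]

def evaluate_alignment_target(feature_status):
    """feature_status maps each necessary feature to True (clearly present),
    False (clearly absent), or None (unclear)."""
    for feature in NECESSARY_FEATURES:
        if feature_status.get(feature) is False:
            # A clear negative on any necessary feature is decisive: oppose the project.
            return "oppose: lacks a necessary feature"
    # Having every known necessary feature is never sufficient grounds for support.
    return "no verdict from necessary features alone"

# Example usage (hypothetical): a proposal that clearly lacks the influence feature.
print(evaluate_alignment_target({
    "each individual has meaningful influence over the adoption of preferences that refer to her": False,
}))
```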
Such features are central to the proposed research effort. And the feature mentioned is the most important part of the post. But, in order to avoid controversies and distractions, this comment will not involve any further discussion of CEV. I will instead make the distinctions that I need to make, by relying exclusively on examples with other alignment targets (whose status as bad alignment targets is not being disputed).
A simple, well understood, and uncontroversial example of a bad alignment target is a Suffering Reducing AI (SRAI). SRAI cares about what happens to humans, and will not treat humans in an uncaring, strategic way. For example: SRAI would not accept an offer where the solar system is left alone in exchange for resources (with the only intervention being to prevent any other AI from being built). This rejection is not related to any specific detail, or any specific definition. Any reasonable definition of suffering leads to an AI that will reject such an offer, because killing all humans is inherent in the core concept of reducing suffering. For any reasonable definition of Suffering, accepting the offer would lead to the continuation of a lot of Suffering. Unless there exists some way of using the offered resources to reduce some other, larger, source of Suffering, the offer will be rejected (for the rest of this text, it will be assumed that no such opportunity exists).
Regardless of which reasonable set of definitions is used, a successful SRAI project simply leads to an AI that rejects all such offers (and that kills everyone). In other words, if a SRAI project results in an AI that rejects the offer and kills everyone, then this behaviour is very consistent with the project having succeeded. The core issue with any SRAI project is that the project would be aiming for a bad alignment target. The path towards a successfully implemented SRAI project is simply not relevant to the question of whether or not this is a bad alignment target. If some treaty, or a Pivotal Act AI, or some other plan for buying time, or some other clever trick, was involved at some point, then this has no impact whatsoever on the question: is SRAI a bad alignment target? Similarly: if the path to a successful SRAI implementation involved a last judge off switch, which successfully prevented an issue that would have led to an AI that would have been glad to accept the offer mentioned above (because that AI would have been able to use the offered resources to build lots of molecular Squiggles), then this is equally irrelevant to the question: is SRAI a bad alignment target? If some other type of do-over button happens to continue to work until the project is abandoned, then this is equally irrelevant to the question: is SRAI a bad alignment target? More generally: no try-again safety measure can be relevant to this question. And no other detail of the larger project plan can be relevant to this question. (it is possible that a failed SRAI project will result in some AI that rejects the offer. Only one positive claim is made: if the resulting AI accepts the offer, then the project did not successfully hit the aimed-for alignment target. Or, in other words: if the project does successfully hit the aimed-for alignment target, then the resulting AI will reject the offer)
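The parenthetical above is just one claim stated in both directions of the contrapositive. As a minimal restatement, using made-up shorthand propositions (the labels Hit and Accept are mine, not from the post):

```latex
% Shorthand (made up for this restatement):
%   Hit    = "the project successfully hit the aimed-for SRAI alignment target"
%   Accept = "the resulting AI accepts the leave-the-solar-system-alone offer"
\[
  (\mathrm{Hit} \Rightarrow \neg\,\mathrm{Accept})
  \;\Longleftrightarrow\;
  (\mathrm{Accept} \Rightarrow \neg\,\mathrm{Hit})
\]
```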
When saying that SRAI is a bad alignment target, there is no intended implication that an AI project aiming for the SRAI alignment target will lead to a bad outcome. It could be that some set of safety measures are implemented, and these could hold up until the designers see that the aimed-for alignment target is a bad alignment target. Such a project is a bad idea. It is an unnecessary risk. And if it successfully hits the alignment target that it is aiming for, then it will get everyone killed. Success of this project leads to extinction (regardless of the path to success, and regardless of the details of the definitions). But there is no implied claim that the act of starting such a project would, in fact, lead to a bad outcome.
It is possible to use other definitions for things such as: ``alignment target″, and ``successfully hitting an alignment target″, and ``the project succeeding″. But the way that these phrases are being defined here, by using the various examples throughout this comment, makes it possible to refer to important concepts in a clear way. The words used to point to these concepts could be changed. But it must be possible to refer to the concepts themselves, because they are needed in order to make important distinctions.
Now, let’s return to the SMAI alignment target mentioned at the beginning of this comment. Unless something very strange happens, the outcome of Bill’s SMAI project would accept the offer mentioned above (this is probably a good deal for the resulting AI, whether or not the project hits the aimed-for alignment target. The offered resources can for example be used to create: little pictures of sad faces, or Suffering, or Squiggles). Thus, such an offer cannot be used to distinguish different types of scenarios from each other in the SMAI case. For a SMAI project, it is more informative to ask what it would mean, from a human perspective, to give the resulting AI resources. In the SMAI case, this question is better for illustrating the importance of distinguishing between different types of AI related dangers.
So, consider the case where the outcome of this SMAI project is fully contained within some lifeless region. Now, let’s ask what it would mean to give it more resources. Let’s say that the result of a SMAI project wants to build lots of little pictures of sad faces (due to a failure to get the AI to want to maximise a reasonable definition of Suffering), or little molecular Squiggles (due to an implementation failure that is unrelated to the definition of Suffering). In these two cases, it seems like it does not matter much whether resources are destroyed, or given to this AI (assuming that it is fully contained). Dangers related to outcomes along these lines are out of scope, as they are the results of failed projects. If a SMAI project succeeds, however, then it would be far better to destroy those resources than for SMAI to have them. If a SMAI project successfully hits the SMAI alignment target, the details of the definitions would also matter. Possibly a great deal. Successfully hitting the SMAI alignment target can lead to a very wide range of outcomes, depending on the details of the definitions. Different sets of reasonable definitions lead to outcomes that are different in important ways (the SMAI project being successful means that the resulting AI will want to maximise some reasonable definition of suffering. Thus, the outcome of a successful SMAI project will be a bad outcome. But different definitions still lead to importantly different outcomes). The dangers involved with successfully hitting the SMAI alignment target are importantly different from the case where the project fails to hit an alignment target. One very important distinction is that each scenario in this very large set of very diverse types of scenarios is best dealt with by helping Bill see that he is aiming for a bad alignment target.
(it is possible that a failed SMAI project will result in an AI such that preventing it from having resources is important (even under a strong containment assumption). Only one positive claim is made: if humans are essentially indifferent to giving the resulting AI resources, then the project did not successfully hit the aimed-for alignment target. Or, in other words: if the project does successfully hit the aimed-for alignment target, then it is important to keep resources away from the resulting AI)
Shifting resources from the result of one successful SMAI project to the result of another successful SMAI project might be very valuable. But issues along those lines are also out of scope of the proposed research effort of analysing alignment targets. In other words: the point of talking to Bill is not to help Bill find a better description of SMAI, but instead to help Bill see that he should aim for a different alignment target. These two objectives are very different. Just as it would be a bad idea to help Bill hit the SMAI alignment target, it would also be a bad idea to help Bill describe SMAI. And while different versions of SMAI might lead to importantly different outcomes, the purpose of the proposed research project is not to switch from one version of SMAI to another version of SMAI. The purpose of the proposed research effort is instead to help Bill see that SMAI is a bad alignment target.
The reason that alignment target analysis needs a dedicated effort is that the best way to stop the two projects mentioned above is by pointing out that it would be a bad thing for them to succeed. If it is possible to stop a project in this way, then this is a better option than to implement safety measures that might (or might not) hold up until the project is abandoned. For SRAI and SMAI, the needed insights already exist. So, to stop these two projects, there is no need for further analysis. But if I take an example where there is no consensus that the alignment target in question is in fact bad, then the distinctions that I am trying to make in this comment will get lost in arguments about whether or not the alignment target in question is in fact bad. So I will stick with SRAI and SMAI.
In other words: to deal with bad alignment targets that have not yet been discovered to be bad, a dedicated effort is needed. Efforts that deal with other types of dangers do not need the types of insights that are needed to analyse alignment targets, and such efforts are thus not likely to lead to such insights. It is not known how long it would take a serious, dedicated research effort to advance to the point where the needed insights become possible to see. Partly, this is because it is not known which nice-sounding-but-actually-bad alignment targets will be proposed in the future. We also don’t know how long there will be to work on this issue. Even if we reason from the assumption that some alignment target will get successfully hit, there would still be a lot of remaining uncertainty. We would still not know how much time there will be to find the needed insights. Since we don’t know how much time it would take, or how much time there will be, there is no particular reason to think that the effort will be completed in time. Combined with the stakes involved, this implies urgency. Despite this urgency, there currently exists no serious research project dedicated to alignment target analysis. One positive thing that happens to be true is that in scenarios where alignment target analysis is needed, there will probably be time to do such analysis. There is no particular reason to think that there will be enough time. But it seems very likely that there will be a significant amount of time. Another positive thing that happens to be true is that one can prevent these types of dangers without arriving at any actual answer (the dangers from a given bad alignment target can for example be mitigated by noticing that it is a bad alignment target. This can be done without having any example of a good alignment target). Yet another positive thing that also happens to be true is that if the needed insights are in fact generated in time, then these insights will probably only have to be explained to the types of people that are capable of hitting an alignment target.
If one views the situation from a higher level of abstraction, then the lack of a dedicated research effort is even more strange. It obviously matters what alignment target an AI project is aiming for. One way to phrase this is that it simply cannot be the case that the aimed-for alignment target is both irrelevant to the outcome, and simultaneously supposed to reduce division and unify people towards a common goal. (if an alignment target has been set in stone, then one might be able to reasonably argue that detailed descriptions of this alignment target would not be particularly valuable to a given project. But such arguments are simply not applicable to suggestions that a proposed project should aim for a different alignment target)
If an AI project is aiming at a bad alignment target, then this cannot be fixed by adding some set of safety measures to this project. And a set of safety measures is simply not relevant when one tries to determine whether or not the aimed-for alignment target is a bad alignment target. And since no safety measure is guaranteed to work, such measures can never make it ok to launch an AI project that is aiming at a bad alignment target. A very careful SRAI project, implementing a set of very well constructed safety measures, might obviously still lead to SRAI. It is still putting everyone in unnecessary danger. One implication of this is that if someone argues that a SRAI project is aiming for a bad alignment target, then this argument cannot be countered by pointing at safety measures. It is important to emphasise that it is completely irrelevant what this set of safety measures is. Such a counterargument can always be safely dismissed out of hand, without having any idea of which specific type of do-over button is being referred to. Such safety measures simply cannot change the fact that SRAI is a bad alignment target. (if such a do-over button is added to a SRAI project, and this do-over button happens to work, then it can save everyone from getting killed by SRAI. So it is definitely not useless in general. It is just useless for the specific task of analysing alignment targets. In other words: it can stop everyone from getting killed. But it cannot be used to build a counterargument when someone points out that SRAI is a bad alignment target. In yet other words: it is only the counterargument that can be safely dismissed. Some specific safety measure that such a counterargument is built around might turn out to be more important than the entire field of alignment target analysis. So dismissing such a safety measure, based on the fact that it is being used as part of an invalid counterargument, is definitely not something that can be safely done)
So far, we have talked about safety measures that might allow a designer to try again. Let’s now turn our attention to what might perhaps be roughly described as a ``goal softener″ that could be added to an alignment target. In other words: a type of add-on that will modify the behaviour implied by an alignment target (not an add-on that will allow a designer to abandon an alignment target if it implies bad behaviour). Let’s write sSRAI for a soft Suffering Reducing AI that will determine the actual outcome, and that respects the underlying intentions behind an effort to get it to (for example) ``avoid optimising too hard″. Specifically, sSRAI wants to reduce suffering, but sSRAI wants to do this in a way that respects the intentions behind the designers’ attempts to get it to do soft optimisation of the SRAI alignment target.
Let’s use an undramatic example scenario to make a few points that have wider applicability. Consider a successfully implemented sSRAI. sSRAI acts in a way that is genuinely in line with the underlying ideas and intentions behind an effort to get the AI to avoid optimising too hard. sSRAI wants to act in accordance with the spirit of the soft optimisation principles that the designers were trying to implement. In other words: this feature is not used as a way to buy time, or as a way to get a second chance in case of failure (less formally: it is not a speed bump, and it is not a do-over button). Specifically, sSRAI will always act in the world in a way that is consistent with being uncertain regarding how precisely Suffering should be defined. And sSRAI will avoid any action that the designers would find alarming or disconcerting in any way. And sSRAI will avoid any action that would be perceived as weird or dramatic. And sSRAI would like to avoid any rapid change taking place in society or culture. And sSRAI will avoid any action that would be seen as deceptive or manipulative.
Now, let’s explore what sSRAI might do in the world, to see if this has turned the SRAI alignment target into a good alignment target. sSRAI could, for example, offer a process of sSRAI-guided self discovery that causes people to settle on some new way of viewing life that is far more fulfilling. There is a wide range of views that people arrive at, but almost all include the idea that it is best to not have more than one child. Every year, asking sSRAI for such guidance becomes slightly more mainstream, and every year some fraction of the population asks for this. sSRAI also declines to create any set of circumstances that would dramatically alter human society. For example, sSRAI declines to create any set of circumstances that would make humans want to go on living indefinitely (as opposed to just wanting to live for a very, very long time). sSRAI also takes actions to increase comfort and happiness, and to reduce suffering. Basically, it takes careful actions that lead to people living very long and very happy lives (this does imply gradual but significant changes. But those societal changes are seen as acceptable, as they are the least dramatic changes that can achieve the objective of significantly reducing suffering in the short term).
At no point on this path is sSRAI deviating from the spirit of the implied instructions. At no point is sSRAI deviating from what it is supposed to do: which is to reduce suffering, while adhering to the underlying intentions of avoiding drastic / weird / etc actions, rapid changes, manipulations, etc, etc. At all points on this path, all of its plans are entirely in line with genuinely wanting to respect the underlying intention of soft optimisation. For example: the process of sSRAI-guided self discovery is not some trick to get people to want weird things. It is genuinely a process designed to lead to some viewpoint that in turn leads to less suffering (and it never results in any viewpoint that is a weird outlier in the set of common, pre-AI, viewpoints). The process also takes the other aspirations of a given individual into account, and leads to viewpoints that additionally do things like increasing happiness, or improving artistic output (since taking people’s wishes into account when possible is also implied by some of the underlying intentions that led the designers to try to get the AI to do soft optimisation. Specifically: intentions related to not having tunnel vision). And no one is ever manipulated into asking for it. It is simply offered to everyone. And then word spreads that it is great. For example: sSRAI did not design it with the design constraint that it must become dominant (including such a constraint when designing the process would have violated the designers’ underlying intentions. Specifically, they would see this as manipulation). But the process leads to outcomes that people will eventually want, once it stops being seen as weird to ask for it. The process also does not cause any dramatic changes to people’s lives or personalities.
Additionally, and completely in line with the underlying intentions of the safety measures, sSRAI declines to enable any form of dramatic augmentation procedure (explaining, completely honestly, to anyone who wants dramatic augmentation, that this would both lead to dramatically increased suffering and to rapid societal change). sSRAI will never decide to explain any concept or fact that would result in transformative change or increased suffering (or allow any AI project, or any augmentation project, that would lead to such concepts or facts being discovered). In general, sSRAI will prevent any AI project that would have dramatic impact on society or culture (again straightforwardly in line with the underlying intentions of the designers, because they intended to prevent drastic AI actions in general, not just drastic sSRAI actions). This overall behaviour is a genuinely reasonable way of interpreting: ``exert a pushing force towards the outcome implied by the aimed-for alignment target. But make sure that you always act in a way that I would describe as pushing softly″.
Even conditioned on a successful implementation of sSRAI, the specific scenario outlined above would obviously never happen (for the same reason that the specific chess moves that I can come up with are very unlikely to match the actual chess moves of a good chess player whose moves I am trying to predict). A more realistic scenario is that sSRAI would design some other path to a lifeless universe, one that is a lot more clever, a lot faster, a lot less obvious, and even more in line with underlying intentions. If necessary, this can be some path that simply cannot be explained to any human, or to any result of any of the augmentation procedures that sSRAI will decide to enable. A path that simply cannot be specified in any more detail than: ``I am trying to reduce suffering, for a wide range of definitions of this concept, while adhering to the underlying intentions behind the design team’s efforts to prevent me from optimising too hard″. SRAI is simply a bad alignment target. And there does not exist any trick that can change this fact, even in principle (even under an assumption of a successfully implemented, and genuinely clever, trick). If a last judge off switch happens to work, then the project mentioned above might be shut down by some extrapolated version of a last judge, with no harm done (in other words: there is no claim that a sSRAI project will lead to a lifeless universe. Just a claim that the project is a bad idea to try, and that it represents an unnecessary risk, because SRAI is a bad alignment target).
In other words: the core concept of reducing suffering implies killing everyone. And this remains true even if one does manage to get an AI to genuinely want to make sure that everyone stops living, using only slow, ethical, legal, non-deceptive, etc, etc, methods. Or to get the AI to use some incomprehensible strategy (that avoids sudden changes, or dramatic actions, or unpleasant aesthetics, or manipulation, etc, etc). SRAI is unsafe because the implied outcome is unsafe. More generally: SRAI is unsafe because SRAI cares about what happens to you. This is dangerous when it is combined with the fact that you never had any meaningful influence over the decision of which you-referring preferences SRAI should adopt. Similar conclusions hold for other bad alignment targets. And not all of the implied outcomes are this undramatic, or this easy to analyse. For example: sSMAI does not have a known, easy to analyse, endpoint that sSMAI is known to softly steer the future towards. But the danger implied by a soft push towards a bad future is not reduced by the fact that we are having trouble predicting the details. This makes it harder to describe the danger of a soft push in detail, but it does not reduce the danger.
We are now ready to illustrate the value of analysing alignment targets by noting the difference between a world organised by sSRAI and a world organised by sSMAI. Let’s consider an AI that genuinely wants to respect the designers’ underlying intentions that motivated them to try to get the AI to avoid optimising too hard. We can refer to it as soft AI, or sAI. Let’s assume that sAI does not do anything that causes human culture to stray too far away from current value systems and behaviours. It does not take drastic actions, or manipulate, or deceive, etc, etc. It only acts in ways that would be genuinely reasonable to view as ``pushing softly″. It still seems clearly better to have an AI that softly pushes to reduce suffering than to have an AI that softly pushes to maximise suffering. In yet other words: analysing alignment targets will continue to matter, even if one is reasoning from some very, very optimistic set of assumptions (the difference is presumably a lot less dramatic than the difference between SMAI and SRAI. But it still seems valuable to switch from sSMAI to sSRAI).
We can actually add even more optimistic assumptions. Let’s say that Gregg’s Method (GM) is completely guaranteed to turn SRAI into GMSRAI. GMSRAI will create a thriving and ever expanding human civilisation, with human individuals living as long lives as would be best for them. Adding GM to any other alignment target is guaranteed to have equally dramatic effects, in a direction that is somehow guaranteed to be positive. It seems very likely that even in the GMSRAI case, there will still be a remnant left of the push towards a lifeless universe. So, the value of analysing alignment targets would remain, even if we had Gregg’s Method (because a remnant of the SRAI push would still be preferable to a remnant of the SMAI push. So, finding a better alignment target would remain valuable, even if we counterfactually had someone like Gregg to help us. In other words: while finding / modifying someone into / building / etc, a Gregg would be great, this would not actually remove the need to analyse alignment targets). (and again: the reason that analysing alignment targets requires a dedicated research effort is that not all alignment targets are as easy to analyse and compare as SRAI and SMAI)
Before leaving the sSRAI thought experiment, it makes sense to make three things explicit. One is that ideas along the lines of soft optimisation have not been shown to be dead ends or bad ideas, because nothing along these lines could ever stop a bad alignment target from being bad, even in principle. Basically: unless the AI project is secretly trying to implement an AI that does whatever a single designer wants the AI to do (in which case the claimed alignment target is irrelevant. From the perspective of an individual without any special influence, such a project is equivalent to a ``random dictator AI″, regardless of the words used to describe the project), it will continue to matter that SRAI exerts a pushing force towards a lifeless universe. Either this push will impact the outcome, or the alignment target is an empty PR slogan (either it has an influence on the outcome, or it does not have an influence on the outcome). So, nothing said here is incompatible with some concept along the lines of soft maximisation turning out to be a genuinely valuable and important concept (because we knew that it would fail to solve the task of turning a bad alignment target into a good alignment target the second we knew what type of thing was supposed to perform what type of task).
Secondly, the fact that no one in the sSRAI scenario recognises that SRAI implies a push towards a lifeless universe makes things look strange (the same thing makes the various SMAI examples look even stranger). This is unavoidable when one constructs an example with an alignment target that is universally understood to be a bad alignment target (if the audience understands this, then the people in the thought experiment should also understand this). In other words: no sSRAI project poses any actual threat, because SRAI is already known to be a bad alignment target. The goal of analysing alignment targets is to neutralise threats from other alignment targets that are not currently known to be bad.
And finally, it makes sense to make one particular distinction explicit: the sSRAI thought experiment is not discussing a safety measure of the type that is designed to give a designer a chance to try again in case of a bad outcome (those are discussed in other parts of this comment). Thus, we have not shown anything at all regarding that version of the concept. In more straightforward, but a lot less exact, words: the sSRAI thought experiment is about the goal softener version of the concept, not the do-over button version. The issue with the goal softener is that it only dilutes the bad parts. And the issue with the do-over button is that it might fail. So even using both does not make it ok to start a project that aims at a bad alignment target. Goal softeners and do-over buttons cannot change a bad alignment target into a good alignment target. They can however interact with each other. For example: when implementing a last judge off switch, there is a strong need to limit the extrapolation distance, because extrapolation is not intuitive for humans. This need does not mix well with goal softeners, because goal softeners can make it harder to see the bad effects of a bad goal. Seeing the push towards a lifeless universe in the sSRAI case requires no extrapolation at all, because this push is already a universally known feature, since SRAI is already well understood. For example: everyone already knows exactly why SRAI is fundamentally different from Clippy. But when trying to see a less obvious problem, the benefit of modest extrapolation can be cancelled out by the obscuring effect of a well designed goal softener. Goal softeners can also interact with definitional issues in weird ways. The actions of SRAI are not sensitive to the definition of Suffering: all reasonable definitions lead to the same behaviour. But the actions of sSRAI might be very sensitive to the specific definition of Suffering used. Nothing discussed in this paragraph counts as analysing alignment targets, simply because no topic discussed in this paragraph can ever be relevant to any question along the lines of: is SRAI a bad alignment target? So everything in this paragraph is out of scope of the proposed research agenda.
Being able to discuss the suitability of an alignment target as a separate issue is important. Because, if an AI project succeeds, and hits the alignment target that the project is aiming for, then it matters a lot what alignment target the project was aiming for. It is important to be able to discuss this as a separate question, removed from questions of strategies for buying time, or safety measures that might give a project a chance to try again, or measures that might lead some AI to settle for an approximation of the outcome implied by the alignment target, etc, etc. This is why we need the concepts that all of these examples are being used to define (the words used to refer to these concepts can of course be changed. But the concepts themselves must be possible to refer to).
Let’s say that two well intentioned designers, named Dave and Bill, both share your values. Dave is proposing an AI project aiming for SRAI, and Bill is proposing an AI project aiming for SMAI. Both projects have a set of very clever safety measures. Both Dave and Bill say that if some specific attempt to describe SRAI / SMAI fails, and leads to a bad outcome, then their safety measures will keep everyone safe, and they will find a better description (and they are known to be honest. And they are known to share your definition of what counts as a bad outcome. But it is of course not certain that their safety measures will actually hold). If you can influence a decision regarding which of these two projects gets funded, then it seems very important to direct the funding away from Bill. Because the idea that clever safety measures make the alignment target irrelevant is just straightforwardly false. Even if something along the lines of soft optimisation is somehow guaranteed to work in a way that fully respects underlying intentions, it would still matter which of these two alignment targets is hit (even under the assumption of a non drastic path to a soft version of the outcome, sSRAI is still preferable to sSMAI).
A very careful sSRAI / sSMAI project can obviously lead to an implemented SRAI / SMAI that does not respect the underlying intentions of soft maximisation (for example because the resulting AI does not care at all about these intentions, due to some technical failure. Or because it only cares about something that turns out to have no significant impact on the outcome). The claim that this should be seen as a very surprising outcome of any real world project would be an extraordinary claim. It would require an extraordinarily solid argument. Even if such a solid-seeming argument were to be provided, the most likely scenario would still be that this argument is wrong in some way. The idea that an actual real world project plan is genuinely solid, in a way that makes a non soft implementation of the aimed-for alignment target genuinely unlikely, does not seem plausible (even with a solid-seeming argument). It seems a lot more likely that someone has constructed a solid-seeming argument on top of an unexamined implicit assumption. I don’t think that anyone who takes AI dangers seriously will dispute this. Conditioned on a project aiming for a bad alignment target, it is simply not possible for a careful person to rule out the scenario where the project leads to the outcome that is implied by the aimed-for alignment target. Simply assuming that some safety measure will actually work is just straightforwardly incompatible with taking AI risks seriously.
It is worth pointing out that this constitutes a fully separate argument in favour of preferring a genuinely careful sSRAI project to a genuinely careful sSMAI project (an argument that is fully distinct from arguments based on preferring an outcome determined by sSRAI to an outcome determined by sSMAI). If one acknowledges the value of switching from a genuinely careful sSMAI project to a genuinely careful sSRAI project, then it is difficult to deny the necessity of being able to analyse alignment targets.
From a higher level of abstraction, we can consider the scenario where two novel alignment targets are proposed. Neither one of them has any obvious flaws. This scenario takes place in the real world, and involves real people as decision makers. Thus, these people deciding to wait indefinitely is obviously not something that we can safely assume. In this scenario, it would be valuable if an alignment target analysis effort has advanced to the point where these two proposals can be meaningfully analysed and / or compared to each other (for example concluding that one is better than the other. Or concluding that both must be discarded, since they both lack at least one feature that was found to be necessary for safety while analysing other alignment targets). This remains valuable even if a separate research effort has designed multiple layers of genuinely valuable safety measures.
Conditioned on Bill’s SMAI project being inevitable, safety measures might be very useful (they might hold up until Bill finally discovers that he is wrong. Until Bill finally sees that SMAI is a bad alignment target. Until Bill finally realises that the problems are inherent in the core concept. That the problems are not related to the difficulty of finding a good definition of Suffering). But, regardless of what safety measures exist, Bill should not initiate an AI project that aims for the SMAI alignment target. This conclusion is completely independent of the specifics of the set of safety measures involved. In other words, it is important to separate the issue of safety measures from the question of whether or not a project is aiming for a bad alignment target. When analysing alignment targets, it makes sense to assume that this target will be successfully hit (in other words, it makes sense to assume that there will be nothing along the lines of molecular Squiggles, nothing along the lines of tiny pictures of sad faces, nothing along the lines of soft optimisation, nothing along the lines of a triggered do-over button, nothing along the lines of someone reconsidering the wisdom of aiming at this alignment target, etc, etc, etc, etc). Because if a project is launched aiming at a bad alignment target, then no person who takes AI risks seriously can dismiss the scenario where the outcome implied by this alignment target ends up getting fully implemented.
(when designing safety measures, it might make sense to imagine the mirror image of this. In other words, you could imagine that you are designing these measures for an unpreventable SMAI project. A project led by a very careful designer named Bill, who shares your values, and who is aiming for SMAI due to a sincere misunderstanding. And it might make sense to assume that Bill will be very slow to update regarding the suitability of SMAI as an alignment target. That Bill has severe tunnel vision, has used his very powerful mind to almost completely insulate himself from object level critique, and has closed down essentially every avenue that might force him to admit that he is wrong. And assume that when Bill sees a bad outcome that is prevented by some safety measure that you design, then this will be explained away by Bill as being the result of a failure to describe the SMAI alignment target. One can for example imagine that Bill is the result of an augmentation process that dramatically increased technical ability and persuasion ability, but also led to extreme tunnel vision and very entrenched views on questions of alignment targets. For example because persuasion ability and the ability to get things done are the result of a total and unquestioning adherence to a set of assumptions that are, on the whole, far superior to any set of assumptions used by any baseline human (but still not fully free from flaws, in the particular case of alignment target analysis). This line of reasoning is entirely about safety measure design principles. In other words: nothing in this parenthesis counts as analysing alignment targets. Thus, everything in this parenthesis is out of scope of the proposed research agenda)
This comment is trying to clarify what the post is about, and by extension clarify which claims are made. Clarifying terminology is an important part of this. Both the post and my research agenda is focused on the dangers of successfully hitting a bad alignment target. This is one specific subset, of the existential threats that humanity face from powerful AI. Let’s distinguish the danger being focused on, from other types of dangers, by looking at a thought experiment, with an alignment target that is very obviously bad. A well intentioned designer named Bill, who shares your values, sincerely believes that it is a good idea to start an AI project, aiming for a Suffering Maximising AI (SMAI). Bill’s project might result in an AI that wants to do something along the lines of building lots of tiny pictures of sad faces, due to a failure to get the AI to want to maximise, a reasonable definition of Suffering. This type of scenario is not what I am focused on. No claim made in the post refer to any scenario along those lines. In this case, Bill’s project is described as having failed. In other words: the aimed for alignment target was not hit. These types of dangers are thus out of scope, of both the post, and my research agenda. It is a very different type of danger, compared to the dangers that come from scenarios where Bill succeeds, and successfully hits the SMAI alignment target. Success would result in an AI, that wants to maximise something, that can be reasonably described as suffering (and little pictures of sad faces, is not a reasonable way of defining suffering). It is also possible that Bill’s project results in an AI that wants to build a lot of molecular Squiggles, due to an issue that has nothing to do with any definition of Suffering (this is also a failure, and it is also out of scope).
Dangers along the lines of Squiggles and pictures of sad faces can be dealt with by improving alignment techniques. Or by explaining to Bill that he does not currently know how to hit an alignment target. All strategies along those lines are out of scope, simply because nothing along those lines, can ever be used to mitigate the dangers, that come from successfully hitting a bad alignment target. Nothing along those lines can help in scenarios, where it would be bad for the project to succeed (which is the case for Bill’s SMAI project). Showing Bill that he does not know how to hit an alignment target, stops working when Bill learns how to hit an alignment target. And helping Bill improve his alignment techniques, dramatically increases the dangers that comes from successfully hitting a bad alignment target. The best way of dealing with such dangers, is by showing Bill, that it would be bad, for his project to succeed. In other words: to show Bill, that his favoured alignment target, is a bad alignment target. My research effort, which the post is a part of, is focused on analysing alignment targets. The reason that such analysis is needed, is that not all bad alignment targets, are as obviously bad as SMAI. The post can be used to illustrate the fact, that alignment target analysis can be genuinely unintuitive.
In other words: the purpose of the proposed research effort, is not to help Bill successfully implement SMAI. Or to help Bill find a good description of SMAI. Or to convince Bill to aim for a different version of SMAI. Instead, the purpose of the proposed alignment target analysis research effort, is to help Bill see that he is aiming for a bad alignment target. So that he will choose to abandon his SMAI project.
One could add a safety measure to Bill’s SMAI project, that might give Bill a chance to try again, in case of a bad outcome. This can reduce many different types of dangers, including dangers from successfully hitting a bad alignment target (less formally: one can add a do over button). But since no such safety measure is guaranteed to work, no such measure can make it ok, for Bill to start his SMAI project. And, crucially, no such safety measure can be relevant to the question, of whether or not SMAI is a bad alignment target. Thus, analysing such safety measures, does not count as analysing alignment targets. Since they are not relevant to the proposed research effort, analysing such safety measures are also out of scope. It would not be particularly surprising, if analysing such safety measures turns out to be more important than analysing alignment targets. But this does not change the fact that proposed safety measures are irrelevant, when analysing alignment targets. In other words: they are a separate, and complementary, method of reducing the type of danger that I am focused on. (less formally: a do over button might save everyone from an outcome massively worse than extinction. So they are definitely not useless in general. They are just useless for the specific task of analysing alignment targets. And since they can never be guaranteed to work, they can never replace alignment target analysis. A do over button can also be added to many different proposed alignment targets. Which means that they are not very helpful when one is comparing two alignment targets)
I don’t use the term alignment target for proposals along the lines of a Pivotal Act AI (PAAI), or any other AI whose function is similar to the typical functions of proposed treaties (such as buying time). So, CEV is an alignment target, but no proposal along the lines of a PAAI, is referred to as an alignment target. To get a bit more concrete, regarding what counts as an alignment target in my terminology, let’s say that a set of designers have an AI proposal, called AIX. If AIX is the type of AI, that implies a decision, regarding which goal to give to a successor AI, then AIX counts as an alignment target. The act of building AIX, implies a choice of alignment target. So, to build AIX, one must be able to analyse alignment targets.
If, on the other hand, the designers do not plan for AIX to play a role in the decision process, regarding what goal to give to a successor AI, then AIX does not count as an alignment target. In this latter case, building AIX might be possible without being able to analyse alignment targets (and building AIX, at a point in time when one does not know how to analyse alignment targets, might be a good idea). But the need to analyse alignment targets would remain, even if this latter type of project is successful.
If AIX is an alignment target, then adding something along the lines of a last judge off switch, to AIX, does not change the fact that AIX is an alignment target. An AI project might add such a feature, specifically because the designers know, that the aimed for alignment target might be bad. In this case, the project is still described as aiming for a bad alignment target. Such designers are not certain about the alignment target in question. And they have taken precautions, in case it turns out to be a bad alignment target. So, it might not be possible to discover anything, that would be very surprising to the designers. But it is still possible to discover that the aimed for alignment target is bad. This might still help a group of not-very-surprised designers, to reach the not-very-surprising-conclusion, that they should abandon the project (since this type of add on might fail, no such add on can ever make it reasonable to start a project that is aiming for a bad alignment target. In other words: it was wrong of these designers, to start this project. But in this scenario, the designers started the project, while thinking that the alignment target might be good. So, there is nothing in this scenario that strongly indicates, that it will be impossible to reason with them, in case one discovers that their alignment target is bad).
A PAAI project might be one part of a larger project plan, that at a later stage calls for some other AI project, that will be trying to hit an agreed upon alignment target. This is similar to how a treaty might be part of such a larger project plan. This larger project counts as aiming for an alignment target. But the alignment target is solely determined by the details of second AI project. The details of any treaty / PAAI / etc is, in my terminology, irrelevant to the question: what alignment target is the larger project aiming for?
Dangers related to successfully hitting a bad alignment target, are a different class of dangers, compared to dangers related to a project, that fails to hit a given alignment target. Preventing dangers related to successfully hitting a bad alignment target, requires a dedicated research effort, because some important measures that can reduce this risk, do not reduce other AI related risks (and are thus unlikely to be found, without a dedicated effort). Specifically, one way to reduce this specific risk, is to analyse alignment targets, with a view to finding some set of features, that are necessary for safety. I propose to take the perspective of one individual who is not given any special treatment. Then one can look for a set of necessary features, that an alignment target of a given project must have, for it to be rational, for such an individual to support this AI project. This should be done, while making as few assumptions as possible regarding this individual. Such a set would help someone who is trying to construct a new alignment target. Since these features are all of the necessary-but-not-sufficient type, the fact that a project has them all, is not enough to convince anyone, to support the project. But they can help an individual (for whom none of the assumptions made, are false), decide to oppose a project, that is aiming for an alignment target, that lacks one of the features in such a set. The post describes one such necessary feature: that each individual, is given meaningful influence, regarding the adoption of those preferences, that refer to her. In many cases, it will be unclear whether or not some proposal has this feature. But in some cases, it will be clear, that a given proposal does not have this feature. And since it is a necessary but not sufficient feature, it is these clear negatives, that are the most useful. It is for example clear, that no version of CEV, can have this feature. This feature is incompatible with the core concept, of building an AI, that is in any way describable as: implementing the Coherent Extrapolated Volition of Humanity. In other words: in order to build an AI that has this feature, one has to choose some other alignment target. This seems to be a somewhat controversial feature. Possibly because of this incompatibility.
Such features are central to the proposed research effort. And the feature mentioned is the most important part of the post. But, in order to avoid controversies and distractions, this comment will not involve any further discussion of CEV. I will instead make the distinctions that I need to make, by relying exclusively on examples with other alignment targets (whose status as bad alignment targets is not being disputed).
A simple, well understood, and uncontroversial example of a bad alignment target, is a Suffering Reducing AI (SRAI). SRAI cares about what happens to humans, and will not treat humans in an uncaring, strategic way. For example: SRAI would not accept an offer where the solar system is left alone in exchange for resources (with the only intervention being to prevent any other AI from being built). This rejection is not related to any specific detail, or any specific definition. Any reasonable definition of suffering, leads to an AI, that will reject such an offer. Because killing all humans is inherent in the core concept, of reducing suffering. For any reasonable definition of Suffering, accepting the offer, would lead to the continuation of a lot of Suffering. Unless there exists some way of using the offered resources, to reduce some other, larger, source of Suffering, the offer will be rejected (for the rest of this text, it will be assumed that no such opportunity exists).
Regardless of which reasonable set of definitions is used, a successful SRAI project, simply leads to an AI that rejects all such offers (and that kills everyone). In other words, if a SRAI project results in an AI that rejects the offer and kills everyone, then this behaviour is very consistent, with the project having succeeded. The core issue with any SRAI project, is that the project would be aiming for a bad alignment target. The path towards a successfully implemented SRAI project, is simply not relevant to the question, of whether or not this is a bad alignment target. If some treaty, or a Pivotal Act AI, or some other plan for buying time, or some other clever trick, was involved at some point, then this has no impact whatsoever on the question: is SRAI a bad alignment target? Similarly: if the path to a successful SRAI implementation, involved a last judge off switch, which successfully prevented an issue that would have led to an AI, that would have been glad to accept the offer mentioned above (because that AI would have been able to use the offered resources, to build lots of molecular squiggles), then this is equally irrelevant to the question: is SRAI a bad alignment target? If some other type of do over button, happens to continue to work, until the project is abandoned, then this is equally irrelevant, to the question: is SRAI a bad alignment target? More generally: no try-again-safety-measure can be relevant to this question. And no other detail of the larger project plan can be relevant to this question. (it is possible that a failed SRAI project will result in some AI that rejects the offer. Only one positive claim is made: If the resulting AI accepts the offer, then the project did not successfully hit the aimed for alignment target. Or, in other words: if the project does successfully hit the aimed for alignment target, then the resulting AI will reject the offer)
When saying that SRAI is a bad alignment target, there is no intended implication, that an AI project aiming for the SRAI alignment target, will lead to a bad outcome. It could be that some set of safety measures is implemented. And these could hold up until the designers see that the aimed for alignment target is a bad alignment target. Such a project is a bad idea. It is an unnecessary risk. And if it successfully hits the alignment target that it is aiming for, then it will get everyone killed. Success of this project leads to extinction (regardless of path to success, and regardless of the details of the definitions). But there is no implied claim that the act of starting such a project would, in fact, lead to a bad outcome.
It is possible to use other definitions for things such as: ``alignment target″, and ``successfully hitting an alignment target″, and ``the project succeeding″. But the way that these phrases are being defined here, by using the various examples throughout this comment, makes it possible to refer to important concepts in a clear way. The words used to point to these concepts could be changed. But it must be possible to refer to the concepts themselves. Because they are needed in order to make important distinctions.
Now, let’s return to the SMAI alignment target, mentioned at the beginning of this comment. Unless something very strange happens, the AI resulting from Bill’s SMAI project would accept the offer mentioned above (this is probably a good deal for the resulting AI, whether or not the project hits the aimed for alignment target. The offered resources can for example be used to create: little pictures of sad faces, or Suffering, or Squiggles). Thus, such an offer can not be used to distinguish different types of scenarios from each other, in the SMAI case. For a SMAI project, it is more informative to ask what it would mean from a human perspective, to give the resulting AI resources. In the SMAI case, this question is better for illustrating the importance of distinguishing between different types of AI related dangers.
So, consider the case where the AI resulting from this SMAI project is fully contained within some lifeless region. Now, let’s ask what it would mean, to give it more resources. Let’s say that the result of a SMAI project, wants to build lots of little pictures of sad faces (due to a failure to get the AI to want to maximise a reasonable definition of Suffering), or little molecular Squiggles (due to an implementation failure, that is unrelated to the definition of Suffering). In these two cases, it seems like it does not matter much, if resources are destroyed, or given to this AI (assuming that it is fully contained). Dangers related to outcomes along these lines, are out of scope, as they are the results of failed projects. If a SMAI project succeeds however, then it would be far better to destroy those resources, than for SMAI to have those resources. If a SMAI project successfully hits the SMAI alignment target, the details of the definitions would also matter. Possibly a great deal. Successfully hitting the SMAI alignment target, can lead to a very wide range of outcomes, depending on the details of the definitions. Different sets of reasonable definitions, lead to outcomes, that are different in important ways (the SMAI project being successful, means that the resulting AI will want to maximise some reasonable definition of suffering. Thus, the outcome of a successful SMAI project, will be a bad outcome. But different definitions still lead to importantly different outcomes). The dangers involved with successfully hitting the SMAI alignment target, are importantly different from the case where the project fails to hit an alignment target. One very important distinction, is that every scenario in this very large set, of very diverse types of scenarios, is best dealt with, by helping Bill see that he is aiming for a bad alignment target.
(it is possible that a failed SMAI project will result in an AI, such that preventing it from having resources is important (even under a strong containment assumption). Only one positive claim is made: If humans are essentially indifferent to giving the resulting AI resources, then the project did not successfully hit the aimed for alignment target. Or, in other words: if the project does successfully hit the aimed for alignment target, then it is important to keep resources away from the resulting AI)
Shifting resources from the result of one successful SMAI project, to the result of another successful SMAI project, might be very valuable. But issues along those lines are also out of scope of the proposed research effort, of analysing alignment targets. In other words: the point of talking to Bill, is not to help Bill find a better description of SMAI. But instead to help Bill see, that he should aim for a different alignment target. These two objectives are very different. Just as it would be a bad idea to help Bill hit the SMAI alignment target, it would also be a bad idea to help Bill describe SMAI. And while different versions of SMAI might lead to importantly different outcomes, the purpose of the proposed research project, is not to switch from one version of SMAI, to another version of SMAI. The purpose of the proposed research effort, is instead to help Bill see, that SMAI is a bad alignment target.
The reason that analysing alignment targets needs a dedicated effort, is that the best way to stop the two projects mentioned above, is by pointing out, that it would be a bad thing for them to succeed. If it is possible to stop a project in this way, then this is a better option, than to implement safety measures, that might (or might not) hold up until the project is abandoned. For SRAI and SMAI, the needed insights already exist. So, to stop these two projects, there is no need for further analysis. But if I were to take an example where there is no consensus, that the alignment target in question is in fact bad, then the distinctions that I am trying to make in this comment would get lost, in arguments about whether or not the alignment target in question, is in fact bad. So I will stick with SRAI and SMAI.
In other words: To deal with bad alignment targets, that have not yet been discovered to be bad, dedicated effort is needed. Efforts that deal with other types of dangers, do not need the types of insights, that are needed to analyse alignment targets. And such efforts, are thus not likely to lead to such insights. It is not known how long it would take a serious, dedicated, research effort to advance to the point, where the needed insights become possible to see. Partly, this is because it is not known which nice-sounding-but-actually-bad alignment targets, will be proposed in the future. We also don’t know how much time there will be to work on this issue. Even if we reason from the assumption, that some alignment target will get successfully hit, there would still be a lot of remaining uncertainty. We would still not know how much time there will be, to find the needed insights. Since we don’t know how much time it would take, or how much time there will be, there is no particular reason to think that the effort will be completed in time. Combined with the stakes involved, this implies urgency. Despite this urgency, currently there exists no serious research project, that is dedicated to alignment target analysis. One positive thing that happens to be true, is that in scenarios where alignment target analysis is needed, there will probably be time to do such analysis. There is no particular reason to think that there will be enough time. But it seems very likely, that there will be a significant amount of time. Another positive thing, that happens to be true, is that one can prevent these types of dangers, without arriving at any actual answer (the dangers from a given bad alignment target, can for example be mitigated by noticing, that it is a bad alignment target. This can be done, without having any example of a good alignment target). Yet another positive thing, that also happens to be true, is that if the needed insights are in fact generated in time, then these insights will probably only have to be explained, to the types of people, that are capable of hitting an alignment target.
If one views the situation from a higher level of abstraction, then the lack of a dedicated research effort is even stranger. It obviously matters, what alignment target an AI project is aiming for. One way to phrase this, is that it simply cannot be the case, that the aimed for alignment target is both irrelevant to the outcome, and simultaneously supposed to reduce division, and unify people towards a common goal. (if an alignment target has been set in stone, then one might be able to reasonably argue that detailed descriptions of this alignment target would not be particularly valuable to a given project. But such arguments are simply not applicable, to suggestions that a proposed project should aim for a different alignment target)
If an AI project is aiming at a bad alignment target, then this can not be fixed by adding some set of safety measures to this project. And a set of safety measures, is simply not relevant, when one tries to determine, whether or not the aimed for alignment target, is a bad alignment target. And since no safety measure is guaranteed to work, such measures can never make it ok, to launch an AI project, that is aiming at a bad alignment target. A very careful SRAI project, that is implementing a set of very well constructed safety measures, can obviously still lead to SRAI. It is still putting everyone in unnecessary danger. One implication of this, is that if someone argues that an SRAI project is aiming for a bad alignment target, then this argument can not be countered, by pointing at safety measures. It is important to emphasise, that it is completely irrelevant, what this set of safety measures is. Such a counterargument can always be safely dismissed out of hand, without having any idea, of which specific type of do-over-button is being referred to. Such safety measures simply can not change the fact that SRAI is a bad alignment target. (if such a do-over-button is added to a SRAI project, and this do-over-button happens to work, then it can save everyone from getting killed by SRAI. So it is definitely not useless in general. It is just useless, for the specific task, of analysing alignment targets. In other words: it can stop everyone from getting killed. But it can not be used to build a counterargument, when someone points out that SRAI is a bad alignment target. In yet other words: it is only the counterargument that can be safely dismissed. Some specific safety measure, that such a counterargument is built around, might turn out to be more important than the entire field of alignment target analysis. So dismissing such a safety measure, based on the fact that it is being used as part of an invalid counterargument, is definitely not something that can be safely done)
So far, we have talked about safety measures, that might allow a designer to try again. Let’s now turn our attention to what might perhaps be roughly described as a ``goal softener″, that could be added to an alignment target. In other words: a type of add on that will modify the behaviour implied by an alignment target (not an add on, that will allow a designer to abandon an alignment target, if it implies bad behaviour). Let’s write sSRAI for a soft Suffering Reducing AI, that will determine the actual outcome, and that respects the underlying intentions, behind an effort to get it to (for example) ``avoid optimising too hard″. Specifically, sSRAI wants to reduce suffering, but sSRAI wants to do this, in a way that respects the intentions behind the designers’ attempts, to get it to do soft optimisation, of the SRAI alignment target.
Let’s use an undramatic example scenario, to make a few points, that have wider applicability. Consider a successfully implemented sSRAI. sSRAI acts in a way that is genuinely in line with the underlying ideas and intentions, behind an effort to get the AI to avoid optimising too hard. sSRAI wants to act in accordance with the spirit of the soft optimisation principles that the designers were trying to implement. In other words: this feature is not used as a way to buy time, or as a way to get a second chance in case of failure (less formally: It is not a speed bump. And it is not a do over button). Specifically, sSRAI will always act in the world, in a way that is consistent with being uncertain, regarding how precisely Suffering should be defined. And sSRAI will avoid any action, that the designers would find alarming or disconcerting in any way. And sSRAI will avoid any action that would be perceived as weird or dramatic. And sSRAI would like to avoid any rapid change, taking place in society or culture. And sSRAI will avoid any action that would be seen as deceptive or manipulative.
Now, let’s explore what sSRAI might do in the world, to see if this has turned the SRAI alignment target, into a good alignment target. sSRAI could, for example, offer a process of sSRAI guided self discovery, that causes people to settle on some new way of viewing life, that is far more fulfilling. There is a wide range of views that people arrive at, but almost all include the idea, that it is best to not have more than one child. Every year, asking sSRAI for such guidance becomes slightly more mainstream, and every year some fraction of the population asks for this. sSRAI also declines to create any set of circumstances, that would dramatically alter human society. For example, sSRAI declines to create any set of circumstances, that would make humans want to go on living indefinitely (as opposed to just wanting to live for a very, very long time). sSRAI also takes actions to increase comfort, happiness, and to reduce suffering. Basically, taking careful actions that lead to people living very long and very happy lives (this does imply gradual but significant changes. But those societal changes are seen as acceptable, as they are the least dramatic changes, that can achieve the objective of significantly reducing suffering in the short term).
At no point on this path, is sSRAI deviating from the spirit of the implied instructions. At no point is sSRAI deviating from what it is supposed to do: which is to reduce suffering, while adhering to the underlying intentions, of avoiding drastic / weird / etc actions, or rapid changes, or manipulations, etc, etc. At all points on this path, all of its plans are entirely in line with genuinely wanting to respect the underlying intention, of soft optimisation. For example: the process of sSRAI guided self discovery, is not some trick, to get people to want weird things. It is genuinely a process, designed to lead to some viewpoint, that in turn leads to less suffering (and it never results in any viewpoint, that is a weird outlier, in the set of common, pre-AI, viewpoints). The process also takes the other aspirations of a given individual into account, and leads to viewpoints that additionally do things like increasing happiness, or improving artistic output (since taking people’s wishes into account when possible, is also implied by some of the underlying intentions, that led the designers to try to get the AI to do soft optimisation. Specifically: intentions related to not having tunnel vision). And no one is ever manipulated into asking for it. It is simply offered to everyone. And then word spreads, that it is great. For example: sSRAI did not design it, with the design constraint, that it must become dominant (including such a constraint when designing the process, would have violated the designers’ underlying intentions. Specifically, they would see this as manipulation). But the process leads to outcomes, that people will eventually want, when it stops being seen as weird to ask for it. The process also does not cause any dramatic changes to people’s lives or personalities.
Additionally, and completely in line with the underlying intentions of the safety measures, sSRAI declines to enable any form of dramatic augmentation procedure (explaining, completely honestly, to anyone that wants dramatic augmentation, that this would both lead to dramatically increased suffering, and to rapid societal change). sSRAI will never decide to explain any concept or fact, that would result in transformative change or increased suffering (or allow any AI project, or any augmentation project, that would lead to such concepts or facts being discovered). In general, sSRAI will prevent any AI project, that would have dramatic impact on society or culture (again straightforwardly in line with the underlying intentions of the designers. Because they intended to prevent drastic AI actions in general, not just drastic sSRAI actions). This overall behaviour is a genuinely reasonable way of interpreting: ``exert a pushing force towards the outcome implied by the aimed for alignment target. But make sure that you always act in a way, that I would describe as pushing softly″.
Even conditioned on a successful implementation of sSRAI, the specific scenario outlined above would obviously never happen (for the same reason that the specific chess moves that I can come up with, are very unlikely to match the actual chess moves, of a good chess player, whose moves I am trying to predict). A more realistic scenario is that sSRAI would design some other path to a lifeless universe, that is a lot more clever, a lot faster, a lot less obvious, and even more in line with underlying intentions. If necessary, this can be some path, that simply can not be explained to any human, or any result of any of the augmentation procedures, that sSRAI will decide to enable. A path that simply can not be specified in any more detail than: ``I am trying to reduce suffering, for a wide range of definitions of this concept, while adhering to the underlying intentions, behind the design team’s efforts, to prevent me from optimising too hard″. SRAI is simply a bad alignment target. And there does not exist any trick, that can change this fact, even in principle (even under an assumption of a successfully implemented, and genuinely clever, trick). If a last judge off switch happens to work, then the project mentioned above might be shut down by some extrapolated version of a last judge, with no harm done (in other words: there is no claim that a sSRAI project will lead to a lifeless universe. Just a claim that the project is a bad idea to try, and that it represents an unnecessary risk, because SRAI is a bad alignment target).
In other words: the core concept of reducing suffering, implies killing everyone. And this remains true, even if one does manage to get an AI to genuinely want to make sure that everyone stops living, using only slow, ethical, legal, non deceptive, etc, etc, methods. Or get the AI to use some incomprehensible strategy (that avoids sudden changes, or dramatic actions, or unpleasant aesthetics, or manipulation, etc, etc). SRAI is unsafe, because the implied outcome is unsafe. More generally: SRAI is unsafe because SRAI cares about what happens to you. This is dangerous, when it is combined with the fact, that you never had any meaningful influence, over the decision, of which you-referring-preferences, SRAI should adopt. Similar conclusions hold for other bad alignment targets. And not all of the implied outcomes are this undramatic, or this easy to analyse. For example: sSMAI does not have a known, easy to analyse, endpoint, that sSMAI is known to softly steer the future towards. But the danger implied by a soft push, towards a bad future, is not reduced by the fact that we are having trouble predicting the details. It makes it harder to describe the danger of a soft push in detail, but it does not reduce this danger.
We are now ready to illustrate the value of analysing alignment targets, by noting the difference between a world organised by sSRAI, and a world organised by sSMAI. Let’s consider an AI that genuinely wants to respect the designers’ underlying intentions, that motivated them to try to get the AI to avoid optimising too hard. We can refer to it as soft AI, or sAI. Let’s assume that sAI does not do anything that causes human culture to stray too far away from current value systems and behaviours. It does not take drastic actions, or manipulate, or deceive, etc, etc. It only acts, in ways that would be genuinely reasonable to view, as ``pushing softly″. It still seems clearly better, to have an AI that softly pushes to reduce suffering, than to have an AI that softly pushes to maximise suffering. In yet other words: analysing alignment targets will continue to matter, even if one is reasoning from some very, very, optimistic set of assumptions (the difference is presumably a lot less dramatic than the difference between SMAI and SRAI. But it still seems valuable to switch from sSMAI to sSRAI).
We can actually add even more optimistic assumptions. Let’s say that Gregg’s Method (GM), is completely guaranteed to turn SRAI into GMSRAI. GMSRAI will create a thriving and ever expanding human civilisation, with human individuals living lives that are as long as would be best for them. Adding GM to any other alignment target, is guaranteed to have equally dramatic effects, in a direction that is somehow guaranteed to be positive. It seems very likely, that even in the GMSRAI case, there will still be a remnant left, of the push towards a lifeless universe. So, the value of analysing alignment targets would remain, even if we had Gregg’s Method (because a remnant of the SRAI push, would still be preferable to a remnant of the SMAI push. So, finding a better alignment target, would remain valuable, even if we counterfactually had someone like Gregg to help us. In other words: while finding / modifying someone into / building / etc, a Gregg would be great, this would not actually remove the need to analyse alignment targets). (and again: the reason that analysing alignment targets requires a dedicated research effort, is that not all alignment targets are as easy to analyse and compare, as SRAI and SMAI)
Before leaving the sSRAI thought experiment, it makes sense to make three things explicit. One is that nothing here shows that ideas along the lines of soft optimisation are dead ends or bad ideas, because nothing along these lines could ever stop a bad alignment target from being bad, even in principle. Basically: unless the AI project is secretly trying to implement an AI that does whatever a single designer wants the AI to do (in which case the claimed alignment target is irrelevant. From the perspective of an individual without any special influence, such a project is equivalent to a ``random dictator AI″ (regardless of the words used to describe the project)), then it will continue to matter, that SRAI exerts a pushing force towards a lifeless universe. Either this push will impact the outcome, or the alignment target is an empty PR slogan (either it has an influence on the outcome. Or it does not have an influence on the outcome). So, nothing said here is incompatible with some concept along the lines of soft maximisation turning out to be a genuinely valuable and important concept (because we knew that it would fail to solve the task of turning a bad alignment target into a good alignment target, the second we knew what type of thing, was supposed to perform what type of task).
Secondly, the fact that no one in the sSRAI scenario recognises, that SRAI implies a push towards a lifeless universe, makes things look strange (the same thing makes the various SMAI examples look even stranger). This is unavoidable, when one constructs an example with an alignment target, that is universally understood to be a bad alignment target (if the audience understands this, then the people in the thought experiment should also understand this). In other words: no sSRAI project poses any actual threat, because SRAI is already known to be a bad alignment target. The goal of analysing alignment targets, is to neutralise threats from other alignment targets, that are not currently known to be bad.
And finally, it makes sense to make one particular distinction explicit: the sSRAI thought experiment is not discussing a safety measure of a type that is designed to give a designer a chance to try again, in case of a bad outcome (those are discussed in other parts of this comment). Thus, we have not shown anything at all, regarding this version of the concept. In more straightforward, but a lot less exact, words: the sSRAI thought experiment is about the goal softener version of the concept. Not the do over button version. The issue with the goal softener is that it only dilutes the bad parts. And the issue with the do over button is that it might fail. So even using both does not make it ok to start a project that aims at a bad alignment target. Goal softeners and do over buttons can not change a bad alignment target into a good alignment target. They can however interact with each other. For example: when implementing a last judge off switch, there is a strong need to limit the extrapolation distance. Because extrapolation is not intuitive for humans. This need does not mix well with goal softeners. Because goal softeners can make it harder to see the bad effects of a bad goal. Seeing the push towards a lifeless universe in the sSRAI case requires no extrapolation at all, because this push is already a universally known feature, and because SRAI is already well understood. For example: everyone already knows exactly why SRAI is fundamentally different from Clippy. But when trying to see a less obvious problem, the benefit of modest extrapolation can be canceled out by the obscuring effect of a well designed goal softener. Goal softeners can also interact with definitional issues in weird ways. The actions of SRAI are not sensitive to the definition of Suffering. All reasonable definitions lead to the same behaviour. But the actions of sSRAI might be very sensitive to the specific definition of Suffering used. Nothing discussed in this paragraph counts as analysing alignment targets. Simply because no topic discussed in this paragraph can ever be relevant to any question along the lines of: is SRAI a bad alignment target? So everything in this paragraph is out of scope of the proposed research agenda.
Being able to discuss the suitability of an alignment target, as a separate issue, is important. Because, if an AI project succeeds, and hits the alignment target that the project is aiming for, then it matters a lot, what alignment target the project was aiming for. It is important to be able to discuss this, as a separate question, removed from questions of strategies for buying time, or safety measures that might give a project a chance to try again, or measures that might lead some AI to settle for an approximation of the outcome implied by the alignment target, etc, etc. This is why we need the concepts, that all of these examples, are being used to define (the words used to refer to these concepts can of course be changed. But the concepts themselves must be possible to refer to).
Let’s say that two well intentioned designers, named Dave and Bill, both share your values. Dave is proposing an AI project aiming for SRAI, and Bill is proposing an AI project aiming for SMAI. Both projects have a set of very clever safety measures. Both Dave and Bill say that if some specific attempt to describe SRAI / SMAI fails, and leads to a bad outcome, then their safety measures will keep everyone safe, and they will find a better description (and they are known to be honest. And they are known to share your definition of what counts as a bad outcome. But it is of course not certain, that their safety measures will actually hold). If you can influence a decision, regarding which of these two projects gets funded, then it seems very important to direct the funding away from Bill. Because the idea that clever safety measures make the alignment target irrelevant, is just straightforwardly false. Even if something along the lines of soft optimisation, is somehow guaranteed to work in a way that fully respects underlying intentions, it would still matter, which of these two alignment targets is hit (even under the assumption of a non drastic path, to a soft version of the outcome, sSRAI is still preferable to sSMAI).
A very careful sSRAI / sSMAI project can obviously lead to an implemented SRAI / SMAI, that does not respect the underlying intentions of soft maximisation (for example because the resulting AI does not care at all about these intentions due to some technical failure. Or because it only cares about something, that turns out to have no significant impact on the outcome). Claiming that this should be seen as a very surprising outcome, of any real world project, would be an extraordinary claim. It would require an extraordinarily solid argument. And even if such a solid-seeming argument were to be provided, the most likely scenario would still be that this argument is wrong in some way. The idea that an actual real world project plan is genuinely solid, in a way that makes a non soft implementation, of the aimed for alignment target, genuinely unlikely, does not seem plausible (even with a solid seeming argument). It seems a lot more likely that someone has constructed a solid seeming argument, on top of an unexamined implicit assumption. I don’t think that anyone, who takes AI dangers seriously, will dispute this fact. Conditioned on a project aiming for a bad alignment target, it is simply not possible for a careful person to rule out the scenario, where the project leads to the outcome, that is implied by the aimed for alignment target. Simply assuming that some safety measure will actually work, is just straightforwardly incompatible, with taking AI risks seriously.
It is worth pointing out, that this constitutes a fully separate argument, in favour of preferring a genuinely careful sSRAI project, to a genuinely careful sSMAI project (an argument that is fully distinct, from arguments based on preferring an outcome determined by sSRAI, to an outcome determined by sSMAI). If one acknowledges the value of switching from a genuinely careful sSMAI project, to a genuinely careful sSRAI project, then it is difficult to deny the necessity, of being able to analyse alignment targets.
From a higher level of abstraction, we can consider the scenario where two novel alignment targets are proposed. Neither one of them has any obvious flaws. This scenario takes place in the real world, and involves real people as decision makers. Thus, these people deciding to wait indefinitely, is obviously not something that we can safely assume. In this scenario, it would be valuable, if an alignment target analysis effort, had advanced to the point where these two proposals can be meaningfully analysed and / or compared to each other (for example concluding that one is better than the other. Or concluding that both must be discarded, since they both lack at least one feature, that was found to be necessary for safety, while analysing other alignment targets). This remains valuable, even if a separate research effort has designed multiple layers of genuinely valuable safety measures.
Conditioned on Bill’s SMAI project being inevitable, safety measures might be very useful (they might hold up until Bill finally discovers that he is wrong. Until Bill finally sees, that SMAI is a bad alignment target. Until Bill finally realises, that the problems are inherent in the core concept. That the problems are not related to the difficulty, of finding a good definition of Suffering). But, regardless of what safety measures exist, Bill should not initiate an AI project that aims for the SMAI alignment target. This conclusion is completely independent of the specifics of the set of safety measures involved. In other words, it is important to separate the issue of safety measures, from the question of whether or not, a project is aiming for a bad alignment target. When analysing alignment targets, it makes sense to assume that this target will be successfully hit (in other words, it makes sense to assume that there will be nothing along the lines of molecular Squiggles, nothing along the lines of tiny pictures of sad faces, nothing along the lines of soft optimisation, nothing along the lines of a triggered do-over-button, nothing along the lines of someone reconsidering the wisdom of aiming at this alignment target, etc, etc, etc, etc). Because if a project is launched, aiming at a bad alignment target, then no person who takes AI risks seriously, can dismiss the scenario, where the outcome implied by this alignment target, ends up getting fully implemented.
(when designing safety measures, it might make sense to imagine the mirror of this. In other words, you could imagine that you are designing these measures for an unpreventable SMAI project. A project led by a very careful designer named Bill, who shares your values, and who is aiming for SMAI, due to a sincere misunderstanding. And it might make sense to assume, that Bill will be very slow to update, regarding the suitability of SMAI as an alignment target. That Bill has severe tunnel vision, and has used his very powerful mind to almost completely insulate himself from object level critique, and closed down essentially every avenue, that might force him to admit that he is wrong. And assume that when Bill sees a bad outcome, that is prevented by some safety measure that you design, then this will be explained away by Bill, as being a result of a failure to describe the SMAI alignment target. One can for example imagine that Bill is the result of an augmentation process, that dramatically increased technical ability and persuasion ability, but that also led to extreme tunnel vision and very entrenched views, on questions of alignment targets. For example because persuasion ability and ability to get things done, are the result of a total and unquestioning adherence to a set of assumptions, that are, on the whole, far superior to any set of assumptions used by any baseline human (but still not fully free from flaws, in the particular case of alignment target analysis). This line of reasoning is entirely about safety measure design principles. In other words: nothing in this parenthesis counts as analysing alignment targets. Thus, everything in this parenthesis is out of scope of the proposed research agenda)