The Coherent Extrapolated Volition of a human Individual (CEVI) is a completely different type of thing than the Coherent Extrapolated Volition of Humanity (CEVH). Both are mappings to an entity of the type that can be said to want things. But only CEVI is a mapping from an entity of the type that can be said to want things (the original human). CEVH does not map from such an entity; it only maps to one. A group of billions of human individuals can only be seen as such an entity if one already has a specific way of resolving disagreements amongst individuals that disagree on how to resolve disagreements. Such a disagreement resolution rule is one necessary part of the definition of any CEVH mapping.
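To make the type difference concrete, here is a minimal Python sketch (all names and type signatures here are my own illustrative assumptions, not anything taken from an actual CEV specification):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WantingEntity:
    """Anything that can coherently be said to want things (e.g. a human individual)."""
    name: str


# CEVI: maps FROM a wanting entity TO a wanting entity (the extrapolated individual).
CEVIMapping = Callable[[WantingEntity], WantingEntity]

# CEVH: maps TO a wanting entity, but not FROM one. A group of individuals only
# becomes something that can be extrapolated once a specific disagreement
# resolution rule has been chosen.
DisagreementResolutionRule = Callable[[List[WantingEntity]], WantingEntity]


def make_cevh_mapping(
    rule: DisagreementResolutionRule,
) -> Callable[[List[WantingEntity]], WantingEntity]:
    """A CEVH mapping is only defined once a rule is chosen; different rules
    define different mappings, and therefore different group entities."""
    def cevh(individuals: List[WantingEntity]) -> WantingEntity:
        return rule(individuals)
    return cevh
```

The only point the sketch is meant to display is that `make_cevh_mapping` cannot even be written down without choosing a `rule`: the group entity that CEVH maps to only exists relative to that choice.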
I like to state this as the issue that all versions of CEV/group alignment that want to aggregate the values of thousands of people or more require implicitly resolving disagreements in values, which in turn requires value-laden choices, and at that point you are essentially doing value-alignment to what you think is good, and the nominal society is just a society of you.
I basically agree with Seth Herd here, in that instruction following is both the most likely and the best alignment target for purposes of AI safety (at least assuming offense-defense balance issues aren’t too severe).
The successful implementation of an instruction following AI would not remove the possibility that an AI Sovereign will be implemented later. For example: the path to an AI that implements the CEV of Humanity outlined on the CEV Arbital page starts with an initial non-Sovereign AI (and this initial AI could be the type of instruction following AI that you mention). In other words: the successful implementation of an instruction following AI does not prevent the later implementation of a Group AI. It is in fact one step on the classical path to a Group AI. So even if we assume that the first clever AI will be an instruction following AI, this does not remove the need to analyse Sovereign AI proposals. In some scenarios where we end up with a successfully implemented AI Sovereign, we will not have a lot of time for Alignment Target Analysis (ATA), and we will not get a lot of help with ATA. This is dangerous, because it is currently not possible to reliably tell whether a reasonable sounding proposal implies a massively worse than extinction outcome. The last section of this comment goes into more detail about this, and links to previous posts and comments. But first I will respond to your comment about CEV, and try to clarify what type of proposals the present post is analysing.
Regarding your comment about CEV:
One can propose to Coherently Extrapolate the Volition of all sorts of things for all sorts of reasons. The acronym CEV can thus be used as shorthand for all sorts of AI proposals, including proposals to build an AI that effectively does whatever one single individual wants that AI to do. If such a proposal is explained in an unclear way, then readers might get the mistaken impression that the proposed AI is supposed to do what Humanity wants that AI to do (in some sense). But such proposals are out of scope for the present post.
None of the alignment targets that I am analysing in the present post are like this. I am instead analysing proposals along the lines of Yudkowsky’s proposal to build an AI that implements the Coherent Extrapolated Volition of Humanity. (To me, it seems that this text is explicitly and unambiguously proposing to build an AI that genuinely does what Humanity wants done (in some sense).) In other words: each proposal analysed in the post is genuinely a proposal to create an AI that is built on top of a CEVH mapping. None of them is a proposal to create an AI that is effectively built on top of a CEVI mapping. In yet other words: they are all, genuinely, Group AI proposals. My point is that such Group AI proposals are either confused (effectively proposing to build an AI that implements the will of a non-existing, free-floating G entity, whose existence is implicitly assumed to be separate from any specific mapping or set of definitions), or extremely bad for individuals (which should not be surprising, because there is no reason to expect that doing what one type of thing wants would be good for a completely different type of thing).
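To illustrate why there is no mapping-independent G entity: the same set of individuals will be mapped to different “group wants” by different disagreement resolution rules, and the choice between such rules is itself a value-laden choice. A toy sketch (the rules, names, and numbers below are arbitrary assumptions, picked purely for illustration):

```python
# Three individuals with preference scores over two outcomes, A and B.
preferences = {
    "ind1": {"A": 10, "B": 0},
    "ind2": {"A": 10, "B": 0},
    "ind3": {"A": 0, "B": 100},
}


def rule_majority(prefs):
    """Each individual votes for their top outcome; the most common vote wins."""
    votes = [max(p, key=p.get) for p in prefs.values()]
    return max(set(votes), key=votes.count)


def rule_score_sum(prefs):
    """The outcome with the highest total score wins."""
    totals = {}
    for p in prefs.values():
        for outcome, score in p.items():
            totals[outcome] = totals.get(outcome, 0) + score
    return max(totals, key=totals.get)


print(rule_majority(preferences))   # A
print(rule_score_sum(preferences))  # B
```

Neither answer is what “the group” wants in a mapping-independent sense. Each answer is simply what one particular CEVH-style mapping defines the group to want.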
In other words: all proposals analysed in the post are genuinely Group AI proposals. What I am trying to do is describe a feature shared by all Group AI proposals: that the successful implementation of a Group AI would be extremely bad for individuals. (Since all my readers are individuals, I don’t have to show that the proposals are objectively bad. A given proposal might be very good for some specific Group entity. But since all my readers are individuals, the fact that the AI in question might be good for some arbitrarily defined abstract G entity is not relevant.)
I think that it is often possible to differentiate between a proposal based on a CEVH mapping on the one hand, and a misleadingly described proposal based on a CEVI mapping on the other hand. In other words: I think that it is often possible to separate a genuine Group AI proposal from a proposed AI that effectively does what one individual wants that AI to do (but that is described in such a way that readers might get the mistaken impression that the AI will do what Humanity wants it to do). Let’s say that Dave is proposing to build an AI, Dave’s proposed AI (DAI), that will implement the Coherent Extrapolated Volition of something. And let’s say that we know that Dave is either proposing a genuine Group AI, or else Dave is using a misleading description of an AI that effectively does whatever one person wants that AI to do. One way to see what is actually being proposed is to examine how Dave responds to a certain type of criticism of DAI.
Let’s say that Steve claims that a successfully implemented DAI would be bad for individuals. If Dave at any point responds to such criticism by implying that this type of criticism is somehow self-defeating, or by implying that if the resulting AI is bad for individuals then it is by definition not successfully implemented, or by implying that if a successfully implemented DAI is bad for individuals then there must be some sort of definitional issue involved, then Dave is presumably not thinking about anything along the lines of a CEVI project. Instead, Dave is presumably thinking about a genuine Group AI project. As explained in the post: if Dave defends a Group AI proposal in this way, then Dave is making a mistake. But if Dave defends a CEVI AI proposal in this way, then Dave is making a far more obvious mistake. So if we assume some basic level of honesty and competence, then this type of defence of DAI indicates that Dave is proposing an AI that is genuinely meant to do what a group wants that AI to do (in some sense).
A person proposing a CEVI project would presumably never claim that an AI that is bad for individuals is by definition not successfully implemented (since this would be very obvious nonsense as a response to criticism of a CEVI project). If Steve says that a given AI is a bad idea, then a person proposing a CEVI project would never imply that this is somehow self-defeating or wrong by definition. It is presumably obvious to most people that a successfully implemented CEVI project can be catastrophically bad for individuals (in a way that is not due to any specific detail or definition).
In other words: it is presumably obvious to most people that doing what one individual wants can be bad for other individuals. But it seems to be more difficult to see that doing what an arbitrarily defined abstract Group entity wants can also be bad for individuals. With this in mind, let’s say that we know that DAI will get its goal entirely from some set of humans. If Dave says that a successfully implemented DAI would by definition be good for individuals, then Dave is presumably not proposing an AI that effectively does what one person would want (because defending such an AI in this way would be obvious nonsense: it would be equivalent to claiming that doing what one individual wants can never be bad for other individuals).
In yet other words: consider the case where Dave is making claims along the lines of: if a successfully implemented DAI is bad for individuals, then this implies that DAI is built on top of bad definitions. This behaviour means that Dave is probably not proposing an AI that effectively does what one individual wants the AI to do. Dave’s behaviour is however consistent with Dave proposing a Group AI (and being confused in a less obvious way). It seems very unlikely that someone would claim that doing what one individual wants is by definition good for other individuals. Therefore, anyone who is in fact proposing an AI that would effectively do whatever one person wants would presumably not make the type of arguments that Dave is making above (assuming a basic level of honesty and competence).
Let’s now look at how someone who is actually proposing a CEVI project might respond to this type of criticism. Mike is proposing to build Mike’s proposed AI (MAI), which will effectively do whatever one individual wants MAI to do. Let’s now say that Steve claims that MAI would be bad for individuals. Mike would presumably never respond by implying that such an outcome is somehow contradictory. Consider the claim: “if a successfully implemented MAI hurts Steve, then MAI must be built on top of bad definitions”. This is not just nonsense. It is very obvious nonsense. Mike would for example presumably never say anything along the lines of: “If you, Steve Smith, can see that MAI would be bad for individuals, then surely a clever AI will see this problem as well”. Mike would never say this, because that statement would be obvious nonsense (assuming a basic level of honesty and competence). It would be equivalent to claiming that doing what one individual wants cannot by definition be bad for other individuals.
My focus is not on people like Mike or proposals along the lines of MAI. My focus is on genuine Group AI proposals. When Dave claims that his Group AI proposal cannot by definition be bad for individuals, then Dave is also confused (as I explain at length in the post). But what Dave is saying is evidently not obvious nonsense, which is why I thought that it made sense to explain exactly why Dave’s statements indicate confusion. Labels being used in problematic ways can make it more difficult to see this type of free-floating-G-entity confusion.
Let’s look at an analogy where the label Gavagai is being used in an inconsistent way to talk about cells and individuals (and where this usage in turn makes it more difficult to notice a type of confusion, that is analogous to the free-floating-G-entity confusion discussed in the post). There is nothing strange about an AI that does what Gregg wants, and in the process destroys every single one of Gregg’s cells (for example because Gregg wants to be uploaded, and would prefer that his cells are not left alive after uploading). This is unsurprising, because there is no reason to be surprised when doing what one type of thing wants (in this case Gregg), is bad for a completely different type of thing (in this case Gregg’s cells). This behaviour is not caused by any specific detail in the proposal to build an AI that does what Gregg wants that AI to do (in other words: the behaviour does not imply any form of implementation or definitional issue).
It would be obvious nonsense to claim that a clever enough AI (designed to do what Gregg wants that AI to do) would surely be able to figure out that killing cells is a bad outcome. It seems very unlikely that anyone would say anything along the lines of: if you, Steve Smith, can see that the outcome would be bad for cells, then surely a clever AI would be able to see this too. It would likewise be obvious nonsense to claim that if such an AI kills cells, then it is by definition not successfully implemented. If an AI that is meant to do what Gregg wants ends up killing cells, then there is no reason at all to think that something has gone wrong, because there is no reason at all to be surprised when doing what one type of thing wants is bad for a completely different type of thing.
If Bob actually does claim that doing what Gregg wants cannot by definition be bad for Gregg’s cells, then it would presumably be obvious that Bob is confused. If Allan however uses the label Gavagai to sometimes refer to Gregg, and at other times to refer to Gregg’s cells, then it might be more difficult to notice that Allan is confused. There are many scenarios where a given statement would be reasonable for both meanings (for example: seatbelts are important to prevent bad things from happening to Gavagai). Let’s now say that Steve claims that an AI that does what Gregg wants this AI to do will kill every one of Gregg’s cells (if that AI is successfully implemented). Steve further claims that this AI will behave like this for reasons that are not related to any specific detail in the definitions (in other words: Steve claims that this behaviour is not due to any definitional issues). Now let’s say that Allan claims that any AI that behaves like this would by definition not be successfully implemented. Allan justifies this with an argument that basically boils down to some claim along the lines of: doing what Gavagai wants cannot by definition be bad for Gavagai. In this case, Allan would be just as confused as Bob. But the fact that Allan is confused might not be equally obvious. Something similar might happen if a confused person is using the label CEV in a discussion about a specific AI proposal, without noticing that the label is being used to refer to different things at different times.
In yet other words: if Dave at any point responds to criticism of his proposed AI project by implying that the criticism is somehow self-defeating, or by saying that if his proposed AI is bad for individuals then the AI is by definition not successfully implemented, or by saying that if individuals are hurt by a successfully implemented AI then this must be due to some form of definitional problem, then Dave is not thinking about anything along the lines of a CEVI project. Dave is not proposing an AI that effectively does what one individual wants that AI to do. Such responses would however be expected if Dave is confused, and is proposing an AI project that is genuinely meant to lead to the implementation of a Group AI.
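The diagnostic above can be condensed into a toy classifier. This is only a sketch of the reasoning in the preceding paragraphs; the response labels are my own illustrative shorthand, not terminology from the post:

```python
# Illustrative labels for the three defensive responses discussed above.
SELF_DEFEATING = "the criticism is somehow self-defeating"
NOT_SUCCESSFUL = "if bad for individuals, then by definition not successfully implemented"
DEFINITIONAL = "if bad for individuals, then there must be a definitional problem"


def classify_proposal(daves_responses: set) -> str:
    """Infer what Dave is probably proposing, from how he defends DAI against the
    criticism that a successfully implemented DAI would be bad for individuals."""
    if daves_responses & {SELF_DEFEATING, NOT_SUCCESSFUL, DEFINITIONAL}:
        # Defending a CEVI-style AI this way would be obvious nonsense, so (assuming
        # basic honesty and competence) Dave is presumably proposing a genuine Group AI,
        # and is confused about a free-floating G entity in a less obvious way.
        return "genuine Group AI proposal"
    return "inconclusive: could be a CEVH-based or a misdescribed CEVI-based proposal"
```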
This confusion about abstract Group entities that I keep talking about is making it difficult to discuss a certain class of AI proposals. But the confusion is not the actual problem that I am trying to deal with. The actual problem that I am trying to deal with is that the proposals in question imply very bad outcomes (for individuals, in expectation, if successfully implemented, and for reasons that are not related to any specific implementation detail or definitional choice). This should be an entirely unsurprising discovery, for the same reason that it is entirely unsurprising to discover that an AI that does what Gregg wants this AI to do would kill every single one of Gregg’s cells (if successfully implemented, and for reasons that are not related to any specific implementation detail or definitional choice).
The fact that these proposals would be bad for individuals is a problem for my intended audience: individuals. In other words: the proposals are not objectively bad. A given proposal can be very good for whatever arbitrarily defined abstract Group entity is implied by the chosen set of definitions. But I am addressing human individuals. So I don’t try to argue that proposals are objectively bad; I only need to show that they are bad for human individuals. The confusion is problematic because it is making it difficult to discuss the real issue: the existence of AI proposals that imply massively worse than extinction outcomes. As I have argued elsewhere, there exist many plausible paths along which a Sovereign AI proposal might end up getting successfully implemented before much progress has been made on understanding Sovereign AI proposals. This is the topic of the next section.
Regarding your comment about what type of alignment target is most likely to be pursued first:
My claim is that it is important to start analysing Sovereign AI proposals now. A reasonable sounding proposal might lead to a massively worse than extinction outcome (if successfully implemented). Noticing this is not guaranteed (as illustrated by the fact that PCEV was a fairly famous proposal for a long time, without anyone noticing that PCEV implies such an outcome). The relevant question for evaluating the claim that Sovereign AI proposals need to be analysed now is whether or not a Sovereign AI proposal might at some point be successfully implemented. The prior implementation of some other AI is not necessarily relevant. Such an earlier AI is only relevant if that AI removes the need to analyse Sovereign AI proposals now, for example because it is assumed that such an AI would buy sufficient time, or because it is assumed that such an AI would be very useful when analysing Sovereign AI proposals.
I have argued elsewhere that the successful implementation of a Limited AI (for example an instruction following AI) might not actually result in sufficient time (for example in this post, this post, and the last section of this comment). I have also argued that a successfully implemented Limited AI might not be very helpful when analysing Sovereign AI proposals (for example in my reply to the Seth Herd comment that you mention, and in this post). (One issue is that trying to ask a clever AI to find a good Sovereign AI proposal is equivalent to trying to ask an AI what goal it should have.) Relatedly, this post argues that augmented humans might not be very helpful when analysing Sovereign AI proposals (in brief: even if the augmentation process results in a mind that is better than baseline humans at hitting alignment targets, this does not mean that the resulting mind will be good at analysing alignment targets).
Analysing Sovereign AI proposals now might end up having no impact at all on the outcome. One unsurprising scenario is that a misaligned AI kills everyone for reasons that are completely unrelated to what the designers were trying to do. Another possibility is that an augmented human (or an AI assistant) will propose some new way of looking at things, that renders all past work by non-augmented humans irrelevant (and that is not building on such past work). But it is not safe to assume that anything along these lines will actually happen (and the position that analysing Sovereign AI proposals now is not needed, is implicitly built on top of such an assumption).
The basic problem is that it is clearly not currently possible to reliably determine whether or not a reasonable sounding proposal implies a massively worse than extinction outcome (as for example illustrated by the PCEV incident). This seems like a really bad place to be. It is also true that making incremental progress on risk reduction is a tractable research goal (as for example illustrated by the present post). In other words: we might end up with a massively worse than extinction outcome as a result of a successfully implemented AI Sovereign. The probability of this can be reduced by analysing Sovereign AI proposals. (See also this comment for an attempt to clarify the type of risk reduction that ATA is trying to accomplish).
A summary and an analogy: there exist many plausible paths that end in a successfully implemented AI Sovereign. Even if the first AI is not an AI Sovereign, we might eventually end up with a successfully implemented AI Sovereign anyway. And we might not have a lot of time to analyse such proposals. And we might not get a lot of help with such analysis. Thus, analysing Sovereign AI proposals now reduces the probability of a massively worse than extinction outcome. This remains true despite the fact that ATA might end up having no impact on the outcome. Consider an analogy with Bill, who finds himself in a war-zone and decides to look for a good bulletproof vest. The vest will not help if Bill is shot in the head. It is also possible that Bill will get shot in the stomach with a high caliber weapon that no vest can stop. But vests are still very popular amongst people who find themselves in war-zones. Because if you do get shot in the stomach, then it really is very nice to be wearing a vest (and the quality of that vest might make a huge difference).
Getting shot in the stomach is here analogous to a situation where (i): it becomes possible to successfully implement a Sovereign AI (for example with the help of something along the lines of an instruction following superintelligent AI, augmented humans, a non-superintelligent AI assistant, etc), (ii): there is a time crunch (for example due to Internal Time Pressure), (iii): the ability to do ATA has not increased dramatically, and (iv): there exists a Sovereign AI proposal that implies an outcome massively worse than extinction (for example along the lines of the outcome implied by PCEV, but due to a problem that is more difficult to notice). It could be that this problem is simply not realistically findable (analogous to the high caliber weapon). But it could also be the case that the problem in question is realistically findable in time, iff ATA progress has advanced to a level that is realistically achievable (analogous to a bullet that is possible to stop with a realistically findable vest). This is what makes me think that the current situation (where there exist exactly zero people in the world dedicated to ATA) is a serious mistake.