ThomasCederborg comments on Managing risks while trying to do good

ThomasCederborg Feb 2, 2024, 12:53 AM
1 point
0
I’m not sure that I agree with this. I think it mostly depends on what you mean by: ``something like CEV″. All versions of CEV are describable as ``doing what a Group wants″. It is inherent in the core concept of building an AI, that is ``Implementing the Coherent Extrapolated Volition of Humanity″. This rules out proposals, where each individual, is given meaningful influence, regarding the adoption, of those preferences, that refer to her. For example as in MPCEV (described in the post that I linked to above). I don’t see how an AI can be safe, for individuals, without such influence. Would you say that MPCEV counts as ``something like CEV″?
If so, then I would say that it is possible, that ``something like CEV″, might be a good, long term solution. But I don’t see how one can be certain about this. How certain are you, that this is in fact a good idea, for a long term solution?
Also, how certain are you, that the full plan that you describe (including short term solutions, etc), is actually a good idea?
- Vladimir_Nesov Feb 2, 2024, 1:24 AM
  2 points
  0
  Parent
  The issue with proxies for an objective is that they are similar to it. So an attempt to approximately describe the objective (such as an attempt to say what CEV is) can easily arrive at a proxy that has glaring goodharting issues. Corrigibility is one way of articulating a process that fixes this, optimization shouldn’t outpace accuracy of the proxy, which could be improving over time.
  
  Volition of humanity doesn’t obviously put the values of the group before values of each individual, as we might put boundaries between individuals and between smaller groups of individuals, with each individual or smaller group having greater influence and applying their values more strongly within their own boundaries. There is then no strong optimization from values of the group, compared to optimization from values of individuals. This is a simplistic sketch of how this could work in a much more elaborate form (where the boundaries of influence are more metaphorical), but it grounds this issue in more familiar ideas like private property, homes, or countries.
  - ThomasCederborg Feb 2, 2024, 10:17 PM
    1 point
    0
    Parent
    I think that my other comment to this, will hopefully be sufficient, to outline what my position actually is. But perhaps a more constructive way forwards, would be to ask how certain you are, that CEV is in fact, the right thing to aim at? That is, how certain are you, that this situation is not symmetrical, to the case where Bob thinks that: ``a Suffering Reducing AI (SRAI), is the objectively correct thing to aim at″? Bob will diagnose any problem, with any specific SRAI proposal, as arising from proxy issues, related to the fact that Bob is not able to perfectly define ``Suffering″, and must always rely on a proxy (those proxy issues exists. But they are not the most serious issue, with Bob’s SRAI project).
    I don’t think that we should let Bob proceed with an AI project, that aims to find the correct description of ``what SRAI is″, even if he is being very careful, and is trying to implement a safety measure (that will, while it continues to work as intended, prevent SRAI from killing everyone). Because those safety features might fail, regardless of whether or not someone has pointed out a critical flaw in them, before the project reaches the point of no return (this conclusion is not related to Corrigibility. I would reach the exact same conclusion, if Bob’s SRAI project, was using any other safety measure). For the exact same reason, I simply do not think, that it is a good idea, to proceed with your proposed CEV project (as I understand that project). I think that doing so, would represent a very serious s-risk. At best, it will fail in a safe way, for predictable reasons. How confident are you, that I am completely wrong about this?
    Finally, I should note, that I still don’t understand your terminology. And I don’t think that I will, until you specify what you mean with ``something like CEV″. My current comments, are responding to my best guess, of what you mean (which is, that MPCEV, from my linked to post, would not count as ``something like CEV″, in your terminology). (Does an Orca count as: ``something like a shark″? If it is very important, that some water tank is free of fish, then it is difficult for me to discuss Dave’s ``let’s put something like a shark, in that water tank″ project, until I have an answer to my Orca question.)
    (I assume that this is obvious, but just to be completely sure that this is clear, it probably makes sense to note explicitly that I, very much, appreciate that you are engaging on this topic)
    - Vladimir_Nesov Feb 3, 2024, 1:09 AM
      2 points
      0
      Parent
      Metaphorically, there is a question CEV tries to answer, and by “something like CEV” I meant any provisional answer to the appropriate question (so that CEV-as-currently-stated is an example of such an answer). Formulating an actionable answer is not a project humans would be ready to work on directly any time soon. So CEV is something to aim at by intention that defines CEV. If it’s not something to aim at, then it’s not a properly constructed CEV.
      
      This lack of a concrete formulation is the reason goodharting and corrigibility seem salient in operationalizing the process of formulating it and making use of the formulation-so-far. Any provisional formulation of an alignment target (such as CEV-as-currently-stated) would be a proxy, and so any optimization according to such proxy should be wary of goodharting and be corrigible to further refinement.
      
      The point of discussion of boundaries was in response to possible intuition that expected utility maximization tends to make its demands with great uniformity, with everything optimized in the same direction. Instead, a single goal may ask for different things to happen in different places, or to different people. It’s a more reasonable illustration of goal aggregation than utilitarianism that sums over measures of value from different people or things.
      - ThomasCederborg Feb 9, 2024, 6:21 PM
        1 point
        0
        Parent
        The version of CEV, that is described on the page that your CEV link leads to, is PCEV. The acronym PCEV was introduced by me. So this acronym does not appear on that page. But that’s PCEV that you link to. (in other words: the proposed design, that would lead to the LP outcome, can not be dismissed as some obscure version of CEV. It is the version that your own CEV link leads to. I am aware of the fact, that you are viewing PCEV as: ``a proxy for something else″ / ``a provisional attempt to describe what CEV is″. But this fact still seemed noteworthy)
        On terminology: If you are in fact using ``CEV″ as a shorthand, for ``an AI that implements the CEV of a single human designer″, then I think that you should be explicit about this. After thinking about this, I have decided that without explicit confirmation that this is in fact your intended usage, I will proceed as if you are using CEV as a shorthand, for ``an AI that implements the Coherent Extrapolated Volition of Humanity″ (but I would be perfectly happy to switch terminology, if I get such confirmation). (another reading of your text, is that: ``CEV″ (or: ``something like CEV″) is simply a label that you attach, to any good answer, to the correct phrasing of the ``what alignment target should be aimed at?″ question. That might actually be a sort of useful shorthand. In that case I would, somewhat oddly, have to phrase my claim as: under no reasonable set of definitions, does the Coherent Extrapolated Volition of Humanity, deserve the label ``CEV″ / ``something like CEV″. Due to the chosen label(s), the statement looks odd. But there is no more logical tension in the above statement, than there is logical tension in the following statement: ``under no reasonable set of definitions, does the Coherent Extrapolated Volition of Steve, result in the survival of any of Steve’s cells″ (which is presumably a true statement for at least some human individuals). Until I hear otherwise, I will however stay with the terminology, where ``CEV″ is shorthand for ``an AI that implements the Coherent Extrapolated Volition of Humanity″, or ``an AI that is helping humanity″, or something less precise, that is still hinting at something along those lines)
        It probably makes sense to clarify my own terminology some more. I think this can be done by noting, that I think that CEV, sounds like a perfectly reasonable way of helping ``a Group″ (including the PCEV version that you link to, and that implies the LP outcome). I just don’t think that helping ``a Group″ (that is made up of human individuals) is good for the (human) individuals that make up that ``Group″ (in expectation). Pointing a specific version of CEV (including PCEV) at a set of individuals, might be great for some other type of individuals. Let’s consider a large number of ``insatiable, Clippy like maximisers″. Each of them cares exclusively about the creation of a different, specific, complex object. No instances of any of these very complex objects will ever exist, unless someone looks at the exact specification of a given individual, and uses this specification to create such objects. In this case PCEV might, from the perspective of each of those individuals, be the best thing that can happen (if special influence is off the table). It is also worth noting, that a given human individual might get what she wants, if some specific version of CEV is implemented. But CEV, or ``helping humanity″, is not good, for human individuals, in exception, compared to extinction. And why would it be? Groups and human individuals are completely different types of things. And a human individual is very vulnerable to a powerful AI, that wants to hurt her. And humanity certainly looks like it contains an awful lot of ``will to hurt″, specifically directed at existing human individuals.
        If I zoom out a bit, I would describe the project of ``trying to describe what CEV is″ / ``trying to build an AI that helps humanity″ as: A project that searches for an AI design that helps an arbitrarily defined abstract entity. But this same project is, in practice, evaluating specific proposed AI designs, based on how they interact with a completely different type of thing: human individuals. You are for example presumably discarding PCEV, because the LP outcome implied by PCEV, contains a lot of suffering individuals (when PCEV is pointed at billions of humans). It is however not obvious to me why LP would be a bad way of helping an arbitrarily defined abstract entity (especially considering that the negotiation rules of PCEV simultaneously (i): implies LP, and is also (ii): an important part of the set of definitions, that is needed to differentiate the specific abstract entity that is to be helped, from the rest of the vast space of entities, that a mapping from billions-of-humans to the ``class-of-entities-that-can-be-said-to-want-things″, can point to). Thus, I suspect that PCEV is not actually being discarded, due to being bad at helping an abstract entity (and my guess it that PCEV is actually being discarded, because LP is bad for human individuals).
        I think that one reasonable way of moving past this situation, is to switch perspective. Specifically: adopt the perspective of a single human individual, in a population of billions, and ask: ``without giving her any special treatment, compared to other existing humans, what type of AI, would want to help her″. And then try to answer this question, while making as few assumptions about her as possible (for example making sure that there is no implicit assumption, regarding whether she is ``selfish or selfless″, or anything along those lines. Both ``selfless and selfish″ human individuals, would strongly prefer to avoid being a Heretic in LP. Thus, discarding PCEV does not contain an implicit assumption related to the ``selfish or selfless″ issue. Discarding PCEV, does however, involve an assumption, that human individuals are not like the ``insatiable Clippy maximisers″ mentioned above. So, such maximisers might justifiably feel ignored, when we discard PCEV. But no one can justifiably feel ignored when we discard PCEV, on account of where she is on the ``selfish or selfless″ spectrum). When one adopts this perspective, it becomes obvious to suggest that, the initial dynamic, should grant this individual meaningful influence, regarding the adoption of those preferences, that refer to her. Making sure that such influence, is included as a core aspect of the initial dynamic, is made even more important, by the fact, that the designers will be unable to consider all implications of a given project, and will be forced to rely on, potentially flawed, safety measures (for example along the lines of a ``Last Judge″ off switch, which might fail to trigger. Combined with a learned DWIKIM layer, that might turn out to be very literal, when interpreting some specific class of statements). If such influence is included, in the initial dynamic, then the resulting AI is no longer describable as ``doing what a Group wants it to do″. Thus, the resulting AI can not be described as a version of CEV. (it might however be describable as ``something like CEV″. Sort of how one can describe an Orca as ``something like a shark″, despite the fact that an Orca is not a type of shark (or a type of a fish). I would guess, that you would say, that an AI that grants such influence, as part of the initial dynamic, is not ``something like CEV″. But I’m not sure about this)
        (I should have added ``,in the initial dynamic,″ to the text in my earlier comments. It is explicit in the description of MPCEV, but I should have added this phrase to my comments here too. As a tangent, I agree that the intuition, that you were trying to counter, with your Boundaries / Membrane mention, is probably both common and importantly wrong. Countering this intuition makes sense, and I should have read this part of your comment more carefully. I would however like to note, that the description of the LP outcome, in the PCEV thought experiment, actually contains billions of (presumably very different) localities. Each locality is optimised according to very different criteria. Each place is designed to hurt a specific individual human Heretic. And each such location, is additionally bound by it’s own unique ``comprehension constraint″, that refers to the specific individual Heretic, being punished in that specific location)
        Perhaps a more straightforward way to move this discussion along is to ask a direct question, regarding what you would do if you were in the position, that I believe, that I find myself in. In other words: a well intentioned designer called John, wants to use PCEV as the alignment target for his project (rejecting any other version of CEV out of hand, by saying: ``if that is indeed a good idea, then it will be the outcome of Parliamentary Negotiations″). When someone points out that PCEV is a bad alignment target, John responds by saying that PCEV cannot, by definition, be a bad alignment target. John claims that any thought experiment, where PCEV leads to a bad outcome, must be due to a bad extrapolation of human individuals. John says that any given ``PCEV with a specific extrapolation procedure″ is just an attempt, to describe what PCEV is. If aiming at a given ``PCEV with a specific extrapolation procedure″ is a bad idea, then it is a badly constructed PCEV. Aiming at PCEV is a good idea, by intention that defines PCEV. John further says that his project will include features that (if they are implemented successfully, and are not built on top of any problematic unexamined implicit assumption) will to let John try again, if a given attempt to ``say what PCEV is″, fails. Do you agree that this project, is a bad idea? (compared to achievable alternatives, that start with a different set of, findable, assumptions) If so, what would you say to John? (what you are proposing is different from what John is proposing. I predict that you will say that John is making a mistake. My point is that, to me, it looks like you are making a mistake, of the same type as John’s mistake. So, I wonder what you would say to John (your behaviour in this exchange, is not the same as John’s behaviour in this thought experiment. But it looks to me, like you are making the same class of mistake, as John. So, I’m not asking how you would ``act in a debate, as a response to Johns behaviour″. Instead, I’m curious about how you would explain to John, that he is making an object level mistake))
        Or maybe a better approach, is to go less meta, and get into some technical details. So, let’s use the terminology in your CEV link, to explore some of the technical details in that post. What do you think would happen, if the learning algorithm that outputs the DWIKIM layer in John’s PCEV project, is built on top of an unexamined implicit assumption, that turns out to be wrong? Let’s say that the DWIKIM layer that pops out, interprets the request to build PCEV, as a request to implement the straightforward interpretation of PCEV. The DWIKIM layer happens to be very literal, when presented with the specific phrasing, used in the request. In other words: it interprets John as requesting, something along the lines of LP? I think this might result in an outcome, along the lines of LP (if the problems with the DWIKIM layer, stems form a problematic unexamined implicit assumption, related to extrapolation, then the exact same problematic assumption, might also render something along the lines of a ``Last Judge off switch add on″, ineffective). I think that it would be better, if John had aimed at something, that does not suffer from known, avoidable, s-risks. Something whose straightforward interpretation, is not known to imply an outcome, that would be far, far, worse than extinction. For the same reason, I make the further claim, that I do not think that it is a good idea, to subject everyone to the known, avoidable, s-risks associated with any AI, that is describable as ``doing what a Group wants″ (which includes all versions of CEV). Again, I’m certainly not against some feature that, might, let you try again, or that, might, re interpret an unsafe request, as a request for something completely different, that happens to be safe (such as, for example, a learned DWIKIM layer). I am aware of the fact, that you do not have absolute faith in the DWIKIM layer (if this layer is perfectly safe, in the sense of reliably re interpreting requests that straightforwardly imply LP, as something desirable to the designer. Then the full architecture would be functionally identical, to an AI, that simply does, whatever the designer wants the AI to do. In that case, you would not care what the request was. You might then, just as well have the designer ask the DWIKIM layer, for an AI, that maximises the number of bilberries. So, I am definitely not implying, that you are unaware, of the fact that the DWIKIM layer, is unable to provide reliable safety).
        Zooming out a bit, it is worth noting that the details of the safety measure(s) is actually not very relevant to the points that I am trying to make here. Any conceivable, human implemented, safety measure, might fail. And, more importantly, these measures do not help much, when one is deciding what to aim at. For example: MPCEV, can also be built on top of a (potentially flawed) DWIKIM layer, in the exacts same way as you can build CEV on top of a DWIKIM layer (and you can stick a ``Last Judge off switch add on″ to MPCEV too. Etc, etc, etc). Or in yet other words: anything, along the lines of, a ``Last Judge off switch add on″ can be used by many different projects aiming at many different targets. Thus, the ``Last Judge″ idea, or any other idea along those lines (including a DWIKIM layer), provides very limited help, when one decides what to aim at. And even more generally: regardless of what safety measure is used, John is, still, subjecting everyone to an unnecessary, avoidable, s-risk. I hope we can agree that John should not do that with, any, version of ``PCEV with a specific extrapolation procedure″. The further claim, that I am making, is that no one should do that with, any, ``Group AI″, for similar reasons. Surely, discovering that this further claim is true, cannot be, by definition, impossible.
        While re reading our exchange, I realised that I never actually clarified, that my primary reason for participating in this exchange (and my primary reason for publishing things on LW), is not actually to stop CEV projects. However, I think that a reasonable person might, based on my comments here, come to believe that my primary goal is to stop CEV projects (which is why the present clarification is needed). My focus is actually on trying to make progress on the ``what alignment target should be aimed at?″ question. In the present exchange, my target is the idea, that this question has already been given an answer (and, specifically, that the answer is CEV). The first step to progress, on the ``what alignment target should be aimed at?″ question, is to show that this question does not currently have an answer. This is importantly different, from saying that: ``CEV is the answer, but the details are unknown″ (I think that such a statement is importantly wrong. And I also think, that the fact that people still believe things along these lines, is standing in the way of getting a project off the ground, that is devoted to making progress on the ``what alignment target should be aimed at?″ question).
        I think that it is very unlikely, that the relevant people will stay committed to CEV, until the technology arrives, that would make it possible for them to hit CEV as an alignment target (the reason I find this unlikely, is that, (i): I believe that I have outlined a sufficient argument, to show that CEV is a bad idea, and (ii): I think that such technology will take time to arrive, and (iii): it seems likely that this team of designers, who are by assumption capable of hitting CEV, will be both careful enough to read that argument before reaching the point of no return on their CEV launch, and also capable enough to understand it. Thus, since the argument against CEV already exists, in my estimate, it would not make sense to focus on s-risks, related to a successfully implemented CEV). If that unlikely day ever does arrive, then I might switch focus, to trying to prevent direct CEV related s-risk, by arguing against this imminent CEV project. But I don’t expect to ever see this happening.
        The set of paths that I am actually focused on reducing the probability of, can be hinted at by outlining the following specific scenario. Imagine a well intentioned designer that we can call Dave, who is aiming for Currently Unknown Alignment Target X (CUATX). Due to an unexamined implicit assumption, that CUATX is built on top of, turning out to be wrong in a critical way, CUATX implies an outcome, along the lines of LP. But the issue that CUATX suffers from, is far more subtle than the issue that CEV suffers from. And progress on the ``what alignment target should be aimed at?″ question, has not yet progressed to the point, where this problematic unexamined implicit assumption can be seen. CUATX has all the features, that are known at launch time, to be necessary for safety (such as the necessary, but very much not sufficient, feature that any safe AI must give each individual, meaningful influence, regarding the adoption of those preferences, that refer to her). Thus, the CUATX idea leads to a CUATX project, which in turn leads to an, avoidable, outcome along the lines of LP (after some set of human implemented safety measures fail). That is the type of scenario that I am trying to avoid (by trying to make sufficient progress on the ``what alignment target should be aimed at?″ question, in time). My real ``opponent in this debate″ is an implemented CUATX, not the idea of CEV (and very definitely not you. Or anyone else that has contributed, or is likely to contribute, valuable insights related to the ``what alignment target should be aimed at?″ question). It just happens to be the case, that the effort to prevent CUATX, that I am trying to get off the ground, starts by showing that CEV, is not an answer, to the ``what alignment target should be aimed at?″ question. And you just happen to be the only person, that is pushing back against this in public (and again: I really appreciate the fact that you chose to engage on this topic).
        (I should also note explicitly, that I am most definitely not against exploring safety measures. They might stop CUATX. In some plausible scenarios, they might be the only realistic thing, that can stop CUATX. And I am not against treaties. And I am open to hearing more about the various human augmentation proposals that have been going around for many years. I am simply noting, that a safety measure, regardless of how clever it sounds, simply cannot fill the function of a substitute, for progress on the ``what alignment target should be aimed at?″ question. An attempt to get people to agree to a treaty might fail. Or a successfully implemented treaty might fail to actually prevent a race dynamic for long enough. And similarly, augmented humans might systematically tend towards being: (i): superior at alignment, (ii): superior at persuasion, (iii): well intentioned, and (iv): not better at dealing with the ``what alignment target should be aimed at?″ question, than the best baseline humans (but still, presumably, capable of understanding an insight on this question, at least if that insight is well explained). Regardless of augmentation technique, selection for ``technical ability and persuasion ability″ seems like a far more likely, de facto, outcome to me, due to being far easier to measure. I expect it to be far more difficult to measure the ability to deal with the ``what alignment target should be aimed at?″ question (and it is not obvious that the abilities needed to deal with the ``what alignment target should be aimed at?″ question, will be strongly correlated with the thing that I think will, de facto, have driven the trial and error augmentation process, of the augments that eventually hits an alignment target: ``technical-ability-and-persuasion-ability-and-ability-to-get-things-done″). Maybe the first augment will be great at making progress on the ``what alignment target should be aimed at?″ question, and will quickly render all previous work on this question irrelevant (and in that case, the persuasion ability is probably good for safety). But assuming that this will happen, seems like a very unsafe bet to make. Even more generally: I simply do not think that it is possible to come up with any type of clever sounding trick, that makes it safe to skip the ``what alignment target should be aimed at?″ question (to me, the ``revolution-analogy-argument″, in the 2004 CEV text, looks like a sufficient argument for the conclusion, that it is important to make progress on the ``what alignment target should be aimed at?″ question. But it seems like many people do not consider this, to be a sufficient argument for this conclusion. It is unclear to me, why this conclusion, seems to require such extensive further argument)).
        If my overall strategic goal was not clear, then this was probably my fault (in addition to not making this goal explicit, I also seem to have a tendency to lose focus on this larger strategic picture, during back and fourth technical exchanges).
        Two out of of my three LW posts are in fact entirely devoted to arguing, that making progress on the ``what alignment target should be aimed at?″ question, is urgent (in our present discussion, we have only talked about the one post, that is not exclusively focused on this). See:
        Making progress on the ``what alignment target should be aimed at?″ question, is urgent
        The proposal to add a ``Last Judge″ to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?″ question.
        (I am still very confused about this entire conversation. But I don’t think that re reading everything, yet again, will help much. I have been continually paying, at least some, attention to SL4, OB, and LW since around 2002-2003. I can’t remember exactly who said what when, or where. However, I have developed a strong intuition, that can be very roughly translated as: ``if something sounds strange, then it is very definitely not safe, to explain away this strangeness, by conveniently assuming that Nesov is confused on the object-level″. I am nowhere near the point where I would consider going against this intuition. So, I expect that I will remain very confused about this exchange, until there is some more information available. I don’t expect to be able to just think my way out of this one (wild speculation, regarding what it might be, that I was missing, by anyone that happens to stumble on this comment, at any point in the future, are very welcome. For example in a LW comment, or in a LW DM, or in an email))
        Vladimir_Nesov Feb 11, 2024, 2:27 PM
        4 points
        0
        Parent
        You are directing a lot of effort at debating details of particular proxies for an optimization target, pointing out flaws. My point is that strong optimization for any proxy that can be debated in this way is not a good idea, so improving such proxies doesn’t actually help. A sensible process for optimizing something has to involve continually improving formulations of the target as part of the process. It shouldn’t be just given any target that’s already formulated, since if it’s something that would seem to be useful to do, then the process is already fundamentally wrong in what it’s doing, and giving a better target won’t fix it.
        
        The way I see it, CEV-as-formulated is gesturing at the kind of thing an optimization target might look like. It’s in principle some sort of proxy for it, but it’s not an actionable proxy for anything that can’t come up with a better proxy on its own. So improving CEV-as-formulated might make the illustration better, but for anything remotely resembling its current form it’s not a useful step for actually building optimizers.
        
        Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that’s worth optimizing for. Boundaries seem like a promising direction for addressing the group vs. individual issues. Never optimizing for any proxy more strongly than its formulation is correct (and always pursuing improvement over current proxies) responds to there often being hidden flaws in alignment targets that lead to catastrophic outcomes.
        ThomasCederborg Feb 14, 2024, 5:08 PM
        1 point
        0
        Parent
        If your favoured alignment target suffers from a critical flaw, that is inherent in the core concept, then surely it must be useful for for you to discover this. So I assume that you agree that, conditioned on me being right about CEV suffering from such a flaw, you want me to tell you about this flaw. In other words, I think that I have demonstrated, that CEV suffers from a flaw, that is not related to any detail, of any specific version, or any specific description, or any specific proxy, or any specific attempt to describe what CEV is, or anything else along those lines. Instead, this flaw is inherent in the core concept, of building an AI that is describable as ``doing what a Group wants″. The Suffering Reducing AI (SRAI) alignment target is known to suffer from this type of a core flaw. The SRAI flaw is not related to any specific detail, of any specific version, or proxy, or attempt to describe what SRAI is, etc. And the flaw is not connected to any specific definition of ``Suffering″. Instead, the tendency to kill everyone, is inherent in the core concept of SRAI. It must surely be possible for you to update the probability that CEV also suffers from a critical flaw of this type (a flaw inherent in the core concept). SRAI sounds good on the surface, but it it is known to suffer from such a core flaw. Thus, the fact that CEV sounds good on the surface, does not rule out the existence of such a core flaw in CEV.
        I do not think, that it possible to justify making no update, when discovering that the version of CEV, that you linked to, implies an outcome that would be far, far worse that extinction. I think that the probability must go up, that CEV contains a critical flaw, inherent in the core concept. Outcomes massively worse than extinction, is not an inherent feature, of any conceivable detailed description, of any conceivable alignment target. To take a trivial example, such an outcome is not implied by any given specific description of SRAI. The only way that you can motivate not updating, is if you already take the position, that any conceivable AI, that is describable as ``implementing the Coherent Extrapolated Volition of Humanity″, will lead to an outcome that is far, far, worse than extinction. If this is your position, then you can justify not updating. But I do not think that this is your position (if this were your position, then I don’t think that CEV would be your favoured alignment target).
        And this is not filtered evidence, where I constructed a version of CEV and then showed problems in that version. It is the version that you link to, that would be far, far, worse than extinction. So, from your perspective, this is not filtered. Other designs that I have mentioned elsewhere, like USCEV, or the ``non stochastic version of PCEV″, are versions that other people have viewed as reasonable attempts to describe what CEV is. The fact that you would like AI projects to implement safety measures, that would (if they work as intended) protect against these types of dangers, is great. I strongly support that. I would not be particularly surprised if a technical insight in this type of work turns out to be completely critical. But this does not allow you to justify not updating on unfiltered data. You simply can not block off all conceivable paths, leading to a situation, where you conclude that CEV suffers from the same type of core flaw, that SRAI is known to suffer from.
        If one were to accept the line of argument, that all information of this type can be safely dismissed, then this would have very strange consequences. If Steve is running a SRAI project, then he could use this line of argument, to dismiss any finding, that a specific version of SRAI, leads to everyone dying. If Steve has a great set of safety measures, but simply does not update, when presented with the information, that a given version of SRAI would kill everyone, then Steve can never reach the point where he says: ``I was wrong. SRAI is not a good alignment target. The issue is not due to any details, of any specific version, or any specific definition or suffering, or anything else along those lines. The issue is inherent in the core concept of building an AI, that is describable as a SRAI. Regardless of how great some set of safety measures looks to the design team, no one should initiate a SRAI project″. Surely, you do not want to accept a line of argument, that would have allowed Steve, to indefinitely avoid making such a statement, in the face of any conceivable new information about the outcomes of different SRAI variants.
        The alternative to debating specific versions, is to make arguments on the level, of what one should expect based on the known properties of a given proposed alignment target. I have tried to do this and I will try again. For example, I wonder how you would answer the question: ``why would an AI, that does what an arbitrarily defined abstract entity wants that AI to do, be good for a human individual?″. One can discover that the Coherent Extrapolated Volition of Steve, would lead to the death of all of Steve’s cells (according to any reasonable set of definitions). One can similarly discover that the Coherent Extrapolated Volition of ``a Group″, is bad for the individuals in that group (according to any reasonable set of definitions). Neither statement suffers from any logical tension. For humans, this should in fact be the expected conclusion for any ``Group AI″, given that, (i): many humans certainly sound as if they will ask the AI to hurt other humans as much as possible, (ii): a human individual is very vulnerable, to a powerful AI that is trying to hurt her as much as possible, and (iii): in a ``Group AI″ no human individual can have any meaningful influence, in the initial dynamic, regarding the adoption of those preferences that refer to her (if the group is large). If you doubt the accuracy of one of these three points, then I would be happy to elaborate, on whichever one you find doubtful. None of this, has any connection, to any specific version, or proxy, or attempt to describe what CEV is, or anything else along those lines. It is all inherent in the core concept of CEV (and any other AI proposal, that is describable as ``building an AI that does what a group wants it to do″). If you want, we can restrict all further discussion to this form of argument.
        If one has already taken the full implications of (i), (ii), and (iii) into account, then one does not have to make a huge additional update, when observing an unfiltered massively-worse-than-extinction type outcome. But this is only because, when one has taken the full implications of (i), (ii), and (iii) into account, then one has presumably already concluded, that CEV suffers from a critical, core, flaw.
        I don’t understand your sentence: ``Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that’s worth optimizing for.″. The statement ``CEV is not a good alignment target″ does not imply the non existence of good alignment targets. Right? In other words: it looks to me like you are saying, that a rejection of CEV as an alignment target, is equivalent to a rejection of all conceivable alignment targets. To me, this sounds like nonsense, so I assume that this is not what you are saying. To take a trivial example: I don’t think that SRAI is a good alignment target. But surely a rejection of CEV does not imply a rejection of SRAI. Right? Just to be clear: I am definitely not postulating the non existence of good alignment targets. Discovering that ``the Coherent Extrapolated Volition of Steve implies the death of all his cells″, does not imply the non existence of alignment targets, where Steve’s cells survive. Similarly, discovering that ``the Coherent Extrapolated Volition of Humanity is bad for human individuals″, does not imply the non existence of alignment targets, that are good for human individuals. (I don’t think that good alignment targets are easy to find, or easy to describe, or easy to evaluate, etc. But that is a different issue)
        I think it’s best that I avoid building a whole argument, based on a guess, regarding what you mean here. But I do want to say, that if you are using ``CEV″ as a shorthand for ``the Coherent Extrapolated Volition of a single designer″, then you have to be explicit about this if you want me to understand you. And similarly: if ``CEV″ is simply a label, that you assign to any reasonable answer, to the ``what alignment target should be aimed at?″ question (provisional or otherwise), then you have to be explicit about this if you want me to understand you. If that is the case then I would have to phrase my claim as: ``Under no reasonable set of definitions does the Coherent Extrapolated Volition of Humanity deserve the label ``CEV‴’. This only sounds odd due to the chosen label. There is no more logical tension in that statement, than there is logical tension in the statement: ``Under no reasonable set of definitions, does the Coherent Extrapolated Volition of Steve, result in any of Steve’s cells surviving″ (discovering this about Steve should not be very surprising. And discovering this about Steve does not imply the non existence of alignment targets where Steve’s cells survive).
        
        PS:
        I am aware of the fact that you (and Yudkowsky, and Bostrom, and a bunch of other people), can not be reasonably described as having any form of reckless attitude along the lines of: ``Conditioned on knowing how to hit alignment targets, the thing to do is to just instantly hit some alignment target that sounds good″. I hope that it is obvious, that I am aware of this. But I wanted to be explicit about this, just in case it is not obvious to everyone, that I am aware of this. Given the fact that there is one of those green leaf thingies next to my username, it is probably best to be explicit about this sort of thing.
      - ThomasCederborg Feb 3, 2024, 4:32 AM
        1 point
        0
        Parent
        I think that ``CEV″ is usually used as shorthand for ``an AI that implements the CEV of Humanity″. This is what I am referring to, when I say ``CEV″. So, what I mean when I say that ``CEV is a bad alignment target″, is that, for any reasonable set of definitions, it is a bad idea, to build an AI, that does what ``a Group″ wants it to do (in expectation, from the perspective of essentially any human individual, compared to extinction). Since groups and individuals, are completely different types of things, it should not be surprising to learn, that doing what one type of thing wants (such as ``a Group″), is bad for a completely different type of thing (such as a human individual). In other words, I think that ``an AI that implements the CEV of Humanity″, is a bad alignment target, in the same sense, as I think that SRAI is a bad alignment target.
        But I don’t think your comment uses ``CEV″ in this sense. I assume that we can agree, that aiming for ``the CEV of a chimp″, can be discovered to be a bad idea (for example by referring to facts about chimps, and using thought experiments, to see what these facts about chimps, implies about likely outcomes). Similarly, it must be possible to discover, that aiming for ``the CEV of Humanity″, is also a bad idea (for human individuals). Surely, discovering this, cannot be, by definition, impossible. Thus, I think that you are in fact, not, using ``CEV″ as shorthand for ``an AI that implements the CEV of Humanity″. (I am referring to your sentence: ``If it’s not something to aim at, then it’s not a properly constructed CEV.″)
        Your comment makes perfect sense, if I read ``CEV″ as shorthand for ``an AI that implements the CEV of a single human designer″. I was not expecting this terminology. But it is a perfectly reasonable terminology, and I am happy to make my argument, using this terminology. If we are using this terminology, then I think that you are completely right, about the problem that I am trying to describe, being a proxy issue (thus, if this is was indeed your intended meaning, then I was completely wrong, when I said that I was not referring to a proxy issue. In this terminology, it is indeed a proxy issue). So, using this terminology, I would describe my concerns as: ``an AI that implements the CEV of Humanity″ is a predictably bad proxy, for ``an AI that implements the CEV of a single human designer″. Because ``an AI that implements the CEV of Humanity″, is far, far, worse, than extinction, form the perspective of essentially any human individual (which, presumably, disqualifies it as a proxy, for ``an AI that implements the CEV of a single human designer″. If this does not disqualify it as a proxy, then I think that this particular human designer, is a very dangerous person (from the perspective of essentially any human individual)). Using this terminology (and assuming a non unhinged designer), I would say that if your proposed project, is to use ``an AI that implements the CEV of Humanity″, as a proxy, for ``an AI that implements the CEV of a single human designer″, then this constitutes a, predictable, proxy failure. Further, I would say that pushing ahead, despite this predictable failure, with a project that is trying to implement ``an AI that implements the CEV of Humanity″ (as a proxy), inflicts an unnecessary s-risk, on everyone. Thus, I think it would be a bad idea, to pursue such a project (from the perspective of essentially any human individual. Presumably including the designer).
        If we take the case of Bob, and his Suffering Reducing AI (SRAI) project (and everyone has agreed to use this terminology), then we can tell Bob:
        SRAI is not a good proxy, for ``an AI that implements the CEV of Bob″ (assuming that you, Bob, do not want to kill everyone). Thus, you will run into a, predictable, issue, when your project tries to use SRAI as a proxy, for ``an AI that implements the CEV of Bob″. If you are implementing a safety measure successfully, then this will still, at best, lead to your project failing safely. At worst, your safety measure will fail, and SRAI will kill everyone. So, please don’t proceed with your project, given that it would put everyone at risk of being killed by SRAI (and this would be an unnecessary risk, because your project will predictably fail, due to a predictable proxy issue).
        By making sufficient progress, on the ``what alignment target should be aimed at?″ question, before Bob gets started on his SRAI project, it is possible to avoid the unnecessary extinction risks, associated with the proxy failure, that Bob will predictably run into, if his project uses SRAI, as a proxy for ``an AI that implements the CEV of Bob″. Similarly, it is possible to avoid the unnecessary s-risks, associated with the proxy failure, that Dave will predictably run into, if Dave uses ``an AI that implements the CEV of Humanity″, as a proxy, for ``an AI that implements the CEV of Dave″ (because any ``Group AI″, is very bad for human individuals (including Dave)).
        Mitigating the unnecessary extinction risks, that are inherent in any SRAI project, does not require an answer, to the ``what alignment target should be aimed at?″ question (it was a long time ago, but if I remember correctly, Yudkowsky did this, around two decades ago. It seems likely, that anyone that is careful and capable enough, to hit an alignment target, will be able to understand that old explanation, of why SRAI, is a bad alignment target. So, generating such an explanation, was sufficient for mitigating the extinction risks, associated with a successfully implemented SRAI. Generating such an explanation, did not require an answer, to the ``what alignment target should be aimed at?″ question. One can demonstrate that a given bad answer, is a bad answer, without having any good answer). Similarly, avoiding the unnecessary s-risks, that are inherent in any ``Group AI″ project, does not require an answer, to the ``what alignment target should be aimed at?″ question. (I strongly agree, that finding an actual answer to this question, is probably very, very, difficult. I am simply pointing out, that even partial progress, on this question, can be very useful)
        (I think that there are other issues, related to AI projects, whose purpose is to aim at ``the CEV, of a single human designer″. I will not get into this here, but I thought that it made sense, to at least mention, that there are other issues, related to this type of project)
        the gears to ascension Feb 3, 2024, 5:58 AM
        2 points
        0
        Parent
        
        Since groups and individuals, are completely different types of things,
        
        I don’t think this is obviously justifiable. It seems to me that cells work together to be a person, together tracking and implementing the agency of the aggregate system according to their interest as part of that combined entity, and in the same way, people work together to be a group, together tracking and implementing the agency of the group. I’m pretty sure that if you try to calculate my CEV with me in a box, you end up with an error like “import error: the rest of the reachable social graph of friendships and caring”. I cannot know what I want without deliberating with others who I intend to be in a society with long term, because I will know that whatever answer I give for my CEV, it will be highly probably misaligned with the rest of the people I care about. And I expect that the network of mutual utility across humanity is fairly well connected such that if I import friends, it ends up being a recursive import that requires evaluation of everyone on earth.
        
        (By the way, any chance you could use fewer commas? The reading speed I can reach on your comments are reduced by them due to having to bump up to deliberate thinking to check whatever I’ve joined sentence fragments the way you meant. No worries if not, though.)
        ThomasCederborg Feb 3, 2024, 3:56 PM
        3 points
        0
        Parent
        I think that extrapolation is a genuinely unintuitive concept. I would for example not be very surprised if it turns out that you are right, and that it is impossible to reasonably extrapolate you if the AI that is doing the extrapolation is cut off from all information about other humans. I don’t think that this fact is in tension with my statement, that individuals and groups are completely different types of things. Taking your cell analogy: I think that implementing the CEV of you could lead to the death of every single cell in your body (for example if your mind is uploaded in a way that does not preserve information about any individual cell). I don’t think that it is strange in general, if an extrapolated version of a human individual, is completely fine with the complete annihilation of every cell in her body (and this is true, despite the fact that ``hostility towards cells″ is not a common thing). Such an outcome is no indication of any technical failure, in an AI project, that was aiming for the CEV of that individual. This shows why there is no particular reason to think, that doing what a human individual wants, would be good for any of her cells (for any reasonable definition of ``doing what a human individual wants″). And this fact remains true, even if it is also the case, that a given cell would become impossible to understand, if that cell was isolated from other cells.
        A related tangent here relates to the fact that extrapolation is a genuinely unintuitive concept. I think that this has important implications for AI safety. This fact is for example central to my argument about ``Last Judge″ type proposals in my post:
        The proposal to add a ``Last Judge″ to an AI, does not remove the urgency, of making progress on the ``what alignment target should be aimed at?″ question.
        (I will try to reduce the commas. I see what you are talking about. I have in the past been forced to do something about an overuse of both footnotes and parentheses. Reading badly written academic history books seems to be making things worse (if one is analysing AI proposals where the AI is getting its goal from humans, then it makes sense to me to at least try to understand humans))
        the gears to ascension Feb 3, 2024, 5:57 PM
        3 points
        0
        Parent
        
        I think that implementing the CEV of you could lead to the death of every single cell in your body (for example if your mind is uploaded in a way that does not preserve information about any individual cell)
        
        I will take this bet at any amount. My cells are a beautiful work of art crafted by evolution, and I am a guest in their awesome society. Any future where my cells’ information is lost rather than transmuted and the original stored is unacceptable to me. Switching to another computational substrate without deep translation of the information in my cells is effectively guaranteed to need to examine the information in a significant fraction of my cells at a deep level, such that a generative model can be constructed which has significantly higher accuracy at cell information reconstruction than any generative model of today would. I suspect I am only unusual in having thought through this enough to identify this value, and that it is common in somewhat-less-transhumanist circles, usually manifesting as a resistance to augmentation rather than a desire to augment in a way that maintains a biology-like substrate.
        
        Now, to be clear, I do want to rewrite my cells at a deep level—a sort of highly advanced dynamics-faithful “style transfer” into some much more advanced substrate, in particular one that operates smoothly between temperatures 2 kelvin and ~310 kelvin or ideally much higher (though if it turns out that a long adaptation period is needed to switch between ultra low temp and ultra high temp, that’s fine, I expect that the chemicals that operate smoothly at the respective temperatures will look rather different). I also expect to not want to be stuck with using carbon; I don’t currently understand chemistry enough to confidently tell you any of the things I’m asking for in this paragraph are definitely possible, but my hunch is that there are other atoms which form stronger bonds and have smaller fields that could be used instead, ie classic precise nanotech sorts of stuff. probably takes a lot of energy to construct them, if they’re possible.
        
        But again, no uplift without being able to map the behaviors of my cells in high fidelity.
        Seth Herd Feb 3, 2024, 7:44 PM
        2 points
        0
        Parent
        Interesting. I haven’t heard this perspective. Can you say a little more about why you want to preserve the precise information in your cells? Is it solely about their impact on your mind’s function? What level of approximation would you be okay with?
        
        I’d be fine with having my mind simulated with a low-res body simulation, as long as that body felt more-or-less right and supported a range of moods and emotions similar to the ones I have now—but I’d be fine with a range of moods being not quite the same as the ones caused by the intricacies of my current body.
        ThomasCederborg Feb 3, 2024, 8:23 PM
        1 point
        0
        Parent
        I was clearly wrong regarding how you feel about your cells. But surely the question of whether or not an AI that is implementing the CEV of Steve, would result in any surviving cells, is an empirical question? (which must settled by referring to facts about Steve. And trying to figure out what these facts mean in terms of how the CEV of Steve would treat his cells). It cannot possibly be the case that it is impossible, by definition, to discover that any reasonable way of extrapolating Steve would result in all his cells dying?
        Thank you for engaging on this. Reading your description of how you view your own cells was a very informative window, into how a human mind can work. (I find it entirely possible, that I am very wrong regarding how most people view their cells. Or about how they would view their cells upon reflection. I will probably not try to introspect, regarding how I feel about my own cells, while this exchange is still fresh)
        Zooming out a bit, and looking at this entire conversation, I notice that I am very confused. I will try to take a step back from LW and gain some perspective, before I return to this debate.
  - ThomasCederborg Feb 2, 2024, 4:19 AM
    1 point
    0
    Parent
    It is getting late here, so I will stop after this comment, and look at this again tomorrow (I’m in Germany). Please treat the comment below as not fully thought through.
    The problem from my perspective, is that I don’t think that the objective, that you are trying to approximate, is a good objective (in other words, I am not referring to problems, related to optimising a proxy. They also exist, but they are not the focus of my current comments). I don’t think that it is a good idea, to do what an abstract entity, called ``humanity″, wants (and I think that this is true, from the perspective of essentially any human individual). I think that it would be rational, for essentially any human individual, to strongly oppose the launch of any such ``Group AI″. Human individuals, and groups, are completely different types of things. So, I don’t think that it should be the surprising, to learn that doing what a group wants, is bad for the individuals, in that group. This is a separate issue, from problems related to optimising for a proxy.
    I give one example, of how things can go wrong, in the post:
    A problem with the most recently published version of CEV
    This is of course just one specific example, and it is meant as an introduction, to the dangers, involved in building an AI, that is describable as ``doing what a group wants″. Showing that a specific version of CEV, would lead to an outcome, that is far, far, worse than extinction, does not, on its own, prove that all versions of CEV are dangerous. I do however think that all versions of CEV, are, very, very, dangerous. And I do think, that this specific thought experiment, can be used to hint at a more general problem. I also hope, that this thought experiment will at least be sufficient, for convincing most readers that there, might, exist a deeper problem, with the core concept. In other words, I hope that it will be sufficient, to convince most readers that you, might, be going after the wrong objective, when you are analysing different attempts ``to say what CEV is″.
    While I’m not actually talking about implementation, perhaps it would be more productive, to approach this from the implementation angle. How certain are you, that the concept of Boundaries / Membranes, provides reliable safety, for individuals, from a larger group, that contains the type of fanatics, described in the linked post? Let’s say that it turns out, that they do not, in fact, reliably provide such safety, for individuals. How certain are you then, that the first implemented system, that relies on Boundaries / Membranes, to protect individuals from such groups, will in fact result, in you being able to try again? I don’t think that you can possibly know this, with any degree of certainty. (I’m certainly not against safety measures. If anyone attempts to do what you are describing, then I certainly hope that this attempt will involve safety measures) (I also have nothing against the idea of Boundaries / Membranes)
    An alternative (or parallel) path, to trial and error, is to try to make progress on the ``what alignment target should be aimed at?″ question. Consider what you would say to Bob, who wants to build a Suffering Reducing AI (SRAI). He is very uncertain of his definition of ``Suffering″, and he is implementing safety systems. He knows that any formal definition of ``Suffering″ that he can come up with, will be a proxy, for the actually, correct, definition of Suffering. If it can be shown, that some specific implementation of SRAI, would lead to a bad outcome (such as an AI, that decides to kill everyone), then Bob will respond that the definition of Suffering, must be wrong (and that he has prepared safety systems, that will let him try to find a better definition of ``Suffering″).
    This might certainly end well. Bob’s safety systems might continue to work, until Bob realises, that the core idea, of building any AI, that is describable as a SRAI, will always lead to an AI, that simply kills everyone (in other words: until he realises, that he is going after the wrong objective). But I would say, that a better alternative, is to make enough progress, on the ``what alignment target should be aimed at?″ question, that it is possible to explain to Bob, that he is, in fact, going after the wrong objective (and is not, in fact, dealing with proxy issues). (in the case of SRAI, such progress has off course been around for a while. I think I remember reading an explanation of the ``SRAI issue″, written by Yudkowsky, decades ago. So, to deal with people like Bob, there is no actual need, for us, to make additional progress. But for people in a world where SRAI, is the state of the art, in terms of answering the ``what alignment target should be aimed at?″ question, I would advice them to focus on making further progress, on this question)
    Alternatively, I could ask what you would say to Bob, if he thinks that ``reducing Suffering″, is ``the objectively correct thing to do″, and is convinced, that any implementation that leads to bad outcomes (such as an AI, that kills everyone), must be a proxy issue? I think that, just as any reasonable definition of ``Suffering″, implies a SRAI, that kills everyone, any reasonable set of definitions of ``a Group″, implies a Group AI, that is bad for human individuals (in expectation, when that Group AI is pointed at billions of humans, from the perspective of essentially any human individual, in the set of humans, that the Group AI is pointed at, compared to extinction). In other words, a Group AI is bad for human individuals in expectation, in the same sense as a SRAI kills everyone. I’m definitely not saying that this is true for ``minds in general″. If Dave is able to reliably see all implications of any AI proposal (or if Dave is invulnerable to a powerful AI that is trying to hurt Dave. Or if the minds that the Group AI will be pointed at, are known to be ``friendly towards Dave″ in some formal sense, that is fully understood by Dave), then this might not be true for Dave. But I claim that it is true for human individuals.