Vladimir_Nesov comments on Managing risks while trying to do good

Vladimir_Nesov 11 Feb 2024 14:27 UTC
4 points
0
You are directing a lot of effort at debating details of particular proxies for an optimization target, pointing out flaws. My point is that strong optimization for any proxy that can be debated in this way is not a good idea, so improving such proxies doesn’t actually help. A sensible process for optimizing something has to involve continually improving formulations of the target as part of the process. It shouldn’t be just given any target that’s already formulated, since if it’s something that would seem to be useful to do, then the process is already fundamentally wrong in what it’s doing, and giving a better target won’t fix it.

The way I see it, CEV-as-formulated is gesturing at the kind of thing an optimization target might look like. It’s in principle some sort of proxy for it, but it’s not an actionable proxy for anything that can’t come up with a better proxy on its own. So improving CEV-as-formulated might make the illustration better, but for anything remotely resembling its current form it’s not a useful step for actually building optimizers.

Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that’s worth optimizing for. Boundaries seem like a promising direction for addressing the group vs. individual issues. Never optimizing for any proxy more strongly than its formulation is correct (and always pursuing improvement over current proxies) responds to there often being hidden flaws in alignment targets that lead to catastrophic outcomes.
- ThomasCederborg 14 Feb 2024 17:08 UTC
  1 point
  0
  If your favoured alignment target suffers from a critical flaw that is inherent in the core concept, then surely it must be useful for you to discover this. So I assume that you agree that, conditioned on me being right about CEV suffering from such a flaw, you want me to tell you about this flaw. In other words, I think that I have demonstrated that CEV suffers from a flaw that is not related to any detail, of any specific version, or any specific description, or any specific proxy, or any specific attempt to describe what CEV is, or anything else along those lines. Instead, this flaw is inherent in the core concept of building an AI that is describable as “doing what a Group wants”. The Suffering Reducing AI (SRAI) alignment target is known to suffer from this type of core flaw. The SRAI flaw is not related to any specific detail, of any specific version, or proxy, or attempt to describe what SRAI is, etc. And the flaw is not connected to any specific definition of “Suffering”. Instead, the tendency to kill everyone is inherent in the core concept of SRAI. It must surely be possible for you to update the probability that CEV also suffers from a critical flaw of this type (a flaw inherent in the core concept). SRAI sounds good on the surface, but it is known to suffer from such a core flaw. Thus, the fact that CEV sounds good on the surface does not rule out the existence of such a core flaw in CEV.
  I do not think that it is possible to justify making no update, when discovering that the version of CEV that you linked to implies an outcome that would be far, far worse than extinction. I think that the probability must go up, that CEV contains a critical flaw, inherent in the core concept. Outcomes massively worse than extinction are not an inherent feature of any conceivable detailed description, of any conceivable alignment target. To take a trivial example, such an outcome is not implied by any given specific description of SRAI. The only way that you can motivate not updating, is if you already take the position that any conceivable AI that is describable as “implementing the Coherent Extrapolated Volition of Humanity” will lead to an outcome that is far, far worse than extinction. If this is your position, then you can justify not updating. But I do not think that this is your position (if this were your position, then I don’t think that CEV would be your favoured alignment target).
  And this is not filtered evidence, where I constructed a version of CEV and then showed problems in that version. It is the version that you link to, that would be far, far worse than extinction. So, from your perspective, this is not filtered. Other designs that I have mentioned elsewhere, like USCEV, or the “non stochastic version of PCEV”, are versions that other people have viewed as reasonable attempts to describe what CEV is. The fact that you would like AI projects to implement safety measures, that would (if they work as intended) protect against these types of dangers, is great. I strongly support that. I would not be particularly surprised if a technical insight in this type of work turns out to be completely critical. But this does not allow you to justify not updating on unfiltered data. You simply can not block off all conceivable paths, leading to a situation, where you conclude that CEV suffers from the same type of core flaw, that SRAI is known to suffer from.
  If one were to accept the line of argument, that all information of this type can be safely dismissed, then this would have very strange consequences. If Steve is running a SRAI project, then he could use this line of argument to dismiss any finding that a specific version of SRAI leads to everyone dying. If Steve has a great set of safety measures, but simply does not update when presented with the information that a given version of SRAI would kill everyone, then Steve can never reach the point where he says: “I was wrong. SRAI is not a good alignment target. The issue is not due to any details, of any specific version, or any specific definition of suffering, or anything else along those lines. The issue is inherent in the core concept of building an AI that is describable as a SRAI. Regardless of how great some set of safety measures looks to the design team, no one should initiate a SRAI project”. Surely, you do not want to accept a line of argument that would have allowed Steve to indefinitely avoid making such a statement, in the face of any conceivable new information about the outcomes of different SRAI variants.
  The alternative to debating specific versions, is to make arguments on the level of what one should expect based on the known properties of a given proposed alignment target. I have tried to do this and I will try again. For example, I wonder how you would answer the question: “why would an AI, that does what an arbitrarily defined abstract entity wants that AI to do, be good for a human individual?”. One can discover that the Coherent Extrapolated Volition of Steve would lead to the death of all of Steve’s cells (according to any reasonable set of definitions). One can similarly discover that the Coherent Extrapolated Volition of “a Group” is bad for the individuals in that group (according to any reasonable set of definitions). Neither statement suffers from any logical tension. For humans, this should in fact be the expected conclusion for any “Group AI”, given that, (i): many humans certainly sound as if they will ask the AI to hurt other humans as much as possible, (ii): a human individual is very vulnerable to a powerful AI that is trying to hurt her as much as possible, and (iii): in a “Group AI” no human individual can have any meaningful influence, in the initial dynamic, regarding the adoption of those preferences that refer to her (if the group is large). If you doubt the accuracy of one of these three points, then I would be happy to elaborate on whichever one you find doubtful. None of this has any connection to any specific version, or proxy, or attempt to describe what CEV is, or anything else along those lines. It is all inherent in the core concept of CEV (and any other AI proposal that is describable as “building an AI that does what a group wants it to do”). If you want, we can restrict all further discussion to this form of argument.
  If one has already taken the full implications of (i), (ii), and (iii) into account, then one does not have to make a huge additional update, when observing an unfiltered massively-worse-than-extinction type outcome. But this is only because, when one has taken the full implications of (i), (ii), and (iii) into account, then one has presumably already concluded, that CEV suffers from a critical, core, flaw.
  I don’t understand your sentence: “Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that’s worth optimizing for.” The statement “CEV is not a good alignment target” does not imply the non existence of good alignment targets. Right? In other words: it looks to me like you are saying that a rejection of CEV as an alignment target is equivalent to a rejection of all conceivable alignment targets. To me, this sounds like nonsense, so I assume that this is not what you are saying. To take a trivial example: I don’t think that SRAI is a good alignment target. But surely a rejection of CEV does not imply a rejection of SRAI. Right? Just to be clear: I am definitely not postulating the non existence of good alignment targets. Discovering that “the Coherent Extrapolated Volition of Steve implies the death of all his cells” does not imply the non existence of alignment targets where Steve’s cells survive. Similarly, discovering that “the Coherent Extrapolated Volition of Humanity is bad for human individuals” does not imply the non existence of alignment targets that are good for human individuals. (I don’t think that good alignment targets are easy to find, or easy to describe, or easy to evaluate, etc. But that is a different issue.)
  I think it’s best that I avoid building a whole argument based on a guess regarding what you mean here. But I do want to say that if you are using “CEV” as a shorthand for “the Coherent Extrapolated Volition of a single designer”, then you have to be explicit about this if you want me to understand you. And similarly: if “CEV” is simply a label that you assign to any reasonable answer to the “what alignment target should be aimed at?” question (provisional or otherwise), then you have to be explicit about this if you want me to understand you. If that is the case then I would have to phrase my claim as: “Under no reasonable set of definitions does the Coherent Extrapolated Volition of Humanity deserve the label ‘CEV’”. This only sounds odd due to the chosen label. There is no more logical tension in that statement, than there is logical tension in the statement: “Under no reasonable set of definitions does the Coherent Extrapolated Volition of Steve result in any of Steve’s cells surviving” (discovering this about Steve should not be very surprising. And discovering this about Steve does not imply the non existence of alignment targets where Steve’s cells survive).
  
  PS:
  I am aware of the fact that you (and Yudkowsky, and Bostrom, and a bunch of other people) can not be reasonably described as having any form of reckless attitude along the lines of: “Conditioned on knowing how to hit alignment targets, the thing to do is to just instantly hit some alignment target that sounds good”. I hope that it is obvious that I am aware of this. But I wanted to be explicit about this, just in case it is not obvious to everyone that I am aware of this. Given the fact that there is one of those green leaf thingies next to my username, it is probably best to be explicit about this sort of thing.