Maybe I am missing some part of this discussion, but I don’t get the last paragraph. It’s clear there are a lot of issues with CEV, but I also have no idea what the alternative to something like CEV as a point of comparison is supposed to be. In as much as I am a godshatter of wants, and I want to think about my preferences, I need to somehow come to a conclusion about how to choose between different features, and the basic shape of CEV feels like the obvious (and approximately only) option that I see in front of me.
I agree there is no “canonical” way to scale me up, but that doesn’t really change the need for some kind of answer to the question of “what kind of future do I want and how good could it be?”.
How does “instruction-following AI” have anything to do with this? Like, OK, now you have an AI that in some sense follows your instructions. What are you going to do with it?
My best guess is you are going to do something CEV-like, where you figure out what you want, have it help you reflect on your preferences, and then somehow empower you to realize more of them. Ideally it would fully internalize that process so it doesn’t need to rely on your slow biological brain and weak body, though of course you want to be very careful with that, since changes to values under reflection seem very sensitive to small changes in initial conditions.
It also seems to me that there’s relatively broad consensus on LW that you should not aim for CEV as a first thing to do with an AGI. It’s a thing you will do eventually, and aiming for it early does indeed seem doomed, but that’s not really what the concept or process is about. It’s about setting a target for what you want to eventually allow AI systems to help you with.
The Arbital article is also very clear about this:
CEV is meant to be the literally optimal or ideal or normative thing to do with an autonomous superintelligence, if you trust your ability to perfectly align a superintelligence on a very complicated target. (See below.)
CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)
It’s clear there are a lot of issues with CEV, but I also have no idea what the alternative to something like CEV as a point of comparison is supposed to be.
This reads like an invalid appeal-to-consequences argument. The basic point is that “there are no good alternatives to CEV”, even if true, does not provide meaningful evidence one way or another about whether CEV makes sense conceptually and gives correct and useful intuitions about these issues.
In as much as I am a godshatter of wants, and I want to think about my preferences, I need to somehow come to a conclusion about how to choose between different features
I mean, one possibility (unfortunate and disappointing as it would be if true) is what Wei Dai described 12 years ago:
By the way, I think nihilism often gets short changed around here. Given that we do not actually have at hand a solution to ontological crises in general or to the specific crisis that we face, what’s wrong with saying that the solution set may just be null? Given that evolution doesn’t constitute a particularly benevolent and farsighted designer, perhaps we may not be able to do much better than that poor spare-change collecting robot? If Eliezer is worried that actual AIs facing actual ontological crises could do worse than just crash, should we be very sanguine that for humans everything must “add up to moral normality”?
To expand a bit more on this possibility, many people have an aversion against moral arbitrariness, so we need at a minimum a utility translation scheme that’s principled enough to pass that filter. But our existing world models are a hodgepodge put together by evolution so there may not be any such sufficiently principled scheme, which (if other approaches to solving moral philosophy also don’t pan out) would leave us with legitimate feelings of “existential angst” and nihilism. One could perhaps still argue that any current such feelings are premature, but maybe some people have stronger intuitions than others that these problems are unsolvable?
So it’s not like CEV is the only logical possibility in front of us, or the only one we have enough evidence to raise to the level of a relevant hypothesis. As such, I see this as still being of the appeal-to-consequences form. It might very well be the case that CEV, despite all the challenges and skepticism, nonetheless remains the best or most dignified option to pursue (as a moonshot of sorts), but again, this has no impact on the object-level claims in my earlier comment.
How does “instruction-following AI” have anything to do with this? Like, OK, now you have an AI that in some sense follows your instructions. What are you going to do with it?
I think you’re talking at a completely different level of abstraction and focus than me. I made no statements about the normative desirability of instruction-following AI in my comment on Seth’s post. Instead, I simply claimed, as a positive, descriptive, factual matter, that I was confident value-aligned AGI would not come about (and likely could not come about because of what I thought were serious theoretical problems).
It also seems to me that there’s relatively broad consensus on LW that you should not aim for CEV as a first thing to do with an AGI.
I don’t think any relevant part of my comment is contingent on the timing of when you aim for CEV? Whether it’s the first thing you do with an AGI or not.
I was confused for a moment. You start out by saying there’s no alternative to CEV, then end up by saying there’s a consensus that CEV isn’t a good first alignment target.
Doesn’t that mean that whether or how to pursue CEV isn’t relevant to whether we live or die? It seems like we should focus on the alignment targets we’ll pursue first, and leave CEV and the deeper nature of values and preferences for the Long Reflection, if we can arrange to get one.
I certainly hope you’re right that there’s a de-facto consensus that CEV/value alignment probably isn’t relevant for our first do-or-die shots at alignment. It sure looks that way to me, so I’d like to see more LW brainpower going toward detailed analyses of the alignment schemes on which we’re most likely to bet the future.
I think it’s still relevant because it creates a rallying point around what to do after you’ve made substantial progress aligning AGI, which helps coordination in the run-up to it, but I agree that most effort should go into other approaches.