sunwillrise comments on Alignment: “Do what I would have wanted you to do”

sunwillrise Jul 13, 2024, 5:47 PM
5 points
2
Honestly, this doesn’t seem like a good-faith response that does the slightest bit of interpretive labor. It’s the type of “gotcha” comment that I really wouldn’t have expected from you of all people. I’m not sure it’s even worthwhile to continue this conversation.
I have preferences right now; this statement makes sense in the type of low-specificity conversation dominated by intuition where we talk about such words as though they referred to real concepts that point to specific areas of reality. Those preferences are probably not coherent, in the sense that I can probably be money pumped by an intelligent enough agent that sets up a strange-to-my-current-self scenario. But they still exist, and one of them is to maintain a sufficient amount of money in my bank account to continue living a relatively high-quality life. Whether I “endorse” those preferences or not is entirely irrelevant to whether I have them right now; perhaps you could offer a rational argument to eventually convince me that you would make much better use of all my money, and then I would endorse giving you that money, but I don’t care about any of that right now. My current, unreflectively-endorsed self doesn’t want to part with what’s in my bank account, and that’s what is guiding my actions, not an idealized, reified future version.
None of this means anything conclusive about me ultimately endorsing these preferences in the reflective limit, about those preferences being stable under ontology shifts that reveal how my current ontology is hopelessly confused and reifies the analogues of ghosts, about there being any nonzero intersection between the end states of a process that tries to find my individual volition, or about changes to my physical and neurological make-up keeping my identity the same (in a decision-relevant sense relative to my values) when my memories and path through history change.
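For readers unfamiliar with the term, a money pump can be sketched in a few lines. This is a hypothetical toy and not anything taken from the thread: the items `A`, `B`, `C`, the fee, the starting wealth, and the helper `run_money_pump` are all assumptions made for illustration. It shows an agent with cyclic preferences paying for every swap and ending up where it started, while an agent with a transitive ranking stops trading once it holds its top item.

```python
def run_money_pump(prefers, start_item, offers, fee, wealth):
    """Offer items one at a time; the agent swaps (and pays `fee`)
    whenever it strictly prefers the offered item to the one it holds."""
    item = start_item
    for offered in offers:
        if wealth >= fee and prefers(offered, item):
            item = offered
            wealth -= fee
    return item, wealth


# Assumed cyclic preferences: A over B, B over C, C over A.
CYCLIC = {("A", "B"), ("B", "C"), ("C", "A")}


def cyclic_prefers(x, y):
    return (x, y) in CYCLIC


# Assumed transitive ranking for comparison: A over B over C.
RANK = {"A": 3, "B": 2, "C": 1}


def transitive_prefers(x, y):
    return RANK[x] > RANK[y]


offers = ["B", "A", "C"] * 3  # three laps around the cycle

pumped = run_money_pump(cyclic_prefers, "C", offers, fee=1, wealth=100)
stable = run_money_pump(transitive_prefers, "C", offers, fee=1, wealth=100)
print(pumped)  # the cyclic agent holds C again, minus nine fees
print(stable)  # the transitive agent pays twice, then refuses every offer
```

Having preferences at each step and being exploitable across steps are compatible here, which is the distinction the comment draws.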
- habryka Jul 13, 2024, 5:56 PM
  2 points
  0
  Parent
  It’s not a gotcha, I just really genuinely don’t get how the model you are explaining doesn’t just collapse into nothingness.
  
  Like, you currently clearly think that some of your preferences are more stable under reflection. And you have guesses and preferences over the type of reflection that makes your preferences better by your own lights. So seems like you want to apply one to the other. Doing that intellectual labor is the core of CEV.
  
  If you really have no meta level preferences (though I have no idea what that would mean since it’s part of everyday life to balance and decide between conflicting desires) then CEV outputs something at least as coherent as you are right now, which is plenty coherent given that you probably acquire resources and have goals. My guess is you can do a bunch better. But I don’t see any way for CEV to collapse into nothingness. It seems like it has to output something at least as coherent as you are now.
  
  So when you say “there is no coherence” that just seems blatantly contradicted by you standing before me and having coherent preferences, and not wanting to collapse into a puddle of incoherence.
  - sunwillrise Jul 13, 2024, 6:06 PM
    3 points
    2
    Parent
    > I just really genuinely don’t get how the model you are explaining doesn’t just collapse into nothingness.
    I mean, I literally quoted the Wei Dai post in my previous comment which made the point that it could be possible for the process of extrapolated volition to output “nothingness” as its answer. I don’t necessarily think that’s very likely, but it is a logically possible alternative to CEV.
    In any case, suppose you are right about this, about the fact that the model I am explaining collapses into nothingness. So what? “X, if true, leads to bad consequences, so X is false” is precisely the type of appeal-to-consequences reasoning that’s not a valid line of logic.
    This conversation, from my perspective, has gone like this:
    Me: [specific positive, factual claim about CEV not making conceptual sense, and about the fact that there might be no substantial part of human morality that’s coherent after applying the processes discussed here]
    You: If your model is right, then you just collapse into nothingness.
    Me: … So then I collapse into nothingness, right? The fact that it would be bad to do so according to my current, rough, unreflectively-endorsed “values” obviously doesn’t causally affect the truth value of the proposition about what CEV would output, which is a positive rather than normative question.
    - habryka Jul 13, 2024, 6:14 PM
      2 points
      0
      Parent
      I think you misunderstood what I meant by “collapse to nothingness”. I wasn’t referring to you collapsing into nothingness under CEV. I meant your logical argument outputting a contradiction (where the contradiction would be that you prefer to have no preferences right now).
      
      The thing I am saying is that I am pretty confident you don’t have meta preferences that when propagated will cause you to stop wanting things, because like, I think it’s just really obvious to both of us that wanting things is good. So inasmuch as that is a preference, you’ll take it into account in a reasonable CEV setup.
      
      We clearly both agree that there are ways to scale you up that are better or worse by your values. CEV is the process of doing our best to choose the better ways. We probably won’t find the very best way, but there are clearly ways through reflection space that are better than others and that we endorse more going down.
      
      You might stop earlier than I do, or might end up in a different place, but that doesn’t change the validity of the process that much, and clearly doesn’t result in you suddenly having no wants or preferences anymore (because why would you want that, and if you are worried about that, you can just make a hard commitment at the beginning to never change in ways that cause that).
      
      And yeah, maybe some reflection process will cause us to realize that actually everything is meaningless in a way that I would genuinely endorse. That seems fine but it isn’t something I need to weigh from my current vantage point. If it’s true, nothing I do matters anyways, but also it honestly seems very unlikely because I just have a lot of things I care about and I don’t see any good arguments that would cause me to stop caring about them.
      - sunwillrise Jul 13, 2024, 6:23 PM
        4 points
        1
        Parent
        > clearly doesn’t result in you suddenly having no wants or preferences anymore
        Of course, but I do expect the process of extrapolation to result in wants and preferences that are extremely varied and almost arbitrary, depending on the particular choice of scale-up and the path taken through it, and which have little to do with my current (incoherent, not-reflectively-endorsed) values.
        Moreover, while I expect extrapolation to output desires and values at any point we hit the “stop” button, I do not expect (and this is another reframing of precisely the argument that I have been making the entire time) those values to be coherent themselves. You can scale me up as much as you want, but when you stop and consider the being that emerges, my perspective is that this being can also be scaled up now to have an almost arbitrary set of new values.
        I think your most recent comment fails to disambiguate between “the output of the extrapolation process”, which I agree will be nonempty (similarly to how my current set of values is nonempty), and “the coherent output of the extrapolation process”, which I think might very well be empty, and in any case will most likely be very small in size (or measure) compared to the first one.
        habryka Jul 13, 2024, 6:33 PM
        4 points
        2
        Parent
        > I think your most recent comment fails to disambiguate between “the output of the extrapolation process”, which I agree will be nonempty (similarly to how my current set of values is nonempty), and “the coherent output of the extrapolation process”, which I think might very well be empty, and in any case will most likely be very small in size (or measure) compared to the first one.
        Hmm, this paragraph maybe points to some linguistic disagreement we have (and my guess is causing confusion in other cases).
        I feel like you are treating “coherent” as a binary, when I am treating it more as a sliding scale. Like, I think various embedded agency issues prevent an agent from being fully coherent (for example, a classical Bayesian formulation of coherence requires logical omniscience and is computationally impossible), but it’s also clearly the case that when I notice a Dutch book against me, and I update in a way that avoids future Dutch-bookings, I have in a meaningful way (but not in a way I could formalize) become more coherent.
        So what I am saying is something like “CEV will overall increase the degree of coherence of your values”. I totally agree that it will not get you all the way (whatever that means), and I also think we don’t have a formalization of coherence that we can talk about fully formally (though I think we have some formal tools that are useful).
        This gives me some probability we don’t disagree that much, but my sense is you are throwing out the baby with the bathwater in your response to Roger, and that that points to a real disagreement.
        Like, yes, I think for many overdetermined reasons we will not get something that looks like a utility function out of CEV (because computable utility functions over world-histories aren’t even a thing that can meaningfully exist in the real world). But it seems to me like something like “The Great Reflection” would be extremely valuable and should absolutely be the kind of thing we aim for with an AI, since I sure have updated a lot on what I reflectively endorse, and would like to go further down that path by learning more true things and getting smarter in ways that don’t break me.
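        As a worked illustration of the Dutch-book remark in this comment (a sketch with made-up numbers, not a formalization either commenter has proposed): suppose an agent holds credences $p = P(A) = 0.6$ and $q = P(\neg A) = 0.6$, and is willing to pay its credence for a ticket worth 1 if the corresponding event occurs. A bookie who sells it both tickets collects $p + q$ and pays out exactly 1 whichever way $A$ turns out:

        ```latex
        \begin{align*}
        \text{net}(A)      &= 1 - (p + q) = 1 - 1.2 = -0.2 \\
        \text{net}(\neg A) &= 1 - (p + q) = 1 - 1.2 = -0.2
        \end{align*}
        ```

        The guaranteed loss is $p + q - 1$, so moving the credences toward $p + q = 1$ shrinks it continuously, and at $p + q = 1$ this particular book extracts nothing. That is one narrow, assumed sense in which an update can make an agent more coherent by degrees; it covers only this one family of bets and says nothing about coherence in general.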
        sunwillrise Jul 13, 2024, 7:29 PM
        2 points
        2
        Parent
        > Hmm, this paragraph maybe points to some linguistic disagreement we have (and my guess is causing confusion in other cases).
        > I feel like you are treating “coherent” as a binary, when I am treating it more as a sliding scale.
        Alright, so, on the one hand, this is definitely helpful for this conversation because it has allowed me to much better understand what you’re saying. On the other hand, the sliding scale of coherence is much more confusing to me than the prior operationalizations of this concept were. I understand, at a mathematical level, what Eliezer (and you) mean by coherence, when viewed as binary. I don’t think the same is true when we have a sliding scale instead. This isn’t your fault, mind you; these are probably supposed to be confusing topics given our current, mostly-inadequate, state of understanding of them, but the ultimate effect is still what it is.
        I expect we would still have a great deal of leftover disagreement about where on that scale CEV would take a human when we start extrapolating. I’m also somewhat confident that no effective way of resolving that disagreement is available to us currently.
        > This gives me some probability we don’t disagree that much, but my sense is you are throwing out the baby with the bathwater in your response to Roger, and that that points to a real disagreement.
        Well, yes, I suppose. The standard response to critiques of CEV, from the moment they started appearing, is some version of Stuart Armstrong’s “we don’t need a full solution, just one good enough,” and there probably is some disagreement over what is “enough” and over how well CEV is likely to work.
        But there is also another side of this, which I mentioned at the very end of my initial comment that sparked this entire discussion, namely my expectation that overly optimistic ^[1] thinking about the “theoretical soundness and practical viability” of CEV “causes confusions and incorrect thinking more than it dissolves questions and creates knowledge.” I think this is another point of genuine disagreement, which is downstream of not just the object-level questions discussed here but also stuff like the viability of Realism about rationality, overall broad perspectives about rationality and agentic cognition, and other related things. These are broader philosophical issues, and I highly doubt much headway can be made to resolve this dispute through a mere series of comments.
        > But it seems to me like something like “The Great Reflection” would be extremely valuable
        Incidentally, I do agree with this to some extent:
        > I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you [Wei Dai] have been signaling throughout your recent posts and comments, it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.
        > Perhaps if a group of really smart philosophy-inclined people who have internalized the lessons of the Sequences without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like and which seem to be contradicted by the modularity, lack of agentic activity, moderate effectiveness of RLHF etc (overall just the empirical information) coming from recent SOTA models were to be given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection, something interesting would come out. But that is quite a long stretch at this point.
        ^[1] From my perspective, of course.
        dxu Jul 13, 2024, 11:03 PM
        4 points
        0
        Parent
        Can we not speak of apparent coherence relative to a particular standpoint? If a given system seems to be behaving in such a way that you personally can’t see a way to construct for it a Dutch book, a series of interactions with it such that energy/negentropy/resources can be extracted from it and accrue to you, that makes the system inexploitable with respect to you, and therefore at least as coherent as you are. The closer to maximal coherence a given system is, the less it will visibly depart from the appearance of coherent behavior, and hence utility function maximization; the fact that various quibbles can be made about various coherence theorems does not seem to me to negate this conclusion.
        
        Humans are more coherent than mice, and there are activities and processes which individual humans occasionally undergo in order to emerge more coherent than they did going in; in some sense this is the way it has to be, in any universe where (1) the initial conditions don’t start out giving you fully coherent embodied agents, and (2) physics requires continuity of physical processes, so that fully formed coherent embodied agents can’t spring into existence where there previously were none; there must be some pathway from incoherent, inanimate matter from which energy may be freely extracted, to highly organized configurations of matter from which energy may be extracted only with great difficulty, if it can be extracted at all.
        
        If you expect the endpoint of that process to not fully accord with the von Neumann-Morgenstern axioms, because somebody once challenged the completeness axiom, independence axiom, continuity axiom, etc., the question still remains as to whether departures from those axioms will give rise to exploitable holes in the behavior of such systems, from the perspective of much weaker agents such as ourselves. And if the answer is “no”, then it seems to me the search for ways to make a weaker, less coherent agent into a stronger, more coherent agent is well-motivated, and necessary: an appeal to consequences in a certain sense, yes, but one that I endorse!