I think you misunderstood what I meant by “collapse to nothingness”. I wasn’t referring to you collapsing into nothingness under CEV. I meant your logical argument outputting a contradiction (where the contradiction would be that you prefer to have no preferences right now).
The thing I am saying is that I am pretty confident you don’t have meta-preferences that, when propagated, will cause you to stop wanting things, because, like, I think it’s just really obvious to both of us that wanting things is good. So insofar as that is a preference, you’ll take it into account in a reasonable CEV setup.
We clearly both agree that there are ways to scale you up that are better or worse by your values. CEV is the process of doing our best to choose the better ways. We probably won’t find the very best way, but there are clearly paths through reflection space that are better than others and that we more strongly endorse going down.
You might stop earlier than I do, or might end up in a different place, but that doesn’t change the validity of the process that much, and it clearly doesn’t result in you suddenly having no wants or preferences anymore (because why would you want that, and if you are worried about it, you can just make a hard commitment at the beginning to never change in ways that cause that).
And yeah, maybe some reflection process will cause us to realize that everything is actually meaningless, in a way that I would genuinely endorse. That seems fine, but it isn’t something I need to weigh from my current vantage point: if it’s true, nothing I do matters anyway, and it also honestly seems very unlikely, because I just have a lot of things I care about and I don’t see any good arguments that would cause me to stop caring about them.
clearly doesn’t result in you suddenly having no wants or preferences anymore
Of course, but I do expect the process of extrapolation to result in wants and preferences that are extremely varied and almost arbitrary, depending on the particular choice of scale-up and the path taken through it, and which have little to do with my current (incoherent, not-reflectively-endorsed) values.
Moreover, while I expect extrapolation to output desires and values at any point we hit the “stop” button, I do not expect (and this is another reframing of precisely the argument that I have been making the entire time) those values to be coherent themselves. You can scale me up as much as you want, but when you stop and consider the being that emerges, my perspective is that this being can also be scaled up now to have an almost arbitrary set of new values.
I think your most recent comment fails to disambiguate between “the output of the extrapolation process”, which I agree will be nonempty (similarly to how my current set of values is nonempty), and “the coherent output of the extrapolation process”, which I think might very well be empty, and in any case will most likely be very small in size (or measure) compared to the first one.
Hmm, this paragraph maybe points to some linguistic disagreement we have (and my guess is that it’s causing confusion in other cases).
I feel like you are treating “coherent” as a binary, when I am treating it more as a sliding scale. Like, I think various embedded-agency issues prevent an agent from being fully coherent (for example, a classical Bayesian formulation of coherence requires logical omniscience and is computationally impossible), but it’s also clearly the case that when I notice a Dutch book against me, and I update in a way that avoids future Dutch-bookings, I have in a meaningful way (though not in a way I could formalize) become more coherent.
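As a purely illustrative toy version of that Dutch-book point, here is a minimal sketch; the numbers and the `guaranteed_loss` helper are my own assumptions, not anything from the discussion. It shows how credences on an event and its complement that sum to more than 1 leave a guaranteed loss on the table, and how renormalizing them closes that particular hole:

```python
# Toy Dutch-book illustration; the specific numbers are made up for the example.
# The agent will buy a ticket paying $1 on a proposition for a price equal to
# its credence in that proposition.

def guaranteed_loss(credence_a: float, credence_not_a: float) -> float:
    """Loss the agent locks in if it buys tickets on both A and not-A.

    It pays credence_a + credence_not_a up front, and exactly one of the two
    tickets pays out $1, so anything above 1 is lost no matter what happens.
    """
    return max(0.0, credence_a + credence_not_a - 1.0)

incoherent = (0.7, 0.5)                 # P(A) + P(not A) = 1.2 > 1: exploitable
print(guaranteed_loss(*incoherent))     # ~0.2 lost however A turns out

# One way to become "more coherent": renormalize so the two credences sum to 1.
total = sum(incoherent)
repaired = tuple(c / total for c in incoherent)
print(guaranteed_loss(*repaired))       # ~0.0 (up to rounding): this book is closed
```

Closing one book does not make the agent coherent overall, of course; it just removes one visible exploit, which is the “sliding scale” point.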
So what I am saying is something like “CEV will overall increase the degree of coherence of your values”. I totally agree that it will not get you all the way (whatever that means), and I also think we don’t have a formalization of coherence that we can talk about fully formally (though I think we have some formal tools that are useful).
This gives me some probability we don’t disagree that much, but my sense is you are throwing out the baby with the bathwater in your response to Roger, and that that points to a real disagreement.
Like, yes, I think for many overdetermined reasons we will not get something that looks like a utility function out of CEV (because computable utility functions over world-histories aren’t even a thing that can meaningfully exist in the real world). But it seems to me like something like “The Great Reflection” would be extremely valuable and should absolutely be the kind of thing we aim for with an AI, since I sure have updated a lot on what I reflectively endorse, and would like to go down further that path by learning more true things and getting smarter in ways that don’t break me.
Hmm, this paragraph maybe points to some linguistic disagreement we have (and my guess is that it’s causing confusion in other cases).
I feel like you are treating “coherent” as a binary, when I am treating it more as a sliding scale.
Alright, so, on the one hand, this is definitely helpful for this conversation, because it has allowed me to understand much better what you’re saying. On the other hand, the sliding scale of coherence is much more confusing to me than the prior operationalizations of this concept were. I understand, at a mathematical level, what Eliezer (and you) mean by coherence when viewed as a binary. I don’t think the same is true when we have a sliding scale instead. This isn’t your fault, mind you; these are probably supposed to be confusing topics given our current, mostly inadequate state of understanding of them, but the ultimate effect is still what it is.
I expect we would still have a great deal of leftover disagreement about where on that scale CEV would take a human when we start extrapolating. I’m also somewhat confident that no effective way of resolving that disagreement is available to us currently.
This gives me some probability we don’t disagree that much, but my sense is you are throwing out the baby with the bathwater in your response to Roger, and that that points to a real disagreement.
Well, yes, I suppose. The standard response to critiques of CEV, from the moment they started appearing, is some version of Stuart Armstrong’s “we don’t need a full solution, just one good enough,” and there probably is some disagreement over what is “enough” and over how well CEV is likely to work.
But there is also another side of this, which I mentioned at the very end of my initial comment that sparked this entire discussion, namely my expectation that overly optimistic [1] thinking about the “theoretical soundness and practical viability” of CEV “causes confusions and incorrect thinking more than it dissolves questions and creates knowledge.” I think this is another point of genuine disagreement, which is downstream of not just the object-level questions discussed here but also stuff like the viability of Realism about rationality, overall broad perspectives about rationality and agentic cognition, and other related things. These are broader philosophical issues, and I highly doubt much headway can be made to resolve this dispute through a mere series of comments.
But it seems to me like something like “The Great Reflection” would be extremely valuable
Incidentally, I do agree with this to some extent:
I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you [Wei Dai] have been signaling throughout your recent posts and comments, it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.
Perhaps if a group of really smart, philosophy-inclined people who have internalized the lessons of the Sequences, without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like (conclusions which seem to be contradicted by the modularity, lack of agentic activity, moderate effectiveness of RLHF, and other empirical information coming from recent SOTA models), were to be given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection, something interesting would come out. But that is quite a long stretch at this point.

[1] From my perspective, of course.
Can we not speak of apparent coherence relative to a particular standpoint? If a given system seems to be behaving in such a way that you personally can’t see how to construct a Dutch book against it, i.e. a series of interactions with it such that energy/negentropy/resources can be extracted from it and accrue to you, that makes the system inexploitable with respect to you, and therefore at least as coherent as you are. The closer to maximal coherence a given system is, the less it will visibly depart from the appearance of coherent behavior, and hence from utility-function maximization; the fact that various quibbles can be made about various coherence theorems does not seem to me to negate this conclusion.
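To illustrate what “inexploitable with respect to you” might look like operationally, here is a hypothetical sketch; the outcome names, posted prices, and tiny unit-stake search space are all assumptions for illustration. A weaker agent with a bounded search either finds a bet combination with guaranteed positive profit at the system’s posted prices, or it doesn’t; if it doesn’t, the system is apparently coherent relative to that agent’s search:

```python
# Hypothetical sketch of standpoint-relative exploitability. The outcomes,
# posted prices, and the tiny unit-stake search space are illustrative
# assumptions; a real bettor's "standpoint" would be far richer than this.
from itertools import product

def find_dutch_book(prices):
    """Search unit-stake bet combinations for a guaranteed positive profit.

    `prices` maps mutually exclusive, exhaustive outcomes to the system's
    posted price for a ticket paying $1 on that outcome. A stake of +1 buys
    one ticket, -1 sells one. Returns (stakes, guaranteed_profit) if a book
    is found within this limited search space, else None.
    """
    outcomes = list(prices)
    for stakes in product((-1, 0, 1), repeat=len(outcomes)):
        if not any(stakes):
            continue
        cost = sum(s * prices[o] for s, o in zip(stakes, outcomes))
        # Profit if outcome i occurs: the i-th ticket pays $1 per unit held.
        worst_case = min(stakes[i] - cost for i in range(len(outcomes)))
        if worst_case > 1e-9:
            return dict(zip(outcomes, stakes)), worst_case
    return None

print(find_dutch_book({"rain": 0.7, "no_rain": 0.5}))  # sell both tickets: ~0.2 guaranteed
print(find_dutch_book({"rain": 0.6, "no_rain": 0.4}))  # None: inexploitable to *this* searcher
```

A richer agent searching over larger stake combinations might still find a book that this one misses, which is exactly the standpoint-relativity being described.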
Humans are more coherent than mice, and there are activities and processes which individual humans occasionally undergo in order to emerge more coherent than they did going in; in some sense this is the way it has to be, in any universe where (1) the initial conditions don’t start out giving you fully coherent embodied agents, and (2) physics requires continuity of physical processes, so that fully formed coherent embodied agents can’t spring into existence where there previously were none; there must be some pathway from incoherent, inanimate matter from which energy may be freely extracted, to highly organized configurations of matter from which energy may be extracted only with great difficulty, if it can be extracted at all.
If you expect the endpoint of that process not to fully accord with the von Neumann–Morgenstern axioms, because somebody once challenged the completeness axiom, independence axiom, continuity axiom, etc., the question still remains as to whether departures from those axioms will give rise to exploitable holes in the behavior of such systems, from the perspective of much weaker agents such as ourselves. And if the answer is “no”, then it seems to me the search for ways to make a weaker, less coherent agent into a stronger, more coherent agent is well-motivated and necessary: an appeal to consequences in a certain sense, yes, but one that I endorse!
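For reference, here is an informal statement of the axioms being invoked (standard material on the von Neumann–Morgenstern theorem, not anything specific to this exchange):

```latex
% \succeq is a preference relation over lotteries A, B, C; pA + (1-p)C is the mixed lottery.
\begin{align*}
\text{Completeness:} \quad & \forall A,B:\; A \succeq B \;\text{or}\; B \succeq A \\
\text{Transitivity:} \quad & A \succeq B \;\text{and}\; B \succeq C \implies A \succeq C \\
\text{Continuity:}   \quad & A \succeq B \succeq C \implies \exists\, p \in [0,1]:\; pA + (1-p)C \sim B \\
\text{Independence:} \quad & A \succeq B \iff pA + (1-p)C \succeq pB + (1-p)C \quad \forall C,\; p \in (0,1] \\
\text{Theorem:}      \quad & \text{all four hold} \iff \exists\, u:\; A \succeq B \Leftrightarrow \mathbb{E}_A[u] \ge \mathbb{E}_B[u]
\end{align*}
```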