sunwillrise comments on What is it to solve the alignment problem?

sunwillrise 25 Aug 2024 10:00 UTC
8 points
0
OK, but what is your “intent”? Presumably, it’s that something be done in accordance with your values-on-reflection, right?
No, I don’t think so at all. Pretty much the opposite, actually; if it were in accordance with my values-on-reflection, it would be value-aligned to me rather than intent-aligned. Collapsing the meaning of the latter into the former seems entirely unwise to me. After all, when I talk about my intent, I am explicitly not thinking about any long reflection process that gets at the “core” of my beliefs or anything like that;^[1] I am talking more about something like this:
I have preferences right now; this statement makes sense in the type of low-specificity conversation dominated by intuition where we talk about such words as though they referred to real concepts that point to specific areas of reality. Those preferences are probably not coherent, in the sense that I can probably be money-pumped by an intelligent enough agent that sets up a strange-to-my-current-self scenario. But they still exist, and one of them is to maintain a sufficient amount of money in my bank account to continue living a relatively high-quality life. Whether I “endorse” those preferences or not is entirely irrelevant to whether I have them right now; perhaps you could offer a rational argument to eventually convince me that you would make much better use of all my money, and then I would endorse giving you that money, but I don’t care about any of that right now. My current, unreflectively-endorsed self doesn’t want to part with what’s in my bank account, and that’s what is guiding my actions, not an idealized, reified future version.
None of this means anything conclusive about my ultimately endorsing these preferences in the reflective limit, about those preferences being stable under ontology shifts that reveal how my current ontology is hopelessly confused and reifies the analogues of ghosts, about there being any nonzero intersection between the end states of a process that tries to find my individual volition, or about changes to my physical and neurological make-up keeping my identity the same (in a decision-relevant sense relative to my values) when my memories and path through history change.
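To make the money-pump remark above concrete, here is a minimal sketch. It assumes a hypothetical agent with cyclic preferences (A over B, B over C, C over A) that will pay a fixed fee for any swap to an item it prefers; the names `UPGRADE` and `run_money_pump` are illustrative and do not come from the discussion.

```python
# Hypothetical agent with cyclic (incoherent) preferences: A > B > C > A.
# UPGRADE[x] is the item the agent would rather hold than x.
UPGRADE = {"B": "A", "C": "B", "A": "C"}


def run_money_pump(start_item, start_money, fee, rounds):
    """Offer the agent its preferred swap each round, charging `fee` per swap.

    Returns the (item, money) pair the agent ends up with.
    """
    item, money = start_item, start_money
    for _ in range(rounds):
        if money < fee:
            break  # the agent can no longer afford the trade
        item = UPGRADE[item]  # each swap looks like an improvement locally
        money -= fee
    return item, money


# After one full cycle the agent holds what it started with, minus the fees.
final_item, final_money = run_money_pump("A", 100, 1, 3)
```

Each individual trade is one the agent prefers, yet a full cycle leaves it with the same item and less money. That is the sense in which such preferences are exploitable while still being real preferences at every step.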
In any case, I am very skeptical of this whole values-on-reflection business,^[2] as I have written about at length in many different spots (1, 2, 3 come to mind off the top of my head). I am loath to keep copying the exposition of the same ideas over and over and over again (it also probably gets annoying to read at some point), but here is a relevant sample:
Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters (I am not referring to [Wei Dai] in particular here, since [Wei Dai] has already signaled an appropriate level of confusion about this). Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.
What counts as human “preferences”? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don’t seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?
How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?
In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren’t preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for “what the human actually wants” to be well-defined?
[...]
What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like “CEV” probably doesn’t make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to “release chemicals that induce the brain to rearrange itself” in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not “update” its brain chemistry the same way that a biological being does be “human” in any decision-relevant way?).
I do have some other thoughts on other parts of the post, which I might write out at some point.
1. ^
  Except in so much as my current, unreflectively-endorsed version has preferences over what preferences I should have or how they should develop in the future (which I do, but their aggregate effect does not dominate in these spots).
2. ^
  By which I mean, I am skeptical it exists as a coherent concept.
- Charlie Steiner 27 Aug 2024 16:26 UTC
  9 points
  2
  Parent
  I agree and yet I think it’s not actually that hard to make progress.
  There is no canonical way to pick out human values,^[1] and yet using an AI to make clever long-term plans implicitly makes some choice. You can’t dodge choosing how to interpret humans; if you think you’re dodging it, you’re just doing it in an unexamined way.
  Yes, humans are bad at philosophy and are capable of making things worse rather than better by examining them. I don’t have much to say other than get good. Just kludging together how the AI interprets humans seems to me likely to lead to problems, especially in a possible multipolar future where there’s more incentive for people to start using AI to make clever plans to steer the world.
  This absolutely means disposing of appealing notions like a unique CEV, or even an objectively best choice of AI to build, even as we make progress on developing standards for good AI to build.
  1. ^
    See the Reducing Goodhart sequence for me on this, which starts sketching some ways to deal with humans not being agents.
  - Seth Herd 29 Aug 2024 1:09 UTC
    2 points
    2
    Parent
    I agree and I think this is critical. The standard of getting >90% of the possible value from our lightcone, or similar, seems ridiculously high given the seemingly very real possibility of achieving zero or negative value.
    
    And it seems certain that there’s no absolute standard for achieving human values. What they are is path dependent.
    
    But we can still achieve an unimaginably good future by achieving ASI that does anything that humans roughly want.
- Vladimir_Nesov 25 Aug 2024 15:57 UTC
  2 points
  0
  Parent
  
  morality as fixed computation … decidedly not fixed … path-dependent
  
  Updatelessness teaches us that looking at the tree of possibilities as a whole is a saner point of view than looking at any one leaf, to the point that in the limit and where feasible you want to put the map of the whole tree in charge of the decision making at every leaf. So path-dependence is not necessarily a problem in principle, only in practice.
  
  Another problem is influence of others, and boundaries/membranes or respect for autonomy seem like clues towards abstracting these influences away without removing them altogether as sources of more possibilities, so that only appropriate external influences remain permitted to enter the updateless dataset of possible trajectories of reflection on morality. And each trajectory has potential to access the map of all trajectories, though a membrane might need to gate access to such a map.
  - sunwillrise 25 Aug 2024 16:22 UTC
    1 point
    0
    Parent
    Updatelessness sure seems nice from a theoretical perspective, but it has a ton of problems that go beyond what you just mentioned and which seem to me to basically doom the entire enterprise (at least with regard to what we are currently discussing, namely people):
    I am not aware of any method of operationalizing even a weak version of updatelessness in the context of cognitively limited human beings who do not have access to their own source code.
    I am pretty sure that a large portion of my values (and, by extension, the values of the vast majority of people) are indexical in nature, at least partly because my access to the outside world is mediated through sense data, which my S1 seems to value “terminally” and not as a mere proxy for preferences over current world-states. Indexicality seems to me to play very poorly with updatelessness (although I suspect you would know more about this than me, given your work in this area?).
    I don’t currently know of a way that humans can remain updateless even in a world (an inordinately optimistic one, it seems to me) in which we can actually access the “source code” by figuring out how to model the abstract classical computation performed by a particular (and reified) subset of the brain’s electronic circuit, basically for the reasons I gave in my comment to Wei Dai that I referenced earlier (“The feedback loops implicit in the structure of the brain cause reward and punishment signals to “release chemicals that induce the brain to rearrange itself” in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not “update” its brain chemistry the same way that a biological being does be “human” in any decision-relevant way?)”).
    I have a much broader skepticism about whether the concepts of “beliefs” and “values” make sense as distinct, coherent concepts that carve reality at the joints, which I think is reflected in some of the other points I made in my long list of questions and confusions about these matters. It doesn’t really seem to me like updatelessness solves this, or even necessarily offers a concrete path forward on it.
    Of course, I don’t expect that you are trying to literally say that going updateless gets rid of all the issues, but rather that thinking about it in those terms, after internalizing that perspective, helps put us in the right frame of mind to make progress on these philosophical and metaphilosophical matters moving forward. But, as I said at the end of my comment to Wei Dai:
    I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you have been signaling throughout your recent posts and comments, it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.
    Perhaps if a group of really smart philosophy-inclined people who have internalized the lessons of the Sequences without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like, conclusions which seem to be contradicted by the modularity, lack of agentic activity, moderate effectiveness of RLHF, etc. (overall just the empirical information) coming from recent SOTA models, were to be given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection, something interesting would come out. But that is quite a long stretch at this point.
    - Vladimir_Nesov 27 Aug 2024 4:38 UTC
      2 points
      0
      Parent
      Making maps is practical even when they are not as precise as the whole territory. The point is, path dependence happens in some space of possibilities, and it’s possible to make maps of that whole space and to make use of them to navigate the possibilities jointly, as opposed to getting caught in any one of them. This doesn’t need to involve global coherence across all possibilities (of moral reflection, in this case), just as optimization of the world doesn’t need to involve steamrolling it into repetition of some perfect pattern. But some parts will have similarities and shared issues with other parts, and can inform each other in their development.
      
      Updatelessness closer to something practical is consulting an external map of possibilities that gives advice on acting in the current situation and explains how following its advice influences the possibilities (in their further development that results from following the advice). That is, you don’t need to yourself “be updateless”; the essential observation is that a single computation can exist in many possible situations, and by being the same thing its evaluation will give the same results in all these situations, coordinating what happens in them (without the use of causal influence of some physical thing). This computation doesn’t need to be the whole agent. For example, a calculator on Mars computes the same results as a calculator (of a different make) on Earth, and both, by implementing the same computation, coordinate what happens on Mars with what happens on Earth without a need to physically communicate. This becomes a matter of decision theory when the coordinating computation is itself an agent. But it doesn’t need to be the same agent as a user of this decision theory as a whole; it doesn’t need to be something like a human. It can be much smaller and more legible, more like a calculator.
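The calculator example can be sketched in a few lines. This is only an illustration under stated assumptions: `earth_calculator`, `mars_calculator`, `act_on`, and the "launch"/"wait" actions are hypothetical names invented here. The two calculators are deliberately written differently ("of a different make") but implement the same computation, and they never exchange any data.

```python
def earth_calculator(n):
    """Sum of 1..n, computed with an explicit loop."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def mars_calculator(n):
    """The same sum, computed with the closed form n(n+1)/2."""
    return n * (n + 1) // 2


def act_on(result):
    """Hypothetical local rule: each site acts on its own calculator's output."""
    return "launch" if result % 2 == 0 else "wait"


# Both sites start from the same input but never communicate.
shared_input = 10
earth_action = act_on(earth_calculator(shared_input))
mars_action = act_on(mars_calculator(shared_input))

# The actions agree because both sites evaluated the same abstract computation.
actions_coordinated = earth_action == mars_action
```

Nothing in the code passes a message between the two sites. The agreement comes entirely from both implementations being instances of one computation, which is the sense of "coordination" used in the comment above.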