I agree that it’s plausible. I even think a strong form of moral realism (denial of the orthogonality thesis) is plausible.
Those are good and highly salient points (with the added comment that I would not go as far as to say possibility #1 in your post is “plausible”, as that seems to suggest a rather higher subjective probability of it being true than the mere “possible” does; Roko’s old post and the associated comment thread are highly relevant here). Nevertheless, I think the situation is even trickier and more confusing than you have illustrated here. Quoting Charlie Steiner’s excellent Reducing Goodhart sequence:
Humans don’t have our values written in Fortran on the inside of our skulls, we’re collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It’s not that there’s some pre-theoretic set of True Values hidden inside people and we’re merely having trouble getting to them—no, extracting any values at all from humans is a theory-laden act of inference, relying on choices like “which atoms exactly count as part of the person” and “what do you do if the person says different things at different times?”
The natural framing of Goodhart’s law—in both mathematics and casual language—makes the assumption that there’s some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty gritty of how to model humans and infer their values.
Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters (I am not referring to you in particular here, since you have already signaled an appropriate level of confusion about this). Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.
What counts as human “preferences”? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don’t seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?
How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?
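To make that last worry slightly more concrete, here is a minimal sketch of one way to formalize it (a toy framing of my own, with the simplifying assumption that indistinguishability is an equivalence relation, and with Ω standing for whatever outcome space we pretend to be using): write $\omega_1 \sim \omega_2$ when the agent cannot, even in principle, differentiate the outcomes $\omega_1$ and $\omega_2$ on the basis of sense data or thought. Then any utility function $u$ we could ever elicit from that agent must satisfy
$$\omega_1 \sim \omega_2 \;\Longrightarrow\; u(\omega_1) = u(\omega_2),$$
so $u$ factors through the quotient map $q : \Omega \to \Omega/\!\sim$ as $u = \tilde{u} \circ q$, and the agent’s “effective” sample space is the (possibly vastly coarser) $\Omega/\!\sim$ rather than the space of microphysically distinct world states we casually quantify over.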
In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren’t preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for “what the human actually wants” to be well-defined?
On the topic of agency, what exactly does that refer to in the real world? Do we not “first need a clean intuitively-correct mathematical operationalization of what “powerful agent” even means”? Are humans even agents, and if not, what exactly are we supposed to get out of approaches that are ultimately all about agency? How do we actually get from atoms to agents? (note that the posts in that eponymous sequence do not even come close to answering this question) More specifically, is a real-world being actually the same as the abstract computation its mind embodies? Rejections of souls and dualism, alongside arguments for physicalism, do not prove the computationalist thesis to be correct, as physicalism-without-computationalism is not only possible but also (as the very name implies) a priori far more faithful to the standard physicalist worldview.
What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like “CEV” probably doesn’t make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to “release chemicals that induce the brain to rearrange itself” in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not “update” its brain chemistry the same way that a biological being does be “human” in any decision-relevant way?). We can think about a continuous personal identity through the lens of mutual information about memories, personalities etc, but our current understanding of these topics is vastly incomplete and inadequate, and in any case the naive (yet very widespread, even on LW) interpretation of “the utility function is not up for grabs” as meaning that terminal values cannot be changed (or even make sense as a coherent concept) seems totally wrong.
The way communities make progress on philosophical matters is by assuming that certain answers are correct and then moving on. After all, you can’t ever get to the higher levels that require a solid foundation if you aren’t allowed to build such a foundation in the first place. But I worry, for reasons that have been stated before, that the vast majority of the discourse by “lay lesswrongers” (and, frankly, even far more experienced members of the community working directly on alignment research; as a sample illustration, see a foundational report’s failure to internalize the lesson of “Reward is not the optimization target”) is based on conclusions reached through informal and non-rigorous intuitions that lack the feedback loops necessary to ground themselves to reality because they do not do enough “homework problems” to dispel misconceptions and lingering confusions about complex and counterintuitive matters.
I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you have been signaling throughout your recent posts and comments, it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.
Perhaps something interesting would come out if a group of really smart, philosophy-inclined people who have internalized the lessons of the Sequences, without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like (conclusions which seem to be contradicted by the modularity, the lack of agentic activity, the moderate effectiveness of RLHF, and overall just the empirical information coming from recent SOTA models), were given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection. But that is quite a long stretch at this point.
In my view, the main good outcomes of the AI transition are: 1) we luck out, and AI x-safety turns out to be pretty easy across all the subproblems, or 2) there’s an AI pause, humans get smarter via things like embryo selection, and then solve all the safety problems.
I’m mainly pushing for #2, but also don’t want to accidentally make #1 less likely. It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, and it turns out the problems could have been solved with a little more effort.
“it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us” is a bit worrying from this perspective, and also because my “effort spent on them” isn’t that great. As I don’t have a good approach to answering these questions, I mainly just have them in the back of my mind while my conscious effort is mostly on other things.
BTW I’m curious what your background is and how you got interested/involved in AI x-safety. It seems rare for newcomers to the space (like you seem to be) to quickly catch up on all the ideas that have been developed on LW over the years, and many recently drawn to AGI instead appear to get stuck on positions/arguments from decades ago. For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism. Do you have any insights about this? (How were you able to do this? How to help others catch up? Intelligence is probably a big factor which is why I’m hoping that humanity will automatically handle these problems better once it gets smarter, but many seem plenty smart and still stuck on primitive ideas about AI x-safety.)
It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, and it turns out the problems could have been solved with a little more effort.
This isn’t unreasonable, but in order for this to be a meaningful concern, it’s possible that you would need to be close to enough people working on this topic that your misleading them would have a nontrivial impact. And I guess this… just doesn’t seem to be the case (at least to an outsider like me)? Even otherwise interested and intelligent people are focusing on other stuff, and while I suppose there may be some philosophy PhDs at OpenPhil or the Future of Humanity Institute (RIP) who are thinking about such matters, they seem few and far between.
That is to say, it sure would be nice if we got to a point where the main concern was “Wei Dai is unintentionally misleading people working on this issue” instead of “there are just too few people working on this to produce useful results even absent any misleadingness”.
My path to getting here is essentially the following:
reading Ezra Klein and Matt Yglesias because they seem saner and more policy-focused than other journalists → Yglesias writes an interesting blog post in defense of the Slate Star Codex, which I had heard of before but had never really paid much attention to → I start reading the SSC out of curiosity, and I am very impressed by how almost every post is interesting, thoughtful, and gives me insights which I had never considered but which seemed obvious in retrospect → Scott occasionally mentions this “rationality community” kinda-sorta centered around LW → I start reading LW in earnest, and I enjoy the high quality of the posts (and especially of the discussions in the comments), but I mostly avoid the AI risk stuff because it seems scary and weird; I also read HPMOR, which I find to be very fun but not necessarily well-written → I mess around with the beta version of ChatGPT around September 2022 and I am shocked by how advanced and coherent the LLM seems → I realize the AI stuff is really important and I need to get over myself and actually take it seriously
If I had to say what allowed me to reach this point, I would say the following properties, listed in no particular order, were critical (actually writing this list out feels kinda self-masturbatory, but oh well):
I was non-conformist enough to not immediately bounce off Scott’s writing or off of LW (which contains some really strange, atypical stuff)
I loved mathematics so much that I wasn’t thrown off by everything on this site (and in fact embraced the applications of mathematical ideas in everything)
I cared about philosophy, so I didn’t shy away from meta discussions and epistemology and getting really into the weeds of confusing topics like personal identity, agency, values and preferences etc
I had enough self-awareness to not become a parody of myself and to figure out what the important topics were instead of circlejerking over rationality or getting into rational fiction or other stuff like that
I was sufficiently non-mindkilled that I was able to remain open to fundamental shifts in understanding and to change my opinion in crucial ways
My sanity was resilient enough that I didn’t just run away or veer off in crazy directions when confronted with the frankly really scary AI stuff
I was intelligent enough to understand (at least to some extent) what is going on and to think critically about these matters
I was obsessive and focused enough to read everything available on LW (and on other related sites) about AI without getting bored or tired over time.
The problem is that (at least from my perspective) all of these qualities were necessary in order for me to follow the path I did. I believe that if any single one of them had been removed while the other 7 remained, I would not have ended up caring about AI safety. As a sample illustration of what can go wrong when points 1 and 6 aren’t sufficiently satisfied but everything else is, you can check out what happened with Qiaochu (also, you probably knew him personally? So there’s that too).
As you can imagine, this strongly limits the supply of people who can think sanely and meaningfully about AI safety topics. The need to satisfy points 1, 5, and 7 above already shrinks the candidate pool tremendously, and there are still all the other requirements to get through.
For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism.
Man, these guys… I get the impression that they are mindkilled in a very literal political sense. They seem to desperately await the arrival of the glorious Singularity that will free them from the oppression and horrors of Modernity and Capitalism. Of course, the fact that they are living better material lives than 99.9% of humans that have ever existed doesn’t seem to register, and I guess you can call this the epitome of entitlement and privilege.
But I don’t think most of them are bad people (insofar as it even makes sense to call a person bad). They just live in a very secure epistemic bubble that filters everything they read and think about and prevents them from ever touching reality. I’ve written similar stuff about this before:
society has always done, is doing, and likely will always do a tremendous job of getting us to self-sort into groups of people that are very similar to us in terms of culture, social and aesthetic preferences, and political leanings. This process has only been further augmented by the rise of social media and the entrenchment of large and self-sustaining information bubbles. In broad terms, people do not like to talk about the downsides of their proposed policies or general beliefs, and even more importantly, they do not communicate these downsides to other members of the bubble. Combined with the present reality of an ever-growing proportion of the population that relies almost entirely on the statements and attitudes of high-status members of the in-group as indicators of how to react to any piece of news, this leads to a rather remarkable equilibrium, in which otherwise sane individuals genuinely believe that the policies and goals they propose have 100% upside and 0% downside, and the only reason they don’t get implemented in the real world is because of wicked and stupid people on the other side who are evil or dumb enough to support policies that have 100% downside and 0% upside.
Trade-offs are an inevitable consequence of any discussion about meaningful changes to our existing system; simple Pareto improvements are extremely rare. However, widespread knowledge or admission of the existence of trade-offs does not just appear out of nowhere; in order for this reality to be acknowledged, it must be the case that people are exposed to counterarguments (or at the very least calls for caution) to the most extreme versions of in-group beliefs by trusted members of the in-group, because everyone else will be ignored or dismissed as bad-faith supporters of the opposition. Due to the dynamic mentioned earlier, this happens less and less, and beliefs get reinforced into becoming more and more extreme. Human beings thus end up with genuine (although self-serving and biased) convictions and beliefs that a neutral observer could nonetheless readily identify as irrational or nonsensical.
In the past, there used to be a moderating effect due to the much more shared nature of pop culture and group identity: if you were already predisposed to adopt extreme views, it was unlikely for you to find other similarly situated people in your neighborhood or coalition, as most groups you could belong to were far more mainstream and thus moderate. But now the Internet allows you to turn all that on its head with just a few mouse clicks; after all, no matter what intuition you may have about any slightly popular topic, there is very likely some community out there ready to take you in and tell you how smart and brave you are for thinking the right thoughts and not being one of the crazy, bad people who disagree.
The sanity waterline is very low and only getting lower. As lc has said, “the vast majority of people alive today are the effective mental subjects of some religion, political party, national identity, or combination of the three”. I would have hoped that CFAR was trying to solve that, but that apparently was not close to being true even though it was repeatedly advertised as aiming to “help people develop the abilities that let them meaningfully assist with the world’s most important problems, by improving their ability to arrive at accurate beliefs, act effectively in the real world, and sustainably care about that world” by “widen[ing] the bottleneck on thinking better and doing more.” I guess the actual point of CFAR (there was a super long twitter thread by Qiaochu on this at some point) was to give the appearance of being about rationality while the underlying goal was to nerd-snipe young math-inclined students to go work on mathematical alignment at MIRI? Anyway, I’m slightly veering off-topic.
How to help others catch up?
I don’t have a good answer to this question. Due to the considerations mentioned earlier, the most effective short-term way to get people who could contribute anything useful into AI safety is through selective rather than corrective or structural means, but there are just too few people who fit the requirements for this to scale nicely.
Over the long term, you can try to reverse the trends on general societal inadequacy and sanity, but this seems really hard, should have been started 20 years ago, and in any case requires actual decades before you can get meaningful outputs.
I’ll think about this some more and I’ll let you know if I have anything else to say.
Thanks for your insightful answers. You may want to make a top-level post on this topic to get more visibility. If only a very small fraction of the world is likely to ever understand and take into account many important ideas/considerations about AI x-safety, that changes the strategic picture considerably, and people around here may not be sufficiently “pricing it in”. I think I’m still in the process of updating on this myself.
Having more intelligence seems to directly or indirectly improve at least half of the items on your list. So doing an AI pause and waiting for (or encouraging) humans to become smarter still seems like the best strategy. Any thoughts on this?
And I guess this… just doesn’t seem to be the case (at least to an outsider like me)?
I may be too sensitive about unintentionally causing harm, after observing many others do this. I was also just responding to what you said earlier, where it seemed like I was maybe causing you personally to be too pessimistic about contributing to solving the problems.
you probably knew him personally?
No, I never met him and didn’t interact with him online much. He does seem like a good example of what you’re talking about.
Could that just shift the problem a bit? If we get a class of really smart people, they can subjugate everyone else pretty easily too, perhaps even better than some AGI, as they start with a really good understanding of human nature, cultures, and failings, and of how to exploit them for their own purposes. Or they could simply be better suited to taking advantage of, and surviving with, a more dangerous AI on the loose. We end up in some hybrid world where humanity is not extinct but most people’s lives are pretty poor.
I suppose one might say that the speed and magnitude of the advances here might be such that we get to corrigible AI before we get incorrigible superhumans.
I’m curious about your thoughts.
Quick caveat: I’m not trying to say all futures are bleak and no efforts lead where we want. I’m actually pretty positive about our future, even with AI (perhaps naively). We clearly already live in a world where the most intelligent could be said to “rule”, yet the rest of us average Joes are not slaves or serfs everywhere. Where problems exist, it is more a matter of cultural and legal failings than outright subjugation by the brighter bulbs. But going back to the darker side here, the ones that tend to successfully exploit, game, or ignore the rules are the smarter ones in the room.
If governments subsidize embryo selection, we should get a general uplift of everyone’s IQ (or everyone who decides to participate) so the resulting social dynamics shouldn’t be too different from today’s. Repeat that for a few generations, then build AGI (or debate/decide what else to do next). That’s the best scenario I can think of (aside from the “we luck out” ones).