I think I need more practice talking with people in real time (about intellectual topics). (I’ve gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.
Wei Dai
I’m not sure that fear or coercion has much to do with it, because there’s often no internal conflict when someone is caught up in some extreme form of the morality game; they’re just going along with it wholeheartedly, thinking they’re just being a good person or helping to advance the arc of history. In the subagents frame, I would say that the subagents have an implicit contract/agreement that any one of them can seize control, if doing so seems good for the overall agent in terms of power or social status.
But quite possibly I’m not getting your point, in which case please explain more, or point to some specific parts of your articles that are especially relevant?
My early posts on LW often consisted of pointing out places in the Sequences where Eliezer wasn’t careful enough. Shut Up and Divide? and Boredom vs. Scope Insensitivity come to mind. And of course that’s not the only way to gain status here—the big status awards are given for coming up with novel ideas and backing them up with carefully constructed arguments.
To branch off the line of thought in this comment, it seems that for most of my adult life I’ve been living in the bubble-within-a-bubble that is LessWrong, where the aspect of human value or motivation that is the focus of our signaling game is careful/skeptical inquiry, and we gain status by pointing out where others haven’t been careful or skeptical enough in their thinking. (To wit, my repeated accusations that Eliezer and the entire academic philosophy community tend to be overconfident in their philosophical reasoning, don’t properly appreciate the difficulty of philosophy as an enterprise, etc.)
I’m still extremely grateful to Eliezer for creating this community/bubble, and think that I/we have lucked into the One True Form of Moral Progress, but must acknowledge that from the outside, our game must look as absurd as any other niche status game that has spiraled out of control.
How would this ideology address value drift? I’ve been thinking a lot about the kind quoted in Morality is Scary. The way I would describe it now is that human morality is by default driven by a competitive status/signaling game, where often some random or historically contingent aspect of human value or motivation becomes the focal point of the game, and gets magnified/upweighted as a result of competitive dynamics, sometimes to an extreme, even absurd degree.
(Of course from the inside it doesn’t look absurd, but instead feels like moral progress. One example of this that I happened across recently is filial piety in China, which became more and more extreme over time, until someone cutting off a piece of their flesh to prepare a medicinal broth for an ailing parent was held up as a moral exemplar.)
Related to this is my realization that the kind of philosophy you and I are familiar with (analytical philosophy, or more broadly careful/skeptical philosophy) doesn’t exist in most of the world and may only exist in Anglophone countries as a historical accident. There, about 10,000 practitioners exist who are funded but ignored by the rest of the population. To most of humanity, “philosophy” is exemplified by Confucius (morality is everyone faithfully playing their feudal roles) or Engels (communism, dialectical materialism). To us, this kind of “philosophy” is hand-waving and making things up out of thin air, but to them, philosophy is learned from a young age and unquestioned. (Or if questioned, they’re liable to jump to some other equally hand-wavy “philosophy” like China’s move from Confucius to Engels.)
Empowering a group like this… are you sure that’s a good idea? Or perhaps you have some notion of “empowerment” in mind that takes these issues into account already and produces a good outcome anyway?
If you only care about the real world and you’re sure there’s only one real world, then the fact that you at time 0 would sometimes want to bind yourself at time 1 (e.g., physically commit to some action or self-modify to perform some action at time 1) seems very puzzling or indicates that something must be wrong, because at time 1 you’re in a strictly better epistemic position, having found out more information about which world is real, so what sense does it make that your decision theory makes you-at-time-0 decide to override you-at-time-1′s decision?
(If you believed in something like Tegmark IV but your values constantly change to only care about the subset of worlds that you’re in, then time inconsistency, and wanting to override your later selves, would make more sense, as your earlier self and later self would simply have different values. But it seems counterintuitive to be altruistic this way.)
Better control solutions make AI more economically useful, which speeds up the AI race and makes it even harder to do an AI pause.
When we have controlled unaligned AIs doing economically useful work, they probably won’t be very useful for solving alignment. Alignment will still be philosophically confusing, and it will be hard to trust the alignment work done by such AIs. Such AIs can help solve some parts of alignment problems, parts that are easy to verify, but alignment as a whole will still be bottlenecked on philosophically confusing, hard-to-verify parts.
Such AIs will probably be used to solve control problems for more powerful AIs, so the basic situation will continue and just become more fragile, with humans trying to control increasingly intelligent unaligned AIs. This seems unlikely to turn out well. They may also persuade some of us to trust their alignment work, even though we really shouldn’t.
So to go down this road is to bet that alignment has no philosophically confusing or hard to verify parts. I see some people saying this explicitly in the comments here, but why do they think that? How do they know? (I’m afraid that some people just don’t feel philosophically confused about much of anything, and will push forward on that basis.) But you do seem to worry about philosophical problems, which makes me confused about the position you take here.
BTW I have similar objections to working on relatively easy (i.e., unscalable) forms of alignment solutions, and using the resulting aligned AIs to solve alignment for more powerful AIs. But at least there, one might gain some insights into the harder alignment problems from working on the easy problems, potentially producing some useful strategic information or making it easier to verify future proposed alignment solutions. So while I don’t think that’s a good plan, this plan seems even worse.
And I agree with Bryan Caplan’s recent take that friendships are often a bigger conflict of interest than money, so Open Phil higher-ups being friends with Anthropic higher-ups is troubling.
No kidding. From https://www.openphilanthropy.org/grants/openai-general-support/:
OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela.
Wish OpenPhil and EAs in general were more willing to reflect/talk publicly about their mistakes. Kind of understandable given human nature, but still… (I wonder if there are any mistakes I’ve made that I should reflect more on.)
To be clear, by “indexical values” in that context I assume you mean indexing on whether a given world is “real” vs “counterfactual,” not just indexical in the sense of being egoistic? (Because I think there are compelling reasons to reject UDT without being egoistic.)
I think being indexical in this sense (while being altruistic) can also lead you to reject UDT, but it doesn’t seem “compelling” that one should be altruistic this way. Want to expand on that?
Maybe breaking up certain biofilms held together by Ca?
Yeah there’s a toothpaste on the market called Livfree that claims to work like this.
IIRC, high EDTA concentration was found to cause significant amounts of erosion.
Ok, that sounds bad. Thanks.
ETA: Found an article that explains how Livfree works in more detail:
Tooth surfaces are negatively charged, and so are bacteria; therefore, they should repel each other. However, salivary calcium coats the negative charges on the tooth surface and bacteria, allowing them to get very close (within 10 nm). At this point, van der Waal’s forces (attractive electrostatic forces at small distances) take over, allowing the bacteria to deposit on the tooth surfaces, initiating biofilm formation.10 A unique formulation of EDTA strengthens the negative electronic forces of the tooth, allowing the teeth to repel harmful plaque. This special formulation quickly penetrates through the plaque down to the tooth surface. There, it changes the surface charge back to negative by neutralizing the positively charged calcium ions. This new, stronger negative charge on the tooth surface environment simply allows the plaque and the tooth surface to repel each other. This requires neither an abrasive nor killing the bacteria (Figure 3).
The authors are very positive on this toothpaste, although they don’t directly explain why it doesn’t cause tooth erosion.
I actually no longer fully endorse UDT. It still seems a better decision theory approach than any other specific approach that I know, but it has a bunch of open problems and I’m not very confident that someone won’t eventually find a better approach that replaces it.
To your question, I think if my future self decides to follow (something like) UDT, it won’t be because I made a “commitment” to do it, but because my future self wants to follow it, because he thinks it’s the right thing to do, according to his best understanding of philosophy and normativity. I’m unsure about this, and the specific objection you have is probably covered under #1 in my list of open questions in the link above.
(And then there’s a very different scenario in which UDT gets used in the future, which is that it gets built into AIs, and then they keep using UDT until they decide not to, which if UDT is reflectively consistent would be never. I dis-endorse this even more strongly.)
Any thoughts on edathamil/EDTA or nano-hydroxyapatite toothpastes?
This means that in the future, there will likely be a spectrum of AIs of varying levels of intelligence, some much smarter than humans, others only slightly smarter, and still others merely human-level.
Are you imagining that the alignment problem is still unsolved in the future, such that all of these AIs are independent agents unaligned with each other (like humans currently are)? I guess in my imagined world, ASIs will have solved the alignment (or maybe control) problem at least for less intelligent agents, so you’d get large groups of AIs aligned with each other that can for many purposes be viewed as one large AI.
Building on (5), I generally expect AIs to calculate that it is not in their interest to expropriate wealth from other members of society, given how this could set a precedent for future wealth expropriation that comes back and hurts them selfishly.
At some point we’ll reach technological maturity, and the ASIs will be able to foresee all remaining future shocks/changes to their economic/political systems, and probably determine that expropriating humans (and anyone else they decide to, I agree it may not be limited to humans) won’t cause any future problems.
Even if a tiny fraction of consumer demand in the future is for stuff produced by humans, that could ensure high human wages simply because the economy will be so large.
This is only true if there’s not a single human that decides to freely copy or otherwise reproduce themselves and drive down human wages to subsistence. And I guess yeah, maybe AIs will have fetishes like this, but (like my reaction to Paul Christiano’s “1/trillion kindness” argument) I’m worried that AIs might also have less benign fetishes. In my mind, this worry more than cancels out the prospect that humans might live on, or earn a wage from, benign fetishes.
This might be the most important point on my list, despite saying it last, but I think humans will likely be able to eventually upgrade their intelligence, better allowing them to “keep up” with the state of the world in the future.
I agree this will happen eventually (if humans survive), but think it will take a long time, because we’ll have to solve a bunch of philosophical problems to determine how to do this safely (e.g., without losing or distorting our values), and we probably can’t trust AI’s help with these (although I’d love to change that, hence my focus on metaphilosophy). In the meantime AIs will be zooming ahead, partly because they started off thinking faster and partly because some will be reckless (like some humans currently are!) or have simple values that don’t require philosophical contemplation to understand. So the situation I described is still likely to occur.
It therefore seems perfectly plausible for AIs to simply get rich within the system we have already established, and make productive compromises, rather than violently overthrowing the system itself.
So assuming that AIs get rich peacefully within the system we have already established, we’ll end up with a situation in which ASIs produce all value in the economy, and humans produce nothing but receive an income and consume a bunch, through ownership of capital and/or taxing the ASIs. This part should be non-controversial, right?
At this point, it becomes a coordination problem for the ASIs to switch to a system in which humans no longer exist or no longer receive any income, and the ASIs get to consume or reinvest everything they produce. You’re essentially betting that ASIs can’t find a way to solve this coordination problem. This seems like a bad bet to me. (Intuitively it just doesn’t seem like a very hard problem, relative to what I imagine the capabilities of the ASIs to be.)
I’m simply arguing against the point that smart AIs will automatically turn violent and steal from agents who are less smart than they are unless they’re value aligned. This is a claim that I don’t think has been established with any reasonable degree of rigor.
I don’t know how to establish anything post-ASI “with any reasonable degree of rigor” but the above is an argument I recently thought of, which seems convincing, although of course you may disagree. (If someone has expressed this or a similar argument previously, please let me know.)
Why? Perhaps we’d do it out of moral uncertainty, thinking maybe we owe something to our former selves, but future people probably won’t think this.
Currently our utility is roughly log in money, partly because we spend money on instrumental goals and there are diminishing returns due to limited opportunities being used up. This won’t be true of future utilitarians spending resources on their terminal values. So “one in hundred million fraction” of resources is a much bigger deal to them than to us.
I have a slightly different take, which is that we can’t commit to doing this scheme even if we want to, because I don’t see what we can do today that would warrant the term “commitment”, i.e., would be binding on our post-singularity selves.
In either case (we can’t or don’t commit), the argument in the OP loses a lot of its force, because we don’t know whether post-singularity humans will decide to do this kind of scheme or not.
So the commitment I want to make is just my current self yelling at my future self, that “no, you should still bail us out even if ‘you’ don’t have a skin in the game anymore”. I expect myself to keep my word that I would probably honor a commitment like that, even if trading away 10 planets for 1 no longer seems like that good of an idea.
This doesn’t make much sense to me. Why would your future self “honor a commitment like that”, if the “commitment” is essentially just one agent yelling at another agent to do something the second agent doesn’t want to do? I don’t understand what moral (or physical or motivational) force your “commitment” is supposed to have on your future self, if your future self does not already think doing the simulation trade is a good idea.
I mean imagine if as a kid you made a “commitment” in the form of yelling at your future self that if you ever had lots of money you’d spend it all on comic books and action figures. Now as an adult you’d just ignore it, right?
Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof—without explicit construction—that it is possible.
The meta problem here is that you gave a “proof” (in quotes because I haven’t verified it myself as correct) using your own definitions of “aligned” and “superintelligence”, but if people asserting that it’s not possible in principle have different definitions in mind, then you haven’t actually shown them to be incorrect.
Apparently the current funding round hasn’t closed yet and might be in some trouble, and it seems much better for the world if the round were to fail or be done at a significantly lower valuation (in part to send a message to other CEOs not to imitate SamA’s recent behavior). Zvi saying that $150B greatly undervalues OpenAI at this time seems like a big unforced error, which I wonder if he could still correct in some way.
What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
I’m very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?
as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what’s happening in a way that corrupts thoughts which previously implemented values.
Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (or extensively researched/debated if we can’t definitely solve them) before starting something like CEV. This might involve creating the right virtual environment, social rules, epistemic norms, group composition, etc. A few things that seem easy to miss or get wrong:
Is it better to have no competition or some competition, and what kind? (Past “moral/philosophical progress” might have been caused or spread by competitive dynamics.)
How should social status work in CEV? (Past “progress” might have been driven by people motivated by certain kinds of status.)
No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)
can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its “true wants, needs, and hopes for the future”?
I think this is worth thinking about as well, as a parallel approach to the above. It seems related to metaphilosophy in that if we can discover what “correct philosophical reasoning” is, we can solve this problem by asking “What would this chunk of matter conclude if it were to follow correct philosophical reasoning?”
The One True Form of Moral Progress (according to me) is using careful philosophical reasoning to figure out what our values should be, what morality consists of, where our current moral beliefs are wrong, or generally, the contents of normativity (what we should and shouldn’t do). Does this still seem wrong to you?
The basic justification for this is that for any moral “progress” or change that is not based on careful philosophical reasoning, how can we know that it’s actually a change for the better? I don’t think I’ve written a post specifically about this, but Morality is Scary is related, in that it complains that most other kinds of moral change seem to be caused by status games amplifying random aspects of human values or motivation.