I think I need more practice talking with people in real time (about intellectual topics). (I’ve gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.
Wei Dai
So the commitment I want to make is just my current self yelling at my future self, that “no, you should still bail us out even if ‘you’ don’t have a skin in the game anymore”. I expect myself to keep my word that I would probably honor a commitment like that, even if trading away 10 planets for 1 no longer seems like that good of an idea.
This doesn’t make much sense to me. Why would your future self “honor a commitment like that”, if the “commitment” is essentially just one agent yelling at another agent to do something the second agent doesn’t want to do? I don’t understand what moral (or physical or motivational) force your “commitment” is supposed to have on your future self, if your future self does not already think doing the simulation trade is a good idea.
I mean imagine if as a kid you made a “commitment” in the form of yelling at your future self that if you ever had lots of money you’d spend it all on comic books and action figures. Now as an adult you’d just ignore it, right?
Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof—without explicit construction—that it is possible.
The meta problem here is that you gave a “proof” (in quotes because I haven’t verified it myself as correct) using your own definitions of “aligned” and “superintelligence”, but if people asserting that it’s not possible in principle have different definitions in mind, then you haven’t actually shown them to be incorrect.
Apparently the current funding round hasn’t closed yet and might be in some trouble, and it seems much better for the world if the round were to fail or be done at a significantly lower valuation (in part to send a message to other CEOs not to imitate SamA’s recent behavior). Zvi saying that $150B greatly undervalues OpenAI at this time seems like a big unforced error, which I wonder if he could still correct in some way.
What hunches do you currently have surrounding orthogonality, its truth or not, or things near it?
I’m very uncertain about it. Have you read Six Plausible Meta-Ethical Alternatives?
as far as I can tell humans should by default see themselves as having the same kind of alignment problem as AIs do, where amplification can potentially change what’s happening in a way that corrupts thoughts which previously implemented values.
Yeah, agreed that how to safely amplify oneself and reflect for long periods of time may be hard problems that should be solved (or extensively researched/debated if we can’t definitely solve them) before starting something like CEV. This might involve creating the right virtual environment, social rules, epistemic norms, group composition, etc. A few things that seem easy to miss or get wrong:
Is it better to have no competition or some competition, and what kind? (Past “moral/philosophical progress” might have been caused or spread by competitive dynamics.)
How should social status work in CEV? (Past “progress” might have been driven by people motivated by certain kinds of status.)
No danger or some danger? (Could a completely safe environment / no time pressure cause people to lose motivation or some other kind of value drift? Related: What determines the balance between intelligence signaling and virtue signaling?)
can we find a CEV-grade alignment solution that solves the self-and-other alignment problems in humans as well, such that this CEV can be run on any arbitrary chunk of matter and discover its “true wants, needs, and hopes for the future”?
I think this is worth thinking about as well, as a parallel approach from the above. It seems related to metaphilosophy in that if we can discover what “correct philosophical reasoning” is, we can solve this problem by asking “What would this chunk of matter conclude if it were to follow correct philosophical reasoning?”
As a tangent to my question, I wonder how many AI companies are already using RLAIF and not even aware of it. From a recent WSJ story:
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook.
When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write-up their responses—a complete violation of Scale’s raison d’être.
So they detected the cheating that time, but in RLHF how would they know if contractors used AI to select which of two AI responses is more preferred?
BTW here’s a poem(?) I wrote for Twitter, actually before coming across the above story:
The people try to align the board. The board tries to align the CEO. The CEO tries to align the managers. The managers try to align the employees. The employees try to align the contractors. The contractors sneak the work off to the AI. The AI tries to align the AI.
but we only need one person or group who we’d be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this.
Why do you think this, and how would you convince skeptics? And there are two separate issues here. One is how to know their CEV won’t be corrupted relative to what their values really are or should be, and the other is how to know that their real/normative values are actually highly altruistic. It seems hard to know both of these, and perhaps even harder to persuade others who may be very distrustful of such person/group from the start.
Another is that even if we don’t die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.
Would be interested in understanding your perspective on this better. I feel like aside from AI, our world is not being eaten by molochs very quickly, and I prefer something like stopping AI development and doing (voluntary and subsidized) embryo selection to increase human intelligence for a few generations, then letting the smarter humans decide what to do next. (Please contact me via PM if you want to have a chat about this.)
It’s also not clear to me that most of the value of AI will accrue to them. I’m confused about this though.
I’m also uncertain, and it’s another reason for going long a broad index instead. I would go even broader than S&P 500 if I could, but nothing else has option chains going out to 2029.
If indeed OpenAI does restructure to the point where its equity is now genuine, then $150 billion seems way too low as a valuation
Why is OpenAI worth much more than $150B, when Anthropic is currently valued at only $30-40B? Also, loudly broadcasting this reduces OpenAI’s cost of equity, which is undesirable if you think OpenAI is a bad actor.
To clarify, I don’t actually want you to scare people this way, because I don’t know if people can psychologically handle it or if it’s worth the emotional cost. I only bring it up myself to counteract people saying things like “AIs will care a little about humans and therefore keep them alive” or when discussing technical solutions/ideas, etc.
Should have made it much scarier. “Superhappies” caring about humans “not in the specific way that the humans wanted to be cared for” sounds better or at least no worse than death, whereas I’m concerned about s-risks, i.e., risks of worse than death scenarios.
If a misaligned AI had 1/trillion “protecting the preferences of whatever weak agents happen to exist in the world”, why couldn’t it also have 1/trillion other vaguely human-like preferences, such as “enjoy watching the suffering of one’s enemies” or “enjoy exercising arbitrary power over others”?
From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of-desires of the misaligned AI decides to do with humanity. (With the usual caveat that I’m very philosophically confused about how to think about all of this.)
And his response was basically to say that he already acknowledged my concern in his OP:
I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.
Personally, I have a bigger problem with people (like Paul and Carl) who talk about AIs keeping people alive but don’t talk about s-risks in the same breath, or only mention them in a vague, easy-to-miss way, than I have with Eliezer not addressing Paul’s arguments.
I’m thinking that the most ethical (morally least risky) way to “insure” against a scenario in which AI takes off and property/wealth still matters is to buy long-dated far out of the money S&P 500 calls. (The longest dated and farthest out of the money seems to be Dec 2029 10000-strike SPX calls. Spending $78 today on one of these gives a return of $10000 if SPX goes to 20000 by Dec 2029, for example.)
My reasoning here is that I don’t want to provide capital to AI industries or suppliers because that seems wrong given what I judge to be high x-risk their activities are causing (otherwise I’d directly invest in them), but I also want to have resources in a post-AGI future in case that turns out to be important for realizing my/moral values. Suggestions welcome for better/alternative ways to do this.
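For anyone who wants to check the arithmetic on the SPX call example above, here is a minimal sketch of the payoff at expiry. It takes the $78 premium and 10000 strike from the comment and treats them as per-index-point figures; that reading, the function name, and the sample index levels are my assumptions for illustration. It ignores the contract multiplier, fees, taxes, and any value the option has before expiry.

```python
def call_payoff_at_expiry(index_level: float, strike: float) -> float:
    """Intrinsic value of a call at expiry: max(S - K, 0)."""
    return max(index_level - strike, 0.0)


# Figures taken from the comment; treated here as per-index-point amounts.
premium = 78.0
strike = 10_000.0

# Index level at which the payoff just covers the premium paid.
break_even = strike + premium

# Hypothetical Dec 2029 index levels, chosen only for illustration.
for level in (8_000.0, 10_000.0, 20_000.0):
    payoff = call_payoff_at_expiry(level, strike)
    multiple = payoff / premium
    print(f"SPX at {level:>8.0f}: payoff {payoff:>8.0f}, {multiple:.1f}x the premium")

print(f"Break-even at expiry: {break_even:.0f}")
```

On these assumptions the call expires worthless unless SPX finishes above 10000, and at 20000 it pays 10000 per $78 spent, roughly 128 times the premium.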
What is going on with Constitutional AI? Does anyone know why no LLM aside from Claude (at least none that I can find) has used it? One would think that if it works about as well as RLHF (which it seems to), AI companies would be flocking to it to save on the cost of human labor?
Also, apparently ChatGPT doesn’t know that Constitutional AI is RLAIF (until I reminded it) and Gemini thinks RLAIF and RLHF are the same thing. (Apparently not a fluke as both models made the same error 2 out of 3 times.)
Once they get into CEV, they may not want to defer to others anymore, or may set things up with a large power/status imbalance between themselves and everyone else which may be detrimental to moral/philosophical progress. There are plenty of seemingly idealistic people in history refusing to give up or share power once they got power. The prudent thing to do seems to never get that much power in the first place, or to share it as soon as possible.
If you’re pretty sure you will defer to others once inside CEV, then you might as well do it outside CEV due to #1 in my grandparent comment.
The main asymmetries I see are:
Other people not trusting the group to not be corrupted by power and to reflect correctly on their values, or not trusting that they’ll decide to share power even after reflecting correctly. Thus “programmers” who decide to not share power from the start invite a lot of conflict. (In other words, CEV is partly just trying to not take power away from people, whereas I think you’ve been talking about giving AIs more power than they already have. “the sort of influence we imagine intentionally giving to AIs-with-different-values that we end up sharing the world with”)
The “programmers” not trusting themselves. I note that individuals or small groups trying to solve morality by themselves don’t have very good track records. They seem to too easily become wildly overconfident and/or get stuck in intellectual dead-ends. Arguably the only group that we have evidence for being able to make sustained philosophical progress is humanity as a whole.
To the extent that these considerations don’t justify giving every human equal power/weight in CEV, I may just disagree with Eliezer about that. (See also Hacking the CEV for Fun and Profit.)
About a week ago FAR.AI posted a bunch of talks at the 2024 Vienna Alignment Workshop to its YouTube channel, including Supervising AI on hard tasks by Jan Leike.
What do you think about my positions on these topics as laid out in and Six Plausible Meta-Ethical Alternatives and Ontological Crisis in Humans?
My overall position can be summarized as being uncertain about a lot of things, and wanting (some legitimate/trustworthy group, i.e., not myself as I don’t trust myself with that much power) to “grab hold of the whole future” in order to preserve option value, in case grabbing hold of the whole future turns out to be important. (Or some other way of preserving option value, such as preserving the status quo / doing AI pause.) I have trouble seeing how anyone can justifiably conclude “so don’t worry about grabbing hold of the whole future” as that requires confidently ruling out various philosophical positions as false, which I don’t know how to do. Have you reflected a bunch and really think you’re justified in concluding this?
E.g. in Ontological Crisis in Humans I wrote “Maybe we can solve many ethical problems simultaneously by discovering some generic algorithm that can be used by an agent to transition from any ontology to another?” which would contradict your “not expecting your preferences to extend into the distant future with many ontology changes” and I don’t know how to rule this out. You wrote in the OP “Current solutions, such as those discussed in MIRI’s Ontological Crises paper, are unsatisfying. Having looked at this problem for a while, I’m not convinced there is a satisfactory solution within the constraints presented.” but to me this seems like very weak evidence for the problem being actually unsolvable.
As long as all mature superintelligences in our universe don’t necessarily have (end up with) the same values, and only some such values can be identified with our values or what our values should be, AI alignment seems as important as ever. You mention “complications” from obliqueness, but haven’t people like Eliezer recognized similar complications pretty early, with ideas such as CEV?
It seems to me that from a practical perspective, as far as what we should do, your view is much closer to Eliezer’s view than to Land’s view (which implies that alignment doesn’t matter and we should just push to increase capabilities/intelligence). Do you agree/disagree with this?
It occurs to me that maybe you mean something like “Our current (non-extrapolated) values are our real values, and maybe it’s impossible to build or become a superintelligence that shares our real values so we’ll have to choose between alignment and superintelligence.” Is this close to your position?
I have a slightly different take, which is that we can’t commit to doing this scheme even if we want to, because I don’t see what we can do today that would warrant the term “commitment”, i.e., would be binding on our post-singularity selves.
In either case (we can’t or don’t commit), the argument in the OP loses a lot of its force, because we don’t know whether post-singularity humans will decide to do this kind of scheme or not.