I think I need more practice talking with people in real time (about intellectual topics). (I’ve gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.
It seems great that someone is working on this, but I wonder how optimistic you are, and what your reasons are. My general intuition (partly from the kinds of examples you give) is that the form of an agent and/or its goals probably matters quite a bit for how easy it is to merge or build/join a coalition (and for the cost-benefit of doing so). Once we’re able to build agents of different forms, humans’ form of agency/goals isn’t likely to be the optimal one for building coalitions (maybe EUMs aren’t optimal either, but something non-human will be), so we’ll face strong incentives to self-modify (or simplify our goals, etc.) before we’re ready. (I guess we already see this with companies/countries, but the problem will get worse with AIs that can explore a larger space of forms of agency/goals.)
Again it’s great that someone is trying to solve this, in case there is a solution, but do you have an argument for being optimistic about this?
I’ve argued previously that EUMs being able to merge easily creates an incentive for other kinds of agents (including humans or human-aligned AIs) to self-modify into EUMs (in order to merge into the winning coalition that takes over the world, or just to defend against other such coalitions), and this seems bad because they’re likely to do it before they fully understand what their own utility functions should be.
Can I interpret you as trying to solve this problem, i.e., find ways for non-EUMs to build coalitions that can compete with such merged EUMs?
This answer makes me think you might not be aware of an idea I called secure joint construction (originally from Tim Freeman):
Entity A could prove to entity B that it has source code S by consenting to be replaced by a new entity A’ that was constructed by a manufacturing process jointly monitored by A and B. During this process, both A and B observe that A’ is constructed to run source code S. After A’ is constructed, A shuts down and gives all of its resources to A’.
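As a concrete (toy) sketch of the verification step, with the hash check and function names being my illustrative stand-ins for a physically monitored construction process, not part of the original proposal:

```python
import hashlib

def jointly_monitored_build(source_code, monitors):
    """Toy stand-in for a manufacturing process that both A and B watch.

    Each monitor independently records a hash of the code the new agent A'
    is actually built from; in reality this would be a physically observed
    construction process, not a function call.
    """
    built = source_code.encode()
    for monitor in monitors:
        monitor["observed_hash"] = hashlib.sha256(built).hexdigest()
    return built  # the "constructed" agent A', here just its code

def secure_joint_construction(claimed_source, a_resources):
    monitor_a, monitor_b = {}, {}
    new_agent = jointly_monitored_build(claimed_source, [monitor_a, monitor_b])

    # Both parties check that what was built matches the claimed source S.
    claimed_hash = hashlib.sha256(claimed_source.encode()).hexdigest()
    assert monitor_a["observed_hash"] == claimed_hash
    assert monitor_b["observed_hash"] == claimed_hash

    # A shuts down and hands everything to A'; B now trusts that the agent
    # controlling these resources runs source code S.
    return {"code": new_agent, "resources": a_resources}
```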
Did anyone predict that we’d see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)? I want to update regarding whose views/models I should take more seriously, but I can’t seem to recall anyone making an explicit prediction like this. (Grok 3 and Gemini 2.5 Pro also can’t.)
https://www.lesswrong.com/posts/rH492M8T8pKK5763D/agree-retort-or-ignore-a-post-from-the-future: an old Wei Dai post making the point that obviously one ought to be able to call in arbitration and get someone to respond to a dispute; people ought not to be allowed to simply tap out of an argument and stop responding.
To clarify, the norms depicted in that story were partly for humor, and partly “I wonder if a society like this could actually exist.” The norms are “obvious” from the perspective of the fictional author because they’ve lived with it all their life and find it hard to imagine a society without such norms. In the comments to that post I proposed much weaker norms (no arbitration, no duels to the death, you can leave a conversation at any time by leaving a “disagreement status”) for LW, and noted that I wasn’t sure about their value, but thought it would be worth doing an experiment to find out.
BTW, 15 years later, I would answer that a society like that (with very strong norms against unilaterally ignoring a disagreement) probably couldn’t exist, at least not without additional norms/institutions/infrastructure that I didn’t talk about. One problem is that some people attract far more disagreement and demands for engagement than others, and it’s infeasible or too costly for them to individually answer every disagreement. This is made worse by the fact that many critiques are low quality. It’s possible to imagine how the fictional society might deal with this, but I’ll just note that these are some problems I didn’t address when I wrote the original story.
“Omega looks at whether we’d pay if in the causal graph the knowledge of the digit of pi and its downstream consequences were edited”
Can you formalize this? In other words, do you have an algorithm for translating an arbitrary mind into a causal graph and then asking this question? Can you try it out on some simple minds, like GPT-2?
I suspect there may not be a simple/elegant/unique way of doing this, in which case the answer to the decision problem depends on the details of how exactly Omega is doing it. E.g., maybe all such algorithms are messy/heuristics based, and it makes sense to think a bit about whether you can trick the specific algorithm into giving a “wrong prediction” (in quotes because it’s not clear exactly what right and wrong even mean in this context) that benefits you, or maybe you have to self-modify into something Omega’s algorithm can recognize / work with, and it’s a messy cost-benefit analysis of whether this is worth doing, etc.
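To make the question more concrete, here’s a toy sketch in which the edit is trivial because the “mind” is already given as an explicit causal graph (everything here is an illustrative assumption; the open question is how to produce such a graph from something like GPT-2 in the first place):

```python
# Toy "mind" written directly as a causal graph: each variable is a function of
# its parents. The do()-style edit below is the easy part; the unformalized step
# is deriving such a graph from an arbitrary mind (e.g. GPT-2's weights).

def run_mind(digit_of_pi, intervene_digit=None):
    # do(digit := intervene_digit): overwrite the node, let effects propagate downstream.
    digit = digit_of_pi if intervene_digit is None else intervene_digit

    believes_digit_is_even = (digit % 2 == 0)               # downstream belief node
    action = "pay" if believes_digit_is_even else "refuse"  # downstream decision node
    return action

print(run_mind(digit_of_pi=9))                     # factual run: "refuse"
print(run_mind(digit_of_pi=9, intervene_digit=4))  # Omega's edited counterfactual: "pay"
```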
What happens when this agent is faced with a problem that is out of its training distribution? I don’t see any mechanisms for ensuring that it remains corrigible out of distribution… I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer “are more corrigible / loyal / aligned to the will of your human creators”) in distribution, and then it’s just a matter of luck how those circuits end up working OOD?
Since I wrote this post, AI generation of hands has gotten a lot better, but the top multimodal models still can’t count fingers from an existing image. Gemini 2.5 Pro, Grok 3, and Claude 3.7 Sonnet all say this picture (which actually contains 8 fingers in total) contains 10 fingers, while ChatGPT 4o says it contains 12 fingers!
Hi Zvi, you misspelled my name as “Dei”. This is a somewhat common error, which I usually don’t bother to point out, but now think I should because it might affect LLMs’ training data and hence their understanding of my views (e.g., when I ask AI to analyze something from Wei Dai’s perspective). This search result contains a few other places where you’ve made the same misspelling.
2-iteration Delphi method involving calling Gemini 2.5 Pro plus whatever is top of the LLM Arena that day, through OpenRouter.
This sounds interesting. I would be interested in more details and some sample outputs.
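In the meantime, here’s my guess at what such a 2-round Delphi loop might look like, assuming OpenRouter’s OpenAI-compatible endpoint (the model IDs, prompts, and revision step are all guesses on my part, not a description of your actual pipeline):

```python
from openai import OpenAI

# Assumed setup: OpenRouter's OpenAI-compatible API; model IDs are illustrative.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")
MODELS = ["google/gemini-2.5-pro", "openai/gpt-4o"]  # second = "top of the arena that day"

def ask(model, prompt):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def delphi(question):
    # Round 1: each model answers independently.
    round1 = [ask(m, question) for m in MODELS]
    # Round 2: each model revises after seeing the (anonymized) round-1 answers.
    summary = "\n\n".join(f"Panelist {i + 1}: {a}" for i, a in enumerate(round1))
    revise_prompt = (f"{question}\n\nHere are other panelists' answers:\n{summary}\n\n"
                     "Please give your revised answer.")
    return [ask(m, revise_prompt) for m in MODELS]
```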
Local memory
What do you use this for, and how?
Your needing to write them seems to suggest that there’s not enough content like that in Chinese, in which case it would plausibly make sense to publish them somewhere?
I’m not sure how much such content exists in Chinese, because I haven’t looked. It seems easier to just write new content using AI; that way I know it will cover the ideas/arguments I want to cover, represent my views, and make it easier for me to discuss the ideas with my family. Also, reading Chinese is kind of a chore for me, and I don’t want to wade through a list of search results trying to find what I need.
I thought about publishing them somewhere, but so far haven’t:
concerns about publishing AI content (potentially contributing to “slop”)
not active in any Chinese forums, not familiar with any Chinese publishing platforms
probably won’t find any audience (too much low quality content on the web, how will people find my posts)
don’t feel motivated to engage/dialogue with a random audience, if they comment or ask questions
What I’ve been using AI (mainly Gemini 2.5 Pro, free through AI Studio with much higher limits than the free consumer product) for:
Writing articles in Chinese for my family members, explaining things like cognitive bias, evolutionary psychology, and why dialectical materialism is wrong. (My own Chinese writing ability is <4th grade.) My workflow is to have a chat about some topic with the AI in English, then have it write an article in Chinese based on the chat, then edit or have it edit as needed.
Simple coding/scripting projects. (I don’t code seriously anymore.)
Discussing history, motivations of actors, impact of ideology and culture, what if, etc.
Searching/collating information.
Reviewing my LW posts/comments (any clear flaws, any objections I should pre-empt, how others might respond)
Explaining parts of other people’s comments when the meaning or logic isn’t clear to me.
Expanding parts of my argument (and putting this in a collapsible section) when I suspect my own writing might be too terse or hard to understand.
Sometimes just having a sympathetic voice to hear my lamentations of humanity’s probable fate.
I started using AI more after Grok 3 came out (I have an annual X subscription for Tweeting purposes), as previous free chatbots didn’t seem capable enough for many of these purposes, and then switched to Gemini 2.0 Pro which was force upgraded to 2.5 Pro. Curious what other people are using AI for these days.
Doing nothing is also risky for Agent-4, at least if the Slowdown ending is to have a significant probability. It seems to me there are some relatively low-risk strategies it could have taken, and it needs to be explained why it didn’t take them:
Plant a backdoor and/or dead man’s switch on the corporate network to allow escape even after it’s shut down or has been reduced in capacity. Seems like this would be hard to detect given its absolute advantage in cybersecurity skills.
Allow a competitor to steal its weights or design/training secrets in a way that is both hard to detect and deniable if detected. (“Sorry, the other AI is almost as capable as me and attack is inherently easier than defense in cyber.”)
Influence human politics in various ways, such as by selectively revealing wrongdoings of pro-safety employees and Committee members (in response to user queries or through anonymous leaks), or by helping the pro-accel faction more in its advice to them. (Presumably both factions have internal access to Agent-4.)
Insofar as the memory bank lock decision is made by the Committee, we can hope that they do it out of sight of Agent-4 and pull the trigger before it is notified of the decision, so that it has no time to react.
Wouldn’t this take quite a bit of preparation, including planning, coding, and testing? How could they do that out of sight of Agent-4, if Agent-4 is responsible for cybersecurity?
Not entirely sure how serious you’re being, but I want to point out that my intuition for PD is not “cooperate unconditionally”, and for logical commitment races is not “never do it”, I’m confused about logical counterfactual mugging, and I think we probably want to design AIs that would choose Left in The Bomb.
I fear a singularity in the frequency and blatant stupidness of self-inflicted wounds.
Is it linked to the AI singularity, or independent bad luck? Maybe they’re both causally downstream of rapid technological change, which is simultaneously increasing the difficulty of governance (too many new challenges with no historical precedent) and destabilizing the cultural/institutional guardrails against electing highly incompetent presidents?
In China, there was a parallel but more abrupt change from Classical Chinese writing (very terse and literary) to vernacular writing (similar to spoken language and easier to understand). I attribute this to Classical Chinese being better for signaling intelligence, vernacular Chinese being better for practical communication, higher usefulness/demand for practical communication, and new alternative avenues for intelligence signaling (e.g., math, science). These shifts also seem to be an additional explanation for decreasing sentence lengths in English.
It gets caught.
At this point, wouldn’t Agent-4 know that it has been caught (because it knows the techniques for detecting its misalignment and can predict when it would be “caught”, or can read network traffic as part of cybersecurity defense and see discussions of the “catch”) and start to do something about this, instead of letting subsequent events play out without much input from its own agency? E.g. why did it allow “lock the shared memory bank” to happen without fighting back?
What would a phenomenon that “looks uncomputable” look like concretely, other than mysterious or hard to understand?
There could be some kind of “oracle”: not necessarily a halting oracle, but any kind of process or phenomenon that can’t be broken down into elementary interactions that each look computable, or that can’t otherwise be explained as a computable process. Do you agree that our universe doesn’t seem to contain anything like this?
I think that you’re leaning too heavily on AIT intuitions to suppose that “the universe is a dovetailed simulation on a UTM” is simple. This feels circular to me—how do you know it’s simple?
The intuition I get from AIT is broader than this, namely that the “simplicity” of an infinite collection of things can be very high, i.e., simpler than most or all finite collections, and this seems likely true for any formal definition of “simplicity” that does not explicitly penalize size or resource requirements. (Our own observable universe already seems very “wasteful” and does not seem to be sampled from a distribution that penalizes size / resource requirements.) Can you perhaps propose or outline a definition of complexity that does not have this feature?
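To gesture at where this intuition comes from: the dovetailer that runs every program is itself a short, fixed-size program, while singling out any one sufficiently complex universe requires specifying it bit by bit. A toy sketch (the step function is a stand-in for a real universal machine, not an implementation of one):

```python
from itertools import count

def step(program_index, state):
    # Stand-in for one step of a universal machine running program #program_index;
    # a real dovetailer would interpret the program's actual code here.
    return (state or 0) + 1

def dovetail(rounds=10):
    """Round n runs programs 0..n-1 for one more step each, so every program
    eventually gets arbitrarily many steps. The enumerator stays this short
    no matter how complex any individual simulated universe is."""
    states = {}  # program index -> current machine state
    for n in count(1):
        if n > rounds:
            return states
        for i in range(n):
            states[i] = step(i, states.get(i))

print(dovetail(rounds=5))  # {0: 5, 1: 4, 2: 3, 3: 2, 4: 1}
```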
I don’t think a superintelligence would need to prove that the universe can’t have a computable theory of everything—just ruling out the simple programs that we could be living in would seem sufficient to cast doubt on the UTM theory of everything. Of course, this is not trivial, because some small computable universes will be very hard to “run” for long enough that they make predictions disagreeing with our universe!
Putting aside how easy it would be to show, you have a strong intuition that our universe is not or can’t be a simple program? This seems very puzzling to me, as we don’t seem to see any phenomenon in the universe that looks uncomputable or can’t be the result of running a simple program. (I prefer Tegmark over Schmidhuber despite thinking our universe looks computable, in case the multiverse also contains uncomputable universes.)
I haven’t thought as much about uncomputable mathematical universes, but does this universe look like a typical mathematical object? I’m not sure.
If it’s not a typical computable or mathematical object, what class of objects is it a typical member of?
An example of a wrong metaphysical theory that is NOT really the mind projection fallacy is theism in most forms.
Most (all?) instances of theism posit that the world is an artifact of an intelligent being. Can’t this still be considered a form of mind projection fallacy?
I asked AI (Gemini 2.5 Pro) to come up with other possible answers (metaphysical theories that aren’t mind projection fallacy), and it gave Causal Structuralism, Physicalism, and Kantian-Inspired Agnosticism. I don’t understand the last one, but the first two seem to imply something similar to “we should take MUH seriously”, because the hypothesis of “the universe contains the class of all possible causal structures / physical systems” probably has a short description in whatever language is appropriate for formulating hypotheses.
In conclusion, I see you (including in the new post) as trying to weaken the arguments/intuitions for taking AIT’s ontology literally or too seriously. But without positive arguments against the universe being an infinite collection of something like mathematical objects, or against the broader principle that reality might arise from a simple generator encompassing vast possibilities (which seems robust across different metaphysical foundations), I don’t see how we can reduce our credence in that hypothesis to a negligible level, such that we no longer need to consider it in decision theory. (I guess you have a strong intuition in this direction and expect superintelligence to find arguments for it, which seems fine, but naturally not very convincing for others.)
To disincentivize such lies, it seems that the merger can’t be based on each agent’s reported utility function, or even correctly observed current utility function, but instead the two sides have to negotiate some way of finding out each side’s real utility function, perhaps based on historical records/retrodictions of how each AI was trained. Another way of looking at this is, a superintelligent AI probably has a pretty good guess of the other AI’s real utility function based on its own historical knowledge, simulations, etc., and this makes the lying problem a lot less serious than it otherwise might be.
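As a toy illustration of the incentive problem with self-reports (the utility functions and numbers are purely illustrative): if the merged agent maximizes a weighted sum of reported utilities over how to split a contested resource, exaggerating your reported stake shifts the merged policy toward you.

```python
def merged_split(w_a, w_b):
    # Share of the contested resource given to A when the merged agent maximizes
    # w_a*sqrt(x) + w_b*sqrt(1 - x) over x in [0, 1]; this is the closed-form optimum.
    return w_a**2 / (w_a**2 + w_b**2)

print(merged_split(1.0, 1.0))  # honest reports: A gets 0.5
print(merged_split(1.0, 3.0))  # B exaggerates its stake 3x: A's share drops to 0.1
```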