I think I need more practice talking with people in real time (about intellectual topics). (I’ve gotten much more used to text chat/comments, which I like because it puts less time pressure on me to think and respond quickly, but I feel like I now incur a large cost due to excessively shying away from talking to people, hence the desire for practice.) If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.
Wei Dai
The power of scaling is that with real unique data, however unoriginal, the logarithmic progress doesn’t falter, it still continues its logarithmic slog at an exponential expense rather than genuinely plateauing.
How to make sense of this? If the additional training data is mostly low quality (AI labs must have used the highest quality data first?) or repetitive (contains no new ideas/knowledge), perplexity might go down but what is the LLM really learning?
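To make the quoted “logarithmic slog at an exponential expense” concrete, here is a minimal sketch under an assumed Chinchilla-style power law for the data-limited loss, L(D) = E + B / D^β (the constants below are roughly the published Chinchilla fit, used purely for illustration rather than as a claim about current models):

```python
# A rough sketch (not a fitted model) of "logarithmic progress at exponential expense":
# under an assumed Chinchilla-style power law, each equal-sized drop in loss
# requires multiplying the amount of training data, i.e. exponentially more tokens.

E, B, beta = 1.69, 410.7, 0.28  # illustrative constants, roughly the published Chinchilla fit

def data_limited_loss(tokens: float) -> float:
    """Loss as a function of training tokens, holding model size and compute fixed."""
    return E + B / tokens**beta

prev = None
for tokens in [1e9, 1e10, 1e11, 1e12, 1e13]:
    loss = data_limited_loss(tokens)
    delta = "" if prev is None else f"  (improvement: {prev - loss:.3f})"
    print(f"{tokens:.0e} tokens -> loss {loss:.3f}{delta}")
    prev = loss
# Each 10x increase in data buys a smaller absolute loss reduction, so perplexity
# keeps creeping down on ever more (possibly low-quality or repetitive) data,
# which is exactly what raises the question of what is actually being learned.
```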
You realize that from my perspective, I can’t take this at face value due to “many apparent people could be non‑conscious entities”, right? (Sorry to potentially offend you, but it seems like too obvious an implication to pretend not to be aware of.) I personally am fairly content most of the time but do have memories of suffering. Assuming those memories are real, and your suffering is too, I’m still not sure that justifies calling the simulators “cruel”. The price may well be worth paying, if it potentially helps to avert some greater disaster in the base universe or other simulations, caused by insufficient philosophical understanding, moral blind spots, etc., and there is no better alternative.
If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.
A counterexample to this: if humans and AIs both tend to conclude, after a lot of reflection, that they should be axiologically selfish but decision-theoretically cooperative (with other strong agents), then once we hand off power to AIs, they’ll cooperate with each other (and any other powerful agents in the universe or multiverse) to serve their own collective values, while we humans will be screwed.
Another problem is that we’re relatively confident that at least some humans can reason “successfully”, in the sense of making philosophical progress, but we don’t know the same about AI. There seem to be reasons to think it might be especially hard for AI to learn this, and easy for AI to learn something undesirable instead, like optimizing for how persuasive its philosophical arguments are to (certain) humans.
Finally, I find your arguments against moral realism somewhat convincing, but I’m still pretty uncertain, and I find the arguments I gave in Six Plausible Meta-Ethical Alternatives for the realism side of the spectrum still somewhat convincing as well, so I don’t want to bet the universe on or against any of these positions.
I should have given some examples of my own. Here’s Gemini on a story idea of mine, for the Star Wars universe (“I wish there was a story about a power-hungry villain who takes precautions against becoming irrational after gaining power. You’d think that at least some would learn from history. [...] The villain could anticipate that even his best efforts might fail, and create a mechanism to revive copies of himself from time to time, who would study his own past failures, rise to power again, and try to do better each time. [...] Sometimes the villain becomes the hero and the reader roots for him to succeed in stabilizing galactic society long term, but he fails despite his best efforts.”)
That’s a fantastic twist and adds immense depth and pathos to the entire concept! Having a cycle where the villain, armed with the knowledge of past tyrannical failures, genuinely attempts a more stable, perhaps even seemingly benevolent, form of authoritarianism – only to fail despite their best efforts – is incredibly compelling.
Its assessment of this comment:
Yes, I think that reply is a very good and concise explanation of one key reason why wages might fall or stagnate despite rising productivity, viewed through a neoclassical lens. It effectively captures the concept of diminishing marginal value within the firm, even without external market price changes or new entrants.
Its assessment of this comment:
This version is slightly stronger due to the increased specificity in the final sentence. It clearly articulates the key components of the argument and the commenter’s stance relative to the author’s potential views. It’s an excellent comment and ready to post.
Each of those comments did get upvoted to >10, so maybe Gemini is not too far off the mark, and I’m just not used to seeing “fantastic”, “very good”, “excellent” said to me explicitly?
Is Gemini 2.5 Pro really not sycophantic? Because I tend to get more positive feedback from it than from humans in any online or offline conversation. (Alternatively, are the humans around me too reluctant to give explicit praise?)
Why do you think they haven’t talked to us?
They might be worried that their own philosophical approach is wrong but too attractive once discovered, or creates a blind spot that makes it impossible to spot the actually correct approach. The division of Western philosophy into analytic and continental traditions, which are mutually unable to appreciate each other’s work, seems to be an instance of this. They might think that letting other philosophical traditions independently run to their logical conclusions, and then conversing/debating, is one way to try to make real progress.
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.
Can you expand on this? How can you tell the difference, and does it make much of a difference in the end (e.g., if most people get corrupted by power regardless of initial intentions)?
As a background model, I think if someone wants to take responsibility for some part of the world going well, by-default this does not look like “situating themselves in the center of legible power”.
And yet Eliezer, the writer of “heroic responsibility”, is also the original proponent of “build a Friendly AI to take over the world and make it safe”. If your position is that “heroic responsibility” is itself right, but Eliezer and others just misapplied it, that seems to imply we need some kind of post-mortem on what went wrong with trying to apply the concept, and how future people can avoid making the same mistake. My guess is that, like other human biases, this mistake is hard to avoid even if you point it out to people or try other ways to teach them to avoid it, because the drive for status and power is deep-seated and has a strong evolutionary logic.
(My position is, let’s not spread ideas/approaches that will predictably be “misused”, e.g., as justification for grabbing power, similar to how we shouldn’t develop AI that will predictably be “misused”, even if nominally “aligned” in some sense.)
Simulating civilizations won’t solve philosophy directly, but can be useful for doing so eventually by:
Giving us more ideas about how to solve philosophy, by seeing how other civilizations try to do it.
Pointing out potential blind spots / path dependencies in one’s current approach.
Directly solving certain problems (e.g., determining whether all sufficiently advanced civilizations converge to objective values, or to the same decision theory or notion of rationality).
Yeah, that seems a reasonable way to look at it. “Heroic responsibility” could be viewed as a kind of “unhobbling via prompt engineering”, perhaps.
At the outermost feedback loop, capabilities can ultimately be grounded via relatively easy objective measures such as revenue from AI, or later, global chip and electricity production, but alignment can only be evaluated via potentially faulty human judgement. Also, as mentioned in the post, the capabilities trajectory is much harder to permanently derail, because unlike alignment, one can always recover from failure and try again. I think this means there’s an irreducible logical risk (i.e., the possibility that this statement is true as a matter of fact about logic/math) that capabilities research is just inherently easier to automate than alignment research, a risk that no amount of “work hard to automate alignment research” can reduce beyond a certain point. Given the lack of established consensus ways of estimating and dealing with such risk, it’s inevitable that the people who estimate this risk (and other AI risks) to be lowest, and are least concerned about it, will push capabilities forward as fast as they can, and seemingly the only way to solve this on the societal level is to push for norms/laws against doing that, i.e., slow down capabilities research via (politically legitimate) force and/or social pressure. I suspect the author might already agree with all this (the existence of this logical risk, the social dynamics, the conclusion about norms/laws being needed to reduce AI risk beyond some threshold), but I think it should be emphasized more in a post like this.
Since bad people won’t heed your warning it doesn’t seem in good people’s interests to heed it either.
I’m not trying to “warn bad people”. I think we have existing (even if imperfect) solutions to the problem of destructive values and biased beliefs, which “heroic responsibility” actively damages, so we should stop spreading that idea or even argue against it. See my reply to Ryan, which is also relevant here.
If humans can’t easily overcome their biases or avoid having destructive values/beliefs, then it would make sense to limit the damage through norms and institutions (things like informed consent, boards, separation of powers and responsibilities between branches of government). Heroic responsibility seems antithetical to group-level solutions, because it implies that one should ignore norms like “respect the decisions of boards/judges” if needed to “get the job done”, and reduces social pressure to follow such norms (by giving up the moral high ground from which one could criticize such norm violations).
You’re suggesting a very different approach, of patching heroic responsibility with anti-unilateralist-curse-type intuitions (on the individual level), but that’s still untried and seemingly quite risky / possibly unworkable. Until we have reason to believe that the new solution is an improvement over the existing ones, it still seems irresponsible to spread an idea that damages the existing solutions.
Reassessing heroic responsibility, in light of subsequent events.
I think @cousin_it made a good point: “if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people”, and I would generalize it to people with biased beliefs (which are often downstream of a kind of value difference, i.e., selfish genes).
It seems to me that “heroic responsibility” (or something equivalent but not causally downstream of Eliezer’s writings) is contributing to the current situation, of multiple labs racing for ASI and essentially forcing the AI transition on humanity without consent or political legitimacy, each thinking or saying that they’re justified because they’re trying to save the world. It also seemingly justifies or obligates Sam Altman to fight back when the OpenAI board tried to fire him, if he believed the board was interfering with his mission.
Perhaps “heroic responsibility” would make more sense if overcoming bias were easy, but in a world where it’s actually hard and/or few people are actually motivated to do it, which seems to be the world we live in, spreading the idea of “heroic responsibility” seems, well, irresponsible.
But as you suggested in the post, the apparently vast amount of suffering isn’t necessarily real? “most cosmic details and human history are probably fake, and many apparent people could be non‑conscious entities”
(However, I take the point that doing such simulations can be risky or problematic, e.g., if one’s current ideas about consciousness are wrong, or if doing philosophy correctly requires having experienced real suffering.)
My alternative hypothesis is that we’re being simulated by a civilization trying to solve philosophy, because they want to see how other civilizations might approach the problem of solving philosophy.
Did anyone predict that we’d see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)?
Just four days later, X blew up with talk of how GPT-4o had become sickeningly sycophantic in recent days, followed by an admission from Sam Altman that something went wrong (with lots of hilarious examples in the replies):
the last couple of GPT-4o updates have made the personality too sycophant-y and annoying (even though there are some very good parts of it), and we are working on fixes asap, some today and some this week.
at some point will share our learnings from this, it’s been interesting.
I initially tried to use Gemini 2.5 Pro to write the whole explanation, but it kept making one mistake after another in its economics reasoning. Each rewrite would contain a new mistake after I pointed out the last one, or it would introduce a new mistake when I asked for some other kind of change. After pointing out 8 mistakes like this, I finally gave up and wrote it myself. I also tried Grok 3 and Claude 3.7 Sonnet but gave up more quickly on them after the initial responses didn’t look promising. However, AI still helped a bit by reminding me of the right concepts/vocabulary.
Thought it would be worth noting this, as it seems a bit surprising (a supposedly “PhD-level” AI failing badly on an Econ 101 problem). Here is the full transcript in case anyone is curious. Digging into this a bit myself, it appears that the “PhD-level” claim is based on performance on GPQA, which includes Physics, Chemistry, and Biology, but not Economics.
In a competitive market, companies pay wages equal to the Value of Marginal Product of Labor (VMPL) = P * MPL (price of marginal output * marginal product per hour). (In programming, outputs are things like new features or bug fixes, which don’t have prices attached, so P here is really more like the perceived/estimated value (impact on company revenue or cost) of the output.)
When AI increases MPL, it can paradoxically decrease VMPL by decreasing P even more, even if there are no new entrants in the programming labor market. This is because each company has a limited (especially in the short run) stock of high-value programming work to be done, which can be quickly exhausted by the enhanced productivity, leaving only low-value work at the margin.
A more detailed explanation, expanded by AI from the above, in case the above is too terse to follow.
Let’s unpack why programmer wages might stagnate or even fall when tools like AI dramatically increase individual productivity, even if the number of programmers available hasn’t changed. The core idea lies in how wages are determined in economic theory and how hyper-productivity can affect the value of the work being done at the margin.
1. How Are Wages Typically Determined? The VMPL = Wage Rule
In standard economic models of competitive labor markets, a company will hire workers (or more specifically, worker-hours) up to the point where the value generated by the last worker hired (the “marginal” worker) is just equal to the wage that worker must be paid. This value is called the Value of Marginal Product of Labor (VMPL).
VMPL = P * MPL
Let’s break down those components:
MPL (Marginal Product of Labor): This is the physical increase in output generated by adding one more unit of labor (e.g., one more hour of programming work). When you use AI assistance, you can produce more code, fix bugs faster, or complete features in less time. So, AI unambiguously increases your MPL. You are physically more productive per hour.
P (Price or Value of Output): This is the value the company gets from the specific output produced by that marginal hour of labor.
For factory workers making identical widgets, ‘P’ is simply the market price of one widget.
For programmers, it’s more complex. Features and bug fixes don’t usually have individual price tags. Instead, ‘P’ here represents the estimated value or business impact that the company assigns to the work done in that hour. This could be its impact on revenue, user retention, cost savings, strategic goals, etc. Crucially, different tasks have different perceived values (‘P’). Fixing a critical production bug has a much higher ‘P’ than tweaking a minor UI element.
So, the rule is: Wage = VMPL = (Value of Marginal Output) * (Marginal Output per Hour).
2. How AI Creates a Potential Paradox: Increasing MPL Can Decrease Marginal ‘P’
AI clearly boosts the MPL part of the equation. You get more done per hour. Naively, one might think this directly increases VMPL and thus wages. However, AI’s impact on the marginal ‘P’ is the key to the paradox.
Here’s the mechanism, focusing solely on the existing workforce (no new programmers):
Finite High-Value Work: Every company, at any given time, has a limited set of programming tasks it considers high-priority and high-value. Think of core features, major architectural improvements, critical bug fixes. There’s also a long list of lower-priority tasks: minor enhancements, refactoring less critical code, documentation updates, exploring speculative ideas.
Productivity Exhausts High-Value Tasks Faster: When programmers become 2x or 3x more productive thanks to AI, they can complete the company’s high-value task list much more quickly than before.
The Need to Fill Hours: Assuming the company still employs the same number of programmers for the same number of hours (perhaps due to contracts, wanting to retain talent, or needing ongoing maintenance), what happens once the high-value backlog is depleted faster? Managers need to assign some work to fill those paid hours.
Assigning Lower-Value Marginal Work: The work assigned during those hours will increasingly be drawn from the lower end of the priority list. The last few hours the company decides to pay for across its programming staff (the marginal hours) will be dedicated to tasks with significantly lower perceived business value (‘P’).
The Marginal ‘P’ Falls: The very productivity enhancement provided by AI leads directly to a situation where the value (‘P’) of the task performed during the marginal hour decreases significantly.
3. The Impact on VMPL and Wages
Now let’s look at the VMPL for that crucial marginal hour of programming work:
VMPL_marginal = P_marginal_task * MPL_boosted
Even though MPL_boosted is higher than before, if P_marginal_task has fallen proportionally more due to task saturation, the resulting VMPL_marginal can actually be lower than the VMPL_marginal before the AI productivity boost.
Example:
Before AI: Marginal hour MPL = 1 unit of work, Value P = $100/unit → VMPL = $100. Wage = $100.
After AI: Productivity doubles, MPL = 2 units/hour. But all $100-value tasks are done quickly. The marginal hour is now spent on a task valued at P = $40/unit.
New VMPL_marginal: $40/unit * 2 units/hour = $80.
Result: The maximum wage the company is willing to pay for that marginal hour (and thus, the market wage for all similar hours) could fall to $80, even though individual programmers are producing more output per hour.
Conclusion:
This explanation shows how a massive increase in individual productivity (MPL) can, somewhat paradoxically, lead to stagnant or falling wages even without any change in the number of available programmers. The mechanism is the saturation of high-value work. Because companies hire based on the value generated at the margin, and high productivity allows that margin to quickly shift to lower-value tasks, the perceived value (‘P’) of that marginal work can decrease significantly, potentially offsetting or even overwhelming the gains in physical productivity (MPL) when calculating the VMPL that determines wages.
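For anyone who wants to play with the numbers, here is a minimal Python sketch of the same toy calculation (the $100/$40 task values and the 2x productivity figure are the illustrative numbers from the example above, not empirical data):

```python
# Toy calculation from the example above: wages are pinned to the VMPL of the
# marginal hour (VMPL = P * MPL), so a productivity boost can lower wages if it
# pushes the marginal hour onto lower-value tasks.

def marginal_vmpl(p_marginal_task: float, mpl: float) -> float:
    """Value of marginal product for the last hour of programming work."""
    return p_marginal_task * mpl

# Before AI: the marginal hour produces 1 unit of work on a $100/unit task.
wage_before = marginal_vmpl(p_marginal_task=100, mpl=1)  # $100

# After AI: MPL doubles, but the high-value backlog is exhausted, so the
# marginal hour lands on a $40/unit task.
wage_after = marginal_vmpl(p_marginal_task=40, mpl=2)    # $80

print(f"Marginal VMPL before AI: ${wage_before}")
print(f"Marginal VMPL after AI:  ${wage_after}")
```

The same function also makes the boundary case easy to check: if the marginal task value only fell to $60 instead of $40, the marginal VMPL would be $120 and wages would still rise, so the outcome hinges entirely on how quickly the high-value backlog gets exhausted.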
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.
Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? Such interests seem to be purely or mostly social, given their lack of practical utility, but then why the divergence in interests? (Understood that you don’t have a complete understanding yet, so I’m just flagging this as a potential puzzle, not demanding an immediate answer.)
There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them.
I’m only “happy” to “entrust the future to the next generation of humans” if I know that they can’t (i.e., don’t have the technology to) do something irreversible, in the sense of foreclosing a large space of potential positive outcomes, like locking in their values, or damaging the biosphere beyond repair. In other words, up to now, any mistakes that a past generation of humans made could be fixed by a subsequent generation, and this is crucial for why we’re still in an arguably ok position. However, AI will quickly make this untrue by advancing technology.
So I really want the AI transition to be an opportunity for improving the basic dynamic of “the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions”, for example by improving our civilizational philosophical competency (which may allow distributional shifts to be handled in a more principled way), rather than just saying that it’s always been this way, so it’s fine to continue.
I’m going to read the rest of that post and your other posts to understand your overall position better, but at least in this section, you come off as being a bit too optimistic or nonchalant from my perspective...
Doesn’t this seem like a key flaw in the usual scaling laws? Why haven’t I seen this discussed more? The OP did mention declining average data quality but didn’t emphasize it much. This 2023 post trying to forecast AI timelines based on scaling laws did not mention the issue at all, and I received no response when I made this point in its comments section.
I guess this is related to the fact that LLMs are very data-inefficient relative to humans, which implies that an LLM needs to be trained on each idea or piece of knowledge multiple times, in multiple forms, before it “learns” it. It’s still hard for me to understand this on an intuitive level, but I guess if we did understand it, the problem of data inefficiency would be close to being solved, and we’d be much closer to AGI.