A slightly sideways argument for interpretability: It’s a really good way to introduce the importance and tractability of alignment research
In my experience it’s very easy to explain to someone with no technical background that:
Image classifiers have gotten much, much better (like in 10 years they went from being basically impossible to something you can run on your laptop)
We actually don’t really understand why they do what they do (like we don’t know why the classifier says this is an image of a cat, even when it’s right)
But, thanks to dedicated research, we have begun to understand a bit of what’s going on in the black box (like we know it has learned what a curve is, and we can tell when it thinks it sees one)
Then you say ‘this is the same thing that big companies are using to maximise your engagement on social media and sell you stuff, and look at how that’s going. And by the way, did you notice how AIs keep getting bigger and stronger?’
At this point, in my experience, it’s very easy for people to understand why alignment matters, and also what kind of thing you can actually do about it.
Compare this to trying to explain why people are worried about mesa-optimisers, boxed oracles, or even the ELK problem: it’s all a lot less concrete. People seem to approach it much more like a thought experiment and less like an ongoing problem, and it’s harder to grasp why ‘developing better regularisers’ might be a meaningful goal.
But interpretability gives people a non-technical story for how alignment affects their lives, the scale of the problem, and how progress can be made. IMO no other approach to alignment is anywhere near as good for this.
Thanks! Yeah this isn’t in the paper, it’s just a thing I’m fairly sure of which probably deserves a more thorough treatment elsewhere. In the meantime, some rough intuitions would be:
delusions are a result of causal confounders, which must be hidden upstream variables
if you actually simulate, and therefore specify, an entire Markov blanket, it will screen off all other upstream variables, including all possible confounders
this is ludicrously difficult for agents with a long history (like a human), but if the STF story is correct, it’s sufficient, and crucially, you don’t even need to know the full causal structure of reality, just a complete Markov blanket
any holes in the Markov blanket/boundary represent ways for unintended causal pathways to leak through, separating the predictor’s predictions about the effect of an action from the actual causal effect of the action and making the agent appear ‘delusional’ (see the toy sketch below)
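To make the confounder/screening-off intuition concrete, here’s a minimal toy sketch (my own illustration, not anything from the paper, and the whole setup, U, A, Y, the 0.9, is made up): a hidden state U drives both a demonstrator’s action A and the outcome Y, so a predictor conditioning on A alone is confounded, while also conditioning on U (i.e. closing that hole in the blanket) matches the effect of actually intervening on A.

```python
# Toy illustration (mine, not from the paper): a hidden confounder U drives both
# the demonstrator's action A and the outcome Y. Conditioning on A alone gives a
# "delusional" prediction; also conditioning on U (completing the Markov blanket
# around the action) screens the confounder off and matches the true effect of
# intervening on A.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

u = rng.integers(0, 2, size=n)          # hidden upstream state (the confounder)
knows_u = rng.random(n) < 0.9           # the demonstrator usually observes U...
a = np.where(knows_u, u, 1 - u)         # ...and picks the action that matches it
y = (a == u).astype(int)                # the action only succeeds if it matches U

# Purely predictive estimate: P(Y=1 | A=1) in the observational data (~0.9).
p_obs = y[a == 1].mean()

# True causal effect: set A=1 regardless of U, i.e. P(Y=1 | do(A=1)) (~0.5).
p_do = (u == 1).mean()

# Blanket-adjusted estimate: condition on U as well, then average over P(U)
# (back-door adjustment). This recovers ~0.5 from observational data alone.
p_adj = sum((u == v).mean() * y[(a == 1) & (u == v)].mean() for v in (0, 1))

print(f"P(Y=1 | A=1)      ~ {p_obs:.2f}  (confounded prediction)")
print(f"P(Y=1 | do(A=1))  ~ {p_do:.2f}  (actual effect of acting)")
print(f"adjusted estimate ~ {p_adj:.2f}  (confounder screened off)")
```

The gap between the first two numbers is the ‘delusion’; the third shows that you don’t need the full causal structure of the world, just enough of the blanket to block the path through U.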
I hope we’ll have a proper writeup soon; in the meantime let me know if this doesn’t make sense.