My current research interests:
- alignment in complex, messy systems composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality
Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science, and complex systems.
I like this review/retelling a lot.
Minor point
Regarding the “Phase I” and “Phase II” terminology—while it has some pedagogical value, I worry about people interpreting it as a clean temporal decomposition, the implication being that we first solve alignment and then move on to Phase II.
In reality, the dynamics are far messier, with some ‘Phase II’ elements already complicating our attempts to address ‘Phase I’ challenges.
Some of the main concerning pathways include:
- People attempting to harness superagent-level powers to advance their particular visions of the future. For example, Leopold-style thinking of “let’s awaken the spirit of the US and its servants to engage in a life-or-death struggle with China.” Such forces seem far easier to summon than to control. We already see plenty of people feeling patriotic about AGI and feeling the need to move as fast as possible so their nation wins—AGI is to a large extent already being developed by memeplexes/superagents. People close to the development are partially deluding themselves about how much control they individually have over the process, or even about the identity of the ‘we’ they assume the AI will be aligned with. Memes often hide as parts of people’s identities.