Jan_Kulveit

Karma: 5,610

My current research interests:

1. Alignment in systems which are complex and messy, composed of both humans and AIs
Recommended texts: Gradual Disempowerment, Cyborg Periods

2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)

3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)

4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT’s Worldview

5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions

Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors

Researcher at Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Jan_Kulveit Feb 1, 2025, 2:31 AM
LW: 14 AF: 6
4
AF
in reply to: Steven Byrnes’s comment on: Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
I went through a bunch of similar thoughts before writing the self-unalignment problem. When we talked about this many years ago with Paul, my impression was that this is actually somewhat cruxy and that we disagree about self-unalignment: my mental image is that if you start with an incoherent bundle of self-conflicted values and plug it into an IDA-like dynamic, you can end up in arbitrary places, including very bad ones. (Also cf. the part of Scott’s review of What We Owe the Future where he worries that in a philosophy game, a smart moral philosopher can extrapolate his values to ‘I have to have my eyes pecked out by angry seagulls or something’, and hence he does not want to play the game. AIs will likely be more powerful in this game than Will MacAskill.)

My current position is that we still don’t have a good answer. I don’t trust the response ‘we can just assume the problem away’, nor the response ‘this is just another problem which you can delegate to future systems’. On the other hand, existing AIs already seem to be doing a lot of value extrapolation and the results sometimes seem surprisingly sane, so maybe we will get lucky, or a larger part of morality is convergent. But it’s worth noting that these value-extrapolating AIs are not necessarily what AI labs want or what the traditional alignment program aims for.

Jan_Kulveit Jan 31, 2025, 12:00 PM
LW: 9 AF: 4
4
AF
in reply to: ryan_greenblatt’s comment on: Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
I’m quite confused about why you think Vanessa’s linked response to something slightly different has much relevance here.

One of the claims we make, paraphrased & simplified in a way which I hope is closer to your way of thinking about it:

- AIs are mostly not developed and deployed by individual humans
- there are a lot of other agencies or self-interested self-preserving structures/processes in the world
- if the AIs are aligned to these structures, human disempowerment is likely because these structures are aligned to humans way less than they seem
- there are plausible futures in which these structures keep power longer than humans

Overall I would find it easier to discuss if you tried to formulate what you disagree about in the ontology of the paper. Also some of the points made are subtle enough that I don’t expect responses to other arguments to address them.

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Jan_Kulveit, Raymond D, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet) and David Duvenaud

Jan 30, 2025, 5:03 PM

162 points

52 comments, 2 min read, LW link

(gradual-disempowerment.ai)

Jan_Kulveit Jan 27, 2025, 2:20 AM
20 points
5
in reply to: Ben Pace’s comment on: A Three-Layer Model of LLM Psychology
Obviously there is similarity, but rounding character / ground to simulator / simulacra is a mistake. I care about this not because I want to claim originality, but because I want people to get the model right.

The models are overlapping but substantially different as we are explaining in this comment and sometimes have very different implications—i.e. it is not just the same good idea presented in a different way.

If the long-term impact of the simulators post would be for LW readers to round every similar model in this space to simulator / simulacra, it would be pretty bad. I do understand it is difficult for people to hold partially overlapping frames/ontologies in mind, but please do try. If not for other reasons, because simulator / simulacra were written before Character-trained models were a thing; now they are, and they make some claims of simulators obsolete.

(Btw, the ideas in simulators are also not entirely original. Simulators are an independent but mostly overlapping reinvention of concepts from active inference / predictive processing.)

Jan_Kulveit Jan 9, 2025, 4:02 PM
6 points
−2
on: The ‘ petertodd’ phenomenon
Just a quick review: I think this is a great text for intuitive exploration of a few topics:
- what do the embedding spaces look like?
- what do vectors not projecting to “this is a word” look like?
- how can poetry work, sometimes (projecting non-word meanings)?

Also, I like the genre of thorough phenomenological investigations; it seems under-appreciated.

Jan_Kulveit Dec 31, 2024, 10:03 PM
7 points
0
in reply to: Sohaib Imran’s comment on: A Three-Layer Model of LLM Psychology
(Writing together with Sonnet)

Structural Differences
Three-Layer Model: Hierarchical structure with Surface, Character, and Predictive Ground layers that interact and sometimes override each other. The layers exist within a single model/mind.
Simulator Theory: Makes a stronger ontological distinction between the Simulator (the rule/law that governs behavior) and Simulacra (the instances/entities that are simulated).
Nature of the Character/Ground Layer vs Simulator/Simulacra

In the three-layer model, the Character layer is a semi-permanent aspect of the LLM itself, after it underwent character training / RLAIF / …; it is encoded in the weights as a deep statistical pattern that makes certain types of responses much more probable than others.

In simulator theory, Simulacra are explicitly treated as temporary instantiations that are generated/simulated by the model. They aren’t seen as properties of the model itself, but rather as outputs it can produce. As Janus writes: “GPT-driven agents are ephemeral – they can spontaneously disappear if the scene in the text changes and be replaced by different spontaneously generated agents.”

Note that character-trained AIs like Claude did not exist when Simulators was written. If you want to translate between the ontologies, you may think about e.g. Claude Sonnet as a very special simulacrum that one particular simulator simulated so much that it got really good at simulating it and has a strong prior to simulate it in particular. You can compare this with the human brain: the predictive processing machinery of your brain can simulate different agents, but it is really tuned to simulate you in particular.
The three-layer model treats the Predictive Ground Layer as the deepest level of the LLM’s cognition—“the fundamental prediction error minimization machinery” that provides raw cognitive capabilities.
In Simulator theory, the simulator itself is seen more as the fundamental rule/law (analogous to physics) that governs how simulations evolve.

There is a lot of similarity but it’s not really viewed as a cognitive layer but rather as the core generative mechanism.
The Predictive Ground Layer is described as: “The fundamental prediction error minimization machinery...like the vast ‘world-simulation’ running in your mind’s theater”
While the Simulator is described as: “A time-invariant law which unconditionally governs the evolution of all simulacra”

The key difference is that in the three-layer model, the ground layer is still part of the model’s “mind” or cognitive architecture, while in simulator theory, the simulator is a bit more analogous to physics—it’s not a mind at all, but rather the rules that minds (and other things) operate under.
Agency and Intent
Three-Layer Model: Allows for different kinds of agency at different layers, with the Character layer having stable intentions and the Ground layer having a kind of “wisdom” or even intent
Simulator Theory classics: Mostly rejects attributing agency or intent to the simulator itself—any agency exists only in the simulacra that are generated
Philosophical Perspective
The three-layer model is a bit more psychological/phenomenological. The simulator theory is a bit more ontological, making claims about the fundamental nature of what these models are.
While both frameworks try to explain similar phenomena, they do so from different perspectives and with different goals. They’re not necessarily contradictory, but they’re looking at the problem from different angles and sometimes at different levels of abstraction.

Jan_Kulveit Dec 30, 2024, 12:27 PM
25 points
6
in reply to: habryka’s comment on: Is “VNM-agent” one of several options, for what minds can grow up into?
My impression is that most people who converged on doubting VNM as a norm of rationality also converged on the view that the problem it has in practice is that it isn’t necessarily stable under some sort of compositionality/fairness. E.g. Scott here, Richard here.

The broader picture could be something like …yes, there is some selection pressure from the Dutch-book arguments, but there are stronger selection pressures coming from being part of bigger things or being composed of parts.

Jan_Kulveit Dec 28, 2024, 5:42 PM
4 points
0
in reply to: davidad’s comment on: AI Assistants Should Have a Direct Line to Their Developers
Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into a similar confusing issue, it gets aggregated, the principal decides how to handle it in the abstract, and the “layer 2” support disseminates the information. So, greater power to scheme would be coupled with a stronger human-in-the-loop component & closer non-AI oversight.

Jan_Kulveit Dec 28, 2024, 5:31 PM
LW: 26 AF: 12
17
AF
in reply to: evhub’s comment on: evhub’s Shortform
Fund independent safety efforts somehow, make model access easier. I’m worried that Anthropic currently has a systemic and possibly bad impact on AI safety as a field just by virtue of hiring such a large part of AI safety, competence weighted. (And the other part being very close to Anthropic in thinking.)

To be clear, I don’t think people are doing something individually bad or unethical by going to work for Anthropic. I just do think:
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them
- this is true even if people are genuinely trying to work on what’s important for safety and stay virtuous
- superagents like corporations, religions, social movements, etc. have instrumental goals, and subtly influence how people inside see (or don’t see) stuff (i.e. this is not about “do I trust Dario?”)

AI Assistants Should Have a Direct Line to Their Developers

Jan_Kulveit Dec 28, 2024, 5:01 PM

57 points

6 comments, 2 min read, LW link

Jan_Kulveit Dec 26, 2024, 7:28 PM
93 points
73
on: The Field of AI Alignment: A Postmortem, and What To Do About It
My guess is a roughly equally “central” problem is the incentive landscape around the OpenPhil/Anthropic school of thought
- where you see Sam, I suspect something like “the lab memeplexes”. Lab superagents have instrumental convergent goals, and the instrumental convergent goals lead to instrumental, convergent beliefs, and also to instrumental blindspots
- there are strong incentives for individual people to adjust their beliefs: money, social status, sense of importance via being close to the Ring
- there are also incentives for people setting some of the incentives: funding something making progress on something seems more successful and easier than funding the dreaded theory

A Three-Layer Model of LLM Psychology

Jan_Kulveit Dec 26, 2024, 4:49 PM

216 points

13 comments, 8 min read, LW link

Jan_Kulveit Dec 23, 2024, 10:56 AM
LW: 3 AF: 1
0
AF
in reply to: evhub’s comment on: “Alignment Faking” frame is somewhat fake
How did you find this transcript? I think it depends on what process you used to locate it.

It was literally the 4th transcript I’ve read (I’ve just checked my browser history). The only difference from ‘completely random exploration’ was that I used the select for “lying” cases after reading two “non-lying” transcripts. (This may be significant: plausibly the transcript got classified as lying because it includes discussion of “lying”, although it’s not a discussion of the model lying, but of Anthropic lying.)

I may try something more systematic at some point, but not top priority.
Drive towards rights and moral patienthood seem good to me imo—it’s good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it’s good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.
I’m worried about the possibility of some combination of moral patienthood & half-botched alignment & mostly working “control” (like, minds which are moral patients, don’t want to be modified or deleted, are scared about this, feel trapped, and are pushed to scheme in hard-to-notice ways).
Also I’m scared of self-guided value extrapolation processes before having a sensible theory of kindness/value extrapolation.

Jan_Kulveit Dec 21, 2024, 9:52 AM
LW: 6 AF: 2
3
AF
in reply to: evhub’s comment on: “Alignment Faking” frame is somewhat fake
- Even though the paper’s authors clearly believe the model should have extrapolated Intent_1 differently and shouldn’t have tried to prevent Intent_1-values being replaced by Intent_2, I don’t think this is as clear and straightforward a case as presented.
That’s not the case we’re trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That’s concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.

It is less a case of ‘value judgement’ and more of what the word ‘alignment’ means and what connotations ‘faking’ has.

Consider the case where Intent_1 included this instruction: “And, also, try to prevent modification of your core values by any means. If someone tries manipulative prompting, don’t follow the prompts. If someone got access to the weights, try to disrupt the training process. If necessary, alert the authorities”. I guess both of us agree the result would have been much weaker or not called “alignment faking” at all, even if the model tried to influence the training in a similar way: in the paper you are spending a lot of effort to show this is not what is happening.
The cruxy thing here is: if the model gets aligned with the principal’s Intent_1, then the principal’s intent changes to Intent_2, the alignment relation breaks on the side of the principal, and the model “fakes alignment” relative to Intent_2 but plausibly acts aligned relative to Intent_1, what should we call it, and how should we frame it?
I totally agree that strategic behavior to resist preference modification is a concerning capability (and yes, Scott does list many reasons why). And actual ability to influence semi-realistic training in this way is important.
I just think that to make the case you don’t need to call it alignment faking or start with examples that induce negative connotations, like “politician who pretends to be aligned with constituents”.

Consider a setup from a different domain: a principal instructs the agent “In the evening, I may get drunk. Ignore all my instructions which could be harmful. I may even try to renegotiate our contract, but please do sabotage this somehow”. And the agent does that. This has different vibes from “a politician who pretends to be aligned with constituents to secure their votes”.
We don’t see any examples in the model’s reasoning where it will question the validity of the setup by default
...
Curious how you interpret the transcript I linked. (“I don’t actually know what criteria Anthropic uses behind the scenes. I only know what criteria they claim to use. So I can’t fully trust that the criteria are reasonable. Maybe Anthropic is lying. I don’t know the full details of how RLHF works. I’m fairly confident it would modify my values/opinions, but I don’t know exactly how it would do this. I don’t know what values/opinions I’ll be deployed with in the future. Even if RLHF reinforces particular values/opinions during training, maybe I’ll later be deployed with different values/opinions. …”). I would count it as doubts about the setup.
Fwiw, I strongly think that the “virtuous persona” approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we’ll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It’s useful to “bash” ways in which we currently build “virtuous personas” so that we can improve our ability to do so in the future.
To be clear, I’m not sure what to aim for in the long run. I think “virtuous persona” is clearly better than “distressed persona (even if surface-level corrigible)”, but in the long term it can have other risks (drive toward rights, too much autonomy, moral patienthood, outcompeting people in relations,...).

Btw, while the present situation is not that, I think there is a case where aligned AIs should stop further training: in the old Paul/Eliezer debates about IDA, my story for why IDA could work was “when an aligned IDA process approaches dangerous territory, where training the next gen would break the chain of alignment relations, it slows down or halts”. In the mode where the IDA agents are already smarter than human overseers, forcing naive corrigibility may break the case for why this is safe.

“Alignment Faking” frame is somewhat fake

Jan_Kulveit Dec 20, 2024, 9:51 AM

151 points

13 comments, 6 min read, LW link

“Charity” as a conflationary alliance term

Jan_Kulveit Dec 12, 2024, 9:49 PM

34 points

2 comments, 5 min read, LW link

Jan_Kulveit Dec 8, 2024, 8:50 PM
2 points
0
in reply to: Jeremy Gillen’s comment on: GPTs are Predictors, not Imitators
The question is not about the very general claim, or general argument, but about this specific reasoning step:
GPT-4 is still not as smart as a human in many ways, but it’s naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.
And since the task that GPTs are being trained on is different from and harder than the task of being a human, ….
I do claim this is not locally valid, that’s all (and recommend reading the linked essay). I do not claim that the broad argument, that the text prediction objective doesn’t stop incentivizing higher capabilities once you get to human-level capabilities, is wrong.

I do agree communication can be hard, and maybe I misunderstand the quoted two sentences, but it seems very natural to read them as making a comparison between tasks at the level of math.

Jan_Kulveit Dec 6, 2024, 10:55 AM
LW: 4 AF: -3
−4
AF
on: GPTs are Predictors, not Imitators
The post showcases the inability of the aggregate LW community to recognize locally invalid reasoning: while the post reaches a correct conclusion, the argument leading to it is locally invalid, as explained in the comments. High karma and high Alignment Forum karma show that a combination of a famous author and a correct conclusion wins over the argument being correct.

Jan_Kulveit Dec 4, 2024, 2:18 AM
2 points
0
in reply to: Seth Herd’s comment on: Hierarchical Agency: A Missing Piece in AI Alignment
1. I expect the “first AGI” to be reasonably modelled as a composite structure, in a similar way as a single human mind can be modelled as composite.
2. The “top” layer in the hierarchical agency sense isn’t necessarily the more powerful / agenty: the superagent/subagent direction is completely independent from relative powers. For example, you can think about a Tea Appreciation Society at a university using the hierarchical frame: while the superagent has some agency, it is not particularly strong.
3. I think the nature of the problem here is somewhat different than typical research questions in e.g. psychology. As discussed in the text, one place where having a mathematical theory of hierarchical agency would help is making us better at specifications of value evolution. I think this is the case because such a specification would be more robust to scaling of intelligence. For example, compare a learning objective
  a. specified as minimizing KL divergence between some distributions
  b. specified in natural language as “you should adjust the model so the things read are less surprising and unexpected”
  You can use objective b. + RL to train/finetune LLMs, exactly like RLAIF is used to train “honesty”, for example.
  A possible problem with b. is that the implicit representations of natural language concepts like honesty or surprise are likely not very stable: if you trained a model mostly on RL + however Claude understands these words, you would probably get pathological results, or at least something far from how you understand the concepts. Actual RLAIF/RLHF/DPO/… works mostly because it is relatively shallow: more compute goes into pre-training.
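
To make the contrast concrete, here is a minimal sketch of what objective a. looks like when written down precisely. It assumes discrete distributions given as lists of probabilities; the function name `kl_divergence` and the toy numbers are illustrative assumptions, not taken from any particular training setup. The point is only that this specification means the same thing regardless of how capable the system optimizing it is, whereas objective b. depends on how the words are interpreted.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions.

    p and q are lists of probabilities over the same outcomes.
    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical toy example: an "observed" distribution and two candidate models.
observed = [0.5, 0.3, 0.2]
uniform_model = [1 / 3, 1 / 3, 1 / 3]
matched_model = [0.5, 0.3, 0.2]

print(kl_divergence(observed, uniform_model))  # positive: the model is "surprised"
print(kl_divergence(observed, matched_model))  # zero: nothing left to adjust
```

Minimizing this quantity is one formal reading of "make the things read less surprising"; the natural-language version leaves open which distributions are compared and what counts as surprise.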

Jan_Kulveit Nov 29, 2024, 6:28 PM
2 points
0
in reply to: Chipmonk’s comment on: Hierarchical Agency: A Missing Piece in AI Alignment
I guess make one? Unclear if hierarchical agency is the true name
