I’m interested in doing in-depth dialogues to find cruxes. Message me if you’d like to try this.
I do alignment research, mostly in the vicinity of agent foundations. Currently doing independent research on ontology identification. Formerly on Vivek’s team at MIRI. Most of my writing before mid-2023 is not representative of my current views about alignment difficulty.
Nice, you’ve expressed the generalization argument for expecting goal-directedness really well. Most of the post seems to match my beliefs.
I’d like you to clarify what this means, and to get at some of the latent variables behind it.
One interpretation is that any specific high-stakes attempt to subvert control measures is 50-70% likely to fail, but that if we kept running approximately the same setup afterwards, some attempt would soon succeed with high probability.
The way I think about black-box control, the success condition is “getting an alignment solution that we trust”. So another interpretation is that you’re saying there’s a 30-50% chance of this (maybe conditioned on no low-stakes sabotage)? If so, this implies a guess about the scale of effort required; could you describe that scale? If you’re imagining a Manhattan-project-scale effort with lots of exploration and freedom of experimentation, that’s a very different level of optimism than if you’re imagining something on the scale of a single PhD project.
Why not higher? I don’t see where the inside-view uncertainty is coming from. Is it uncertainty over how future training will be done? Or uncertainty over the shape of mind-space-in-general, where accidental implicit biases related to e.g. {“conservativeness”, “trying-hard”, “follow heuristics instead of supergoal reasoning”} might make instrumental reward-seeking unlikely by default?