Jeremy Gillen

Karma: 2,007

I’m interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.

I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek’s team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.

Jeremy Gillen Jun 11, 2025, 10:16 AM
11 points
3
in reply to: johnswentworth’s comment on: johnswentworth’s Shortform
Have you personally done the thing successfully with another person, with both of you actually picking up on the other person’s hints?
Yes. But usually the escalation happens over weeks or months, over multiple conversations (at least in my relatively awkward nerd experience). So it’d be difficult to notice people doing this. Maybe twice I’ve been in situations where hints escalated within a day or two, but both were building from a non-zero level of suspected interest. But none of these would have been easy to notice from the outside, except maybe at a couple of moments.

Jeremy Gillen May 30, 2025, 9:07 PM
3 points
0
in reply to: Aharon Azulay’s comment on: Interpretability Will Not Reliably Find Deceptive AI
Everyone agrees that sufficiently unbalanced games can allow a human to beat a god. This isn’t a very useful fact, since it’s difficult to intuit how unbalanced the game needs to be.
If you can win against a god with queen+knight odds you’ll have no trouble reliably beating Leela with the same odds. I’d bet you can’t win more than 6 out of 10? $20?

Jeremy Gillen May 29, 2025, 10:04 PM
2 points
0
in reply to: faul_sname’s comment on: Interpretability Will Not Reliably Find Deceptive AI
Yeah I didn’t expect that either, I expected earlier losses (although in retrospect that wouldn’t make sense, because stockfish is capable of recovering from bad starting positions if it’s up a queen).
Intuitively, over all the games I played, each loss felt different (except for the substantial fraction that were just silly blunders). I think if I learned to recognise blunders in the complex positions I would just become a better player in general, rather than just against LeelaQueenOdds.
Just tried hex, that’s fun.

Jeremy Gillen May 29, 2025, 9:01 PM
9 points
0
in reply to: faul_sname’s comment on: Interpretability Will Not Reliably Find Deceptive AI
I don’t think that’d help a lot. I just looked back at several computer analyses, and the (stockfish) evaluations of the games all look like this:
This makes me think that Leela is pushing me into a complex position and then letting me blunder. I’d guess that looking at optimal moves in these complex positions would be good training, but probably wouldn’t have easy to learn patterns.

Jeremy Gillen May 29, 2025, 7:37 PM
4 points
0
in reply to: faul_sname’s comment on: Interpretability Will Not Reliably Find Deceptive AI
I haven’t heard of any adversarial attacks, but I wouldn’t be surprised if they existed and were learnable. I’ve tried a variety of strategies, just for fun, and haven’t found anything that works except luck. I focused on various ways of forcing trades, and this often feels like it’s working but almost never does. As you can see, my record isn’t great.
I think I started playing it when I read simplegeometry’s comment you linked in your shortform.
It seems to be gaining a lot of ground by exploiting my poor openings. Maybe one strategy would be to memorise a specialised opening much deeper than usual? That could be enough. But it’d feel like cheating to me if I used an engine to find that opening. It’d also feel like cheating because it’s exploiting Leela’s lack of memory of past games. It’d be easy to modify it to deliberately play diverse games when playing against the same person.

Jeremy Gillen May 29, 2025, 11:27 AM
5 points
2
in reply to: Aharon Azulay’s comment on: Interpretability Will Not Reliably Find Deceptive AI
Can you beat this bot though?

Jeremy Gillen May 26, 2025, 10:45 AM
1 point
0
in reply to: CSDD’s comment on: CSDD’s Shortform
I highly recommend reading the sequences. I re-read some of them recently. Maybe Yudkowsky’s Coming of Age is the most relevant to your shortform.

Jeremy Gillen May 22, 2025, 2:39 PM
15 points
8
in reply to: Algon’s comment on: Eliezer and I wrote a book: If Anyone Builds It, Everyone Dies
One notable difficulty with talking to ordinary people about this stuff is that often, you lay out the basic case and people go “That’s neat. Hey, how about that weather?” There’s a missing mood, a sense that the person listening didn’t grok the implications of what they’re hearing.
I kinda think that people are correct to do this, given the normal epistemic environment. My model is this: Everyone is pretty frequently bombarded with wild arguments and beliefs that have crazy implications. Like conspiracy theories, political claims, spiritual claims, get-rich-quick schemes, scientific discoveries, news headlines, mental health and wellness claims, alternative medicine, claims about which lifestyles are better. We don’t often have the time (nor expertise or skill or sometimes intelligence) to evaluate them properly. So we usually keep track of a bunch of these beliefs and arguments, and talk about them, but usually require nearby social proof in order to attach the arguments/beliefs to actions and emotions. Rationalists (and the more culty religions and many activist groups, etc.) are extreme in how much they change their everyday lives based on their beliefs.
I think it’s probably okay to let people maintain this detachment? Maybe even better, because it avoids activating antibodies. It’s (usefully) something that’s hard to change with argument. It will plausibly fix itself later, if there ever comes a time when their friends are voting or protesting or something.
I recently told my dad that I wasn’t trying to save for retirement. This horrified him far more than when I had previously told him that I didn’t expect anyone to survive the next couple of decades. The contrast was funny.

Jeremy Gillen May 18, 2025, 5:34 PM
LW: 3 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
I think different views about the extent to which future powerful AIs will deeply integrate their superhuman abilities versus these abilities being shallowly attached partially drive some disagreements about misalignment risk and what takeoff will look like.
I think this might be wrong when it comes to our disagreements, because I don’t disagree with this shortform.^[1] Maybe a bigger crux is how valuable (1) is relative to (2)? Or the extent to which (2) is more helpful for scientific progress than (1)?
1. ^
  As long as “downstream performance” doesn’t include downstream performance on tasks that themselves involve a bunch of integrating/generalising.

Jeremy Gillen May 18, 2025, 12:22 PM
7 points
−1
in reply to: Seth Herd’s comment on: Problems with instruction-following as an alignment target
If you have an alternate theory of the likely form of first takeover-capable AGI, I’d love to hear it!
I’m not claiming anything about the first takeover-capable AGI, and I’m not claiming it won’t be LLM-based. I’m just saying that there’s a specific reasoning step that you’re using a lot (current tech has property X, therefore AGI has property almost-X) which I think is invalid (when X is entangled with properties of AGI that LLMs don’t currently have).
Maybe a slightly insulting analogy (sorry): That type of reasoning looks a lot like bad scifi ideas about AI, where people reason like “AI is a program on a computer, programs on computers can’t do {intuition, fuzzy reasoning, logical paradoxes, emotion}, therefore AI will be {logical, calculator-like, vulnerable to paradoxes, not understand emotion, etc.}”. The reasoning step doesn’t work, because it’s focusing on the “logical program” part over the “AGI” part. I think you’re focusing too much on the “LLM-based” part of “LLM-based AGI”, even in cases where the “AGI” part tells you much more.
(We’re having two similar discussions in parallel, so I’m responding to this in a way that might be useful to other people, but I don’t expect it to be useful to you, since I’ve already said this in the other discussion).

Jeremy Gillen May 16, 2025, 9:26 AM
7 points
2
on: Problems with instruction-following as an alignment target
(A small rant, sorry) In general, it seems you’re massively overanchored on current AI technology, to an extent that it’s stopping you from clearly reasoning about future technology. One example is the jailbreaking section:
There has been no noticeable trend toward real jailbreak resistance as LLMs have progressed, so we should probably anticipate that LLM-based AGI will be at least somewhat vulnerable to jailbreaks.
You’re talking about AGI here. An agent capable of autonomously doing research, playing games with clever adversaries, detecting and patching its own biases, etc. It should be obvious that you can’t use current LLM flaws as a method of extrapolating the adversarial robustness of this program.
Second,
If IF were the only target of alignment training in its starting state, it seems likely this strong “center of gravity” would guide it to adopt IF as a reflectively stable goal. If it has multiple goals (e.g., following instructions in some cases, refusing instructions for ethical reasons in other cases, and RL for various criteria), its evolution and ultimate alignment seems much less easy to predict.
No. You’re entirely ignoring inner alignment difficulties. The main difficulties. There are several degrees of freedom in goal specification that a pure goal target fails to nail down. These lead to an unpredictable reflective equilibrium.

Jeremy Gillen May 3, 2025, 6:03 PM
4 points
2
in reply to: Lucius Bushnaq’s comment on: RA x ControlAI video: What if AI just keeps getting smarter?
Good point, I shouldn’t have said dishonest. For some reason while writing the comment I was thinking of it as deliberately throwing vaguely related math at the viewer and trusting that they won’t understand it. But yeah likely it’s just a misunderstanding.

Jeremy Gillen May 3, 2025, 12:12 PM
10 points
10
on: RA x ControlAI video: What if AI just keeps getting smarter?
The way we train AIs draws on fundamental principles of computation that suggest any intellectual task humans can do, a sufficiently large AI model should also be able to do. [Universal approximation theorem on screen]
IMO it’s dishonest to show the universal approximation theorem. Lots of hypothesis spaces (e.g. polynomials, sinusoids) have the same property. It’s not relevant to predictions about how well the learning algorithm generalises. And that’s the vastly more important factor for general capabilities.
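A minimal sketch of the polynomial case, under assumptions of my own choosing (the target function, the degrees, and all function names are illustrative and not from the video or the comment). Polynomials can approximate any continuous function on an interval arbitrarily well, yet how well a fitted polynomial does between its training points depends on the fitting procedure. Here both procedures fit their sample points exactly and differ only in where the samples are placed:

```python
import math

def runge(x):
    """Target function (assumed for illustration): smooth and bounded on [-1, 1]."""
    return 1.0 / (1.0 + 25.0 * x * x)

def lagrange_eval(nodes, values, x):
    """Evaluate the unique polynomial passing through (nodes, values) at x."""
    total = 0.0
    for j, xj in enumerate(nodes):
        basis = 1.0
        for m, xm in enumerate(nodes):
            if m != j:
                basis *= (x - xm) / (xj - xm)
        total += values[j] * basis
    return total

def equispaced_nodes(degree):
    """degree + 1 evenly spaced sample points on [-1, 1]."""
    return [-1.0 + 2.0 * i / degree for i in range(degree + 1)]

def chebyshev_nodes(degree):
    """degree + 1 Chebyshev sample points, clustered towards the endpoints."""
    n = degree + 1
    return [math.cos((2 * i + 1) * math.pi / (2 * n)) for i in range(n)]

def max_error(nodes, grid_size=1001):
    """Largest gap between the fitted polynomial and the target on a fine grid."""
    values = [runge(x) for x in nodes]
    grid = [-1.0 + 2.0 * k / (grid_size - 1) for k in range(grid_size)]
    return max(abs(lagrange_eval(nodes, values, x) - runge(x)) for x in grid)

def compare(degrees=(10, 20)):
    """Off-sample error for each degree under the two sampling schemes."""
    return {
        d: {
            "equispaced": max_error(equispaced_nodes(d)),
            "chebyshev": max_error(chebyshev_nodes(d)),
        }
        for d in degrees
    }

if __name__ == "__main__":
    for degree, errors in compare().items():
        print(degree, errors)
```

With evenly spaced samples the off-sample error grows as the degree goes up, while with Chebyshev samples it shrinks, even though the hypothesis space is identical in both cases. This is only a toy analogue of the point about learning algorithms, not a claim about neural networks.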

Jeremy Gillen May 2, 2025, 4:36 PM
4 points
0
in reply to: Seth Herd’s comment on: Veedrac’s Shortform
If we can clearly tie the argument for AGI x-risk to agency, I think it won’t have the same problem
Yeah agreed, and it’s really hard to get the implications right here without a long description. In my mind entities didn’t trigger any association with agents, but I can see how it would for others.
This thread helped inspire me to write the brief post Anthropomorphizing AI might be good, actually.
I broadly agree that many people would be better off anthropomorphising future AI systems more. I sometimes push for this in arguments, because in my mind many people have massively overanchored on the particular properties of current LLMs and LLM agents. I’m less a fan of your part of that post that involves accelerating anything.
One could say “well LLMs are already superhuman at some stuff and they don’t seem to have instrumental goals”. And that will become more compelling as LLMs keep getting better in narrow domains.
Yeah, but the line “capable systems necessarily have instrumental goals” helps clarify what you mean by “capable systems”. It must be some definition that (at least plausibly) implies instrumental goals.
Kat Woods’ tweet is an interesting case. I actually think her point is absolutely right as far as it goes
Huh I suspect that the disagreement about that tweet might come from dumb terminology fuzziness. I’m not really sure what she means by “the specification problem” when we’re in the context of generative models trained to imitate. It’s a problem that makes sense in a different context. But the central disagreement is that she thinks current observations (of “alignment behaviour” in particular) are very surprising, which just seems wrong. My response was this:

Jeremy Gillen May 1, 2025, 11:31 AM
9 points
2
in reply to: Seth Herd’s comment on: Veedrac’s Shortform
This seems rhetorically better, but I think it is implicitly relying on instrumental goals and it’s hiding that under intuitions about smartness and human competition. This will work for people who have good intuitions about that stuff, but won’t work for people who don’t see the necessity of goals and instrumental goals. I like Veedrac’s better in terms of exposing the underlying reasoning.
I think it’s really important to avoid making arguments that are too strong and fuzzy, like yours. Imagine a person reads your argument and now believes that intuitively smart AI entities will be dangerous, via outsmarting us etc. Then Claude 5 comes out and matches their intuition for smart AI entity, but (let’s assume) still isn’t great at goal-directedness. Then after Claude 5 hasn’t done any damage for a while, they’ll conclude that the reasoning leading to dangerousness must be wrong. Maybe they’ll think that alignment actually turned out to be easy.
Something like this seems to have already happened to a bunch of people. E.g. I’ve heard someone at DeepMind say “Doesn’t constitutional AI solve alignment?”. Kat’s post here^[1] seems to be basically the same error, in that Kat seems to have predicted more overt evilness from LLM agents and is surprised by the lack of that, and has thereby updated that maybe some part of alignment is actually easy. Possibly Turntrout is another example, although there’s more subtlety there. I think he’s correct that, given his beliefs about where capabilities come from, the argument for deceptive alignment (an instrumental goal) doesn’t go through.
In other words, your argument is too easily “falsified” by evidence that isn’t directly relevant to the real reason for being worried about AI. More precision is necessary to avoid this, and I think Veedrac’s summary mostly succeeds at that.
1. ^

Jeremy Gillen Apr 13, 2025, 12:37 PM
3 points
0
on: How training-gamers might function (and win)
Nice, you’ve expressed the generalization argument for expecting goal-directedness really well. Most of the post seems to match my beliefs.
I’m moderately optimistic about blackbox control (maybe 50-70% risk reduction on high-stakes failures?).
I want you to clarify what this means, and try to get some of the latent variables behind it.
One interpretation is that you mean any specific high-stakes attempt to subvert control measures is 50-70% likely to fail. But if we kept doing approximately the same set-up after this, then an attempt would soon succeed with high probability.
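A rough sketch of the arithmetic behind that interpretation, assuming (my simplification, not something stated in the post) that attempts are independent and each fails with the same probability. The function name is hypothetical:

```python
def prob_any_success(p_fail_each, attempts):
    """Chance that at least one of `attempts` independent tries succeeds,
    when each try fails with probability `p_fail_each`."""
    if not 0.0 <= p_fail_each <= 1.0:
        raise ValueError("p_fail_each must be a probability in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts fail with probability p_fail_each ** attempts.
    return 1.0 - p_fail_each ** attempts

if __name__ == "__main__":
    for p_fail in (0.5, 0.7):
        for n in (1, 5, 10):
            print(p_fail, n, round(prob_any_success(p_fail, n), 3))
```

Under these assumptions, even a 70% per-attempt failure rate leaves roughly a 97% chance that one of ten attempts gets through. Real attempts would not be independent, so this is only meant to show why the per-attempt reading and the overall-outcome reading come apart.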
The way I think about blackbox control, the success condition is “getting an alignment solution that we trust”. So another interpretation is you’re saying 30-50% chance of this (maybe conditioned on no low-stakes sabotage)? If so, this implies a guess about the scale of effort required. Could you give a description of that scale? Because if you’re thinking Manhattan-project-scale with lots of exploration and freedom of experimentation, then this is a very different level of optimism than if you’re thinking about a single PhD project scale.
I’m high on scheming probability (65% or higher on inside view but lower once I defer to people).
Why not higher? I don’t see where the inside view uncertainty is coming from. Is it uncertainty over how future training will be done? Or uncertainty over the shape of mind-space-in-general, where accidental implicit biases related to e.g. {“conservativeness”, “trying-hard”, “follow heuristics instead of supergoal reasoning”} might make instrumental reward seeking unlikely by default?

Jeremy Gillen Apr 9, 2025, 5:24 PM
20 points
12
in reply to: Alexander Gietelink Oldenziel’s comment on: abramdemski’s Shortform
It’s not about building less useful technology, that’s not what Abram or Ryan are talking about (I assume). The field of alignment has always been about strongly superhuman agents. You can have tech that is useful and also safe to use, there’s no direct contradiction here.
Maybe one weak-ish historical analogy is explosives? Some explosives are unstable, and will easily explode by accident. Some are extremely stable, and can only be set off by a detonator. Early in the industrial chemistry tech tree, you only have access to one or two ways to make explosives. If you’re desperate, you use these whether or not they are stable, because the risk-usefulness tradeoff is worth it. A bunch of your soldiers will die, and your weapons caches will be easier to destroy, but that’s a cost you might be willing to pay. As your industrial chemistry tech advances, you invent many different types of explosive, and among these choices you find ones that are both stable explosives and effective, because obviously this is better in every way.
Maybe another is medications? As medications advanced, as we gained choice and specificity in medications, we could choose medications that had both low side-effects and were effective. Before that, there was often a choice, and the correct choice was often to not use the medicine unless you were literally dying.
In both these examples, sometimes the safety-usefulness tradeoff was worth it, sometimes not. Presumably, in both cases, people often made the choice not to use unsafe explosives or unsafe medicine, because the risk wasn’t worth it.
As it is with these technologies, so it is with AGI. There are a bunch of future paradigms of AGI building. The first one we stumble into isn’t looking like one where we can precisely specify what it wants. But if we were able to keep experimenting and understanding and iterating after the first AGI, and we gradually developed dozens of ways of building AGI, then I’m confident we could find one that is just as intelligent and also could have its goals precisely specified.
My two examples above don’t quite answer your question, because “humanity” didn’t steer away from using them, just individual people at particular times. For examples where all or large sections of humanity steered away from using an extremely useful tech whose risks purportedly outweighed benefits: Project Plowshare, nuclear power in some countries, GMO food in some countries, viral bioweapons (as far as I know), eugenics, stem cell research, cloning. Also {CFCs, asbestos, leaded petrol, CO2 to some extent, radium, cocaine, heroin} after the negative externalities were well known.
I guess my point is that safety-usefulness tradeoffs are everywhere, and tech development choices that take into account risks are made all the time. To me, this makes your question utterly confused. Building technology that actually does what you want (which is to be safe and useful) is just standard practice. This is what everyone does, all the time, because obviously safety is one of the design requirements of whatever you’re building.
The main difference between the above technologies and AGI is that it’s a trapdoor. The cost of messing up AGI is that you lose any chance to try again. AGI shares with some of the above technologies an epistemic problem. For many of them it isn’t clear in advance, to most people, how much risk there actually is, and therefore whether the tradeoff is worth it.
After writing this, it occurred to me that maybe by “competitive” you meant “earlier in the tech tree”? I interpreted it in my comment as a synonym of “useful” in a sense that excluded safe-to-use.

Jeremy Gillen Apr 7, 2025, 2:58 PM
5 points
1
in reply to: J Bostock’s comment on: Jemist’s Shortform
Can you link to where RP says that?

Jeremy Gillen Apr 4, 2025, 8:44 PM
2 points
0
in reply to: Garrett Baker’s comment on: Changing my mind about Christiano’s malign prior argument
Do you not see how they could be used here?
This one. I’m confused about what the intuitive intended meaning of the symbol is. Sorry, I see why “type signature” was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe $L_{A}$ is a boolean fact that is edited? But if so I don’t know which fact it is, and I’m confused by the way you described it.
Because we’re talking about priors and their influence, all of this is happening inside the agent’s brain. The agent is going about daily life, and thinks “hm, maybe there is an evil demon simulating me who will give me −10¹⁰ utility if I don’t do what they want for my next action”. I don’t see why this is obviously ill-defined without further specification of the training setup.
Can we replace this with: “The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to −10¹⁰ utility.”? This is what it’s like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it’s in a simulation.

Jeremy Gillen Apr 4, 2025, 8:05 PM
2 points
0
in reply to: Garrett Baker’s comment on: Changing my mind about Christiano’s malign prior argument
I’m not sure what the type signature of $L_{A}$ is, or what it means to “not take into account $M$’s simulation”. When $A$ makes decisions about which actions to take, it doesn’t have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to “not take it into account”?
So the way in which the agent “gets its beliefs” about the structure of the decision theory problem is via these logical-counterfactual-conditional operation
I think you’ve misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it’s in a particular situation (e.g. with omega and knowing how omega will react to different actions). But in reality, an agent has to learn which kind of world it’s in using an inductor. That’s all I meant by “get its beliefs”.