Geoffrey Irving

Karma: 719

Chief Scientist at the UK AI Safety Institute (AISI). Previously, DeepMind, OpenAI, Google Brain, etc.

Geoffrey Irving May 29, 2025, 8:42 AM
LW: 1 AF: 1
0
AF
in reply to: Wei Dai’s comment on: An alignment safety case sketch based on debate
Continuing with the Newtonian physics analogy, the case for optimism would be:

1. We have some theories with limited domain of applicability. Say, theory A.
2. Theory A is wrong at some limit, where it is replaced by theory B. Theory B is still wrong, but it has a larger domain of applicability.
3. We don’t know theory B, and can’t access it despite our best scalable oversight techniques, even though the AIs do figure out theory B. (This is the hard case: I think there other cases where scalable oversight does work.)
4. However, we do have some purchase on the domain of applicability of theory A: we know the limits of where it’s been tested (energy levels, length scales, etc.).
5. Scalable oversight has an easier job talking about these limits to theory A than it doesn’t about theory B itself. Concretely, what this means is that you can express arguments like “theory A doesn’t resolve question Q, as the answer depends on applying theory A beyond it’s decent-confidence domain of applicability”.
6. Profit.

This gives you a capability cap: the AIs know theory B but you can’t use it. But I do think if you can pull off the necessary restriction to which questions you can answer you can muddle through, even if you know only theory A and have some sense of its limits. The limits of Newtonian physics started to appear long before the replacement theories (relativity and quantum). I think we’re in a similar place with the philosophical worries: we have both a bunch of specific games that fail with older theories, and a bunch of proposals (say, variants of FDT) without a clear winner.

The additional big thing you need here is a property of the world that makes that capability cap okay: if the only way to succeed is find perfect solutions using theory B, say because that gives you a necessary edge in an adversarial competition between multiple AIs, then lacking theory B sinks you. But I think we have a shot about not being in the worst case here.

(Sorry as well for delay! Was sick.)

Geoffrey Irving May 15, 2025, 9:37 AM
LW: 4 AF: 4
0
AF
in reply to: Wei Dai’s comment on: An alignment safety case sketch based on debate
The Dodging systematic human errors in scalable oversight post is out as you saw, we can mostly take the conversation over there. But briefly, I think I’m mostly just more bullish on the margin than you about the (1) the probability that we can in fact make purchase on the hard philosophy, should that be necessary and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if fail at case (1), we’re more likely to recognise it and try to get utility out of (2) on other questions.

Part of this is that my mental model of formalisations standing the test of time is that we do have a lot of these: both of the links you point to are formalisations that have stood the test of time and have some reasonable domain of applicability in which they say useful things. I agree they aren’t bulletproof, but I think I’d place more chance than you of muddling through with imperfect machinery. This is similar to physics: I would argue for example that Newtonian physics has stood the test of time even though it is wrong, as it still applies across a large domain of applicability.

That said, I’m not all confident in this picture: I’d place a lower probability than you on these considerations biting, but not that low.

Geoffrey Irving May 15, 2025, 9:29 AM
LW: 4 AF: 4
0
AF
in reply to: Wei Dai’s comment on: Dodging systematic human errors in scalable oversight
I think there are roughly two things you can do:
1. In some cases, we will be able to get more accurate answers if we spend more resources (teams of people with more expertise taking more time, etc.). If we can do that, and we know μ (which is hard), we can get some purchase on ε.
2. We set tune ε not based on what’s safe, but based on what is competitive. I.e., we want to solve some particular task domain (AI safety research or the like), and we increase ε until it starts to break making progress, then dial it back a bit. This option isn’t amazing, but I do think is a move we’ll have a take for a bunch of safety parameters, assuming there are parameters which have some capability cost.

Geoffrey Irving May 9, 2025, 12:26 PM
LW: 5 AF: 5
0
AF
in reply to: Wei Dai’s comment on: An alignment safety case sketch based on debate
I broadly agree with these concerns. I think we can split it into (1) the general issue of AGI/ASI driving humans out of distribution and (2) the specific issue of how assumptions about human data quality as used in debate will break down. For (2), we’ll have a short doc soon (next week or so) which is somewhat related, along the lines of “assume humans are right most of the time on a natural distribution, and search for protocols which report uncertainty if the distribution induced by a debate protocol on some new class of questions is sufficiently different”. Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.
One general note is that scalable oversight is a method for accelerating an intractable computation built out of tractable components, and these components can include both human and conventional software. So if you understand the domain somewhat well, you can try mitigate failures of (2) (and potentially gain more traction on (1)) by formalising part of the domain. And this formalisation can be bootstrapped: you can use on-distribution human data to check specifications, and then use those specifications (code, proofs, etc.) in order to rely on human queries for a smaller portion of the over next-stage computation. But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.

Geoffrey Irving May 9, 2025, 12:12 PM
LW: 6 AF: 3
0
AF
in reply to: Charlie Steiner’s comment on: UK AISI’s Alignment Team: Research Agenda
These buckets seem reasonable, and +1 to it being important that some of the resulting ideas are independent of debate. In particular on the inner alignment this exercise (1) made me think exploration hacking might be a larger fraction of the problem than I had thought before, which is encouraging as it might be tractable, but (2) there may be an opening for learning theory that tries to say something about residual error along the lines of https://x.com/geoffreyirving/status/1920554467558105454.

On the systematic human error front, we’ll put out a short post on that soon (next week or so), but broadly the framing is to start with a computation which consults humans, and instead of assuming the humans are have unbiased error instead assume that the humans are wrong on some unknown ε-fraction of queries w.r.t. to some distribution. You can then try to change the debate protocol so that it detects if you can choose the ε-fraction to flip the answer, and report uncertainty in this case. This still requires you to make some big assumption about humans, but is a weaker assumption, and leads to specific ideas for protocol changes.

Geoffrey Irving Feb 19, 2025, 9:23 PM
2 points
0
in reply to: Jordan Taylor’s comment on: Eliciting bad contexts
I do mostly think this requires whitebox methods to reliably solve, so it would be less “narrow down the search space” and more “search via a different parameterisation that takes advantage of whitebox knowledge”.

Geoffrey Irving Feb 7, 2025, 2:50 PM
LW: 2 AF: 2
0
AF
in reply to: harfe’s comment on: Debate, Oracles, and Obfuscated Arguments
You’d need some coupling argument to know that the problems have related difficulty, so that if A is constantly saying “I don’t know” to other similar problems it counts as evidence that A can’t reliably know the answer to this one. But to be clear, we don’t know how to make this particular protocol go through, since we don’t know how to formalise that kind of similarity assumption in a plausibly useful way. We do know a different protocol with better properties (coming soon).

Geoffrey Irving Feb 6, 2025, 9:56 AM
LW: 9 AF: 6
0
AF
in reply to: TurnTrout’s comment on: Eliciting bad contexts
I’m very much in agreement that this is a problem, and among other things blocks us from knowing how to use adversarial attack methods (and AISI teams!) from helping here. Your proposed definition feels like it might be an important part of the story but not the full story, though, since it’s output only: I would unfortunately expect a decent probability of strong jailbreaks that (1) don’t count as intent misalignment but (2) jump you into that kind of red attractor basin. Certainly ending up in that kind of basin could cause a catastrophe, and I would like to avoid it, but I think there is a meaningful notion of “the AI is unlikely to end up in that basin of its own accord, under nonadversarial distributions of inputs”.

Have you seen good attempts at input-side definitions along those lines? Perhaps an ideal story here would be a combination of an input-side definition and the kind of output-side definition you’re pointing at.

Geoffrey Irving Nov 12, 2024, 1:37 PM
LW: 1 AF: 1
0
AF
in reply to: Martín Soto’s comment on: Automation collapse
Yes, that is a clean alternative framing!

Geoffrey Irving Oct 22, 2024, 7:44 AM
LW: 2 AF: 2
0
AF
in reply to: Raymond Douglas’s comment on: Automation collapse
Bounding the space of controller actions more is the key bit. The (vague) claim is that if you have an argument that an empirically tested automated safety scheme is safe, in sense that you’ll know if the output is correct, you may be able to find a more constrained setup where more of the structure is human-defined and easier to analyze, and that the originally argument may port over to the constrained setup.

I’m not claiming this is always possible, though, just that it’s worth searching for. Currently the situation is that we don’t have well-developed arguments that we can recognize the correctness of automated safety work, so it’s hard to test the “less automation” hypothesis concretely.

I don’t think all automated research is automated safety: certainly you can do automated pure capabilities. But I may have misunderstood that part of the question.

Geoffrey Irving Oct 21, 2024, 7:44 PM
LW: 2 AF: 2
0
AF
in reply to: Dave Orr’s comment on: Automation collapse
To clarify, such a jailbreak is a real jailbreak; the claim is that it might not count as much evidence of “intentional misalignment by a model”. If we’re happy to reject all models which can be jailbroken we’ve falsified the model, but if we want to allow models which can be jailbroken but are intent aligned we have a false negative signal for alignment.

Geoffrey Irving Oct 15, 2024, 1:58 PM
1 point
0
in reply to: davidad’s comment on: Proveably Safe Self Driving Cars
I would expect there to be a time complexity blowup if you try to drive the entropy all the way to zero, unfortunately: such things usually have a multiplier like $log (1 / ϵ)$ where $ϵ$ is the desired entropy leakage. In practice I think that would make it feasible to not leak something like a bit per sentence, and then if you have 1000 sentence you have 1000 bits. That may mean you can get a “not 1GB” guarantee, but not something smaller than that.

Geoffrey Irving Jul 29, 2024, 7:02 PM
1 point
0
in reply to: Amalthea’s comment on: Robert Caro And Mechanistic Models In Biography
I can confirm it’s very good!

Geoffrey Irving Jul 29, 2024, 6:38 PM
LW: 5 AF: 5
0
AF
on: Scalable oversight as a quantitative rather than qualitative problem
+1 to the quantitative story. I’ll do the stereotypical thing and add self-links: https://arxiv.org/abs/2311.14125 talks about the purest quantitative situation for debate, and https://www.alignmentforum.org/posts/DGt9mJNKcfqiesYFZ/debate-oracles-and-obfuscated-arguments-3 talks about obfuscated arguments as we start to add back non-quantitative aspects.

Geoffrey Irving Jul 29, 2024, 6:22 PM
LW: 3 AF: 3
0
AF
in reply to: Beth Barnes’s comment on: Debate, Oracles, and Obfuscated Arguments
Thank you!

I think my intuition is that weak obfuscated arguments occur often in the sense that it’s easy to construct examples where Alice thinks for a certain amount time and produces her best possible answer so far, but where she might know that further work would uncover better answers. This shows up for any task like “find me the best X”. But then for most such examples Bob can win if he gets to spend more resources, and then we can settle things by seeing if the answer flips based on who gets more resources.

What’s happening in the primality case is that there is an extremely wide gap between nothing and finding a prime factor. So somehow you have to show that this kind of wide gap only occurs along with extra structure that can be exploited.

Geoffrey Irving Jul 15, 2024, 2:11 PM
0 points
0
in reply to: adamShimi’s comment on: Robert Caro And Mechanistic Models In Biography
We’re not disagreeing: by “covers only two people” I meant “has only two book series”, not “each book series covers literally a single person”.

Geoffrey Irving 15 Jul 2024 10:22 UTC
3 points
0
on: Robert Caro And Mechanistic Models In Biography
Unfortunately all the positives of these books come paired with a critical flaw: Caro only manages to cover two people, and hasn’t even finished the second one!

Have you found other biographers who’ve reached a similar level? Maybe the closest I’ve found was “The Last Lion” by William Manchester, but it doesn’t really compare giving how much the author fawns over Churchill.

Geoffrey Irving 2 Jun 2024 11:19 UTC
149 points
11
on: Non-Disparagement Canaries for OpenAI
To be more explicit, I’m not under any nondisparagement agreement, nor have I ever been. I left OpenAI prior to my cliff, and have never had any vested OpenAI equity.

I am under a way more normal and time-bounded nonsolicit clause with Alphabet.

Geoffrey Irving 1 Jun 2024 22:48 UTC
34 points
9
in reply to: Neel Nanda’s comment on: Non-Disparagement Canaries for OpenAI
I endorse Neel’s argument.

(Also see more explicit comment above, apologies for trying to be cute. I do think I have already presented extensive evidence here.)

Geoffrey Irving 7 Mar 2024 15:17 UTC
LW: 28 AF: 18
9
AF
in reply to: Wei Dai’s comment on: Many arguments for AI x-risk are wrong
I certainly do think that debate is motivated by modeling agents as being optimized to increase their reward, and debate is an attempt at writing down a less hackable reward function. But I also think RL can be sensibly described as trying to increase reward, and generally don’t understand the section of the document that says it obviously is not doing that. And then if the RL algorithm is trying to increase reward, and there is a meta-learning phenomenon that cause agents to learn algorithms, then the agents will be trying to increase reward.

Reading through the section again, it seems like the claim is that my first sentence “debate is motivated by agents being optimized to increase reward” is categorically different than “debate is motivated by agents being themselves motivated to increase reward”. But these two cases seem separated only by a capability gap to me: sufficiently strong agents will be stronger if they record algorithms that adapt to increase reward in different cases.