Fabien Roger comments on Research directions Open Phil wants to fund in technical AI safety

Fabien Roger 8 Feb 2025 19:10 UTC
LW: 8 AF: 6
−1
AF
- Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
- Mild optimization. I’m particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
I think additional non-moonshot work in these domains will have a very hard time helping.
[low confidence] My high-level concern is that non-moonshot work in these clusters may be the sort of thing labs will use anyway (with or without a safety push) if it helps with capabilities, because the techniques are easy to find, and won’t use if it doesn’t help with capabilities, because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters.
[edit: I was at least somewhat wrong, see comment threads below]
Here are things that would change my mind:
- If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that seem hard enough to make workable in practice that capabilities researchers won’t bother using by default, but would bother using if there was existing work on how to make them work.
  - (For the different question of preventing AIs from using “harmful knowledge”, I think work on robust unlearning and gradient routing may have this property—the current SoTA is far enough from a solution that I expect labs to not bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
- If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
  - (In the different domain of CoT faithfulness, I think there is a lot of value in demonstrating the risk of opaque CoT well enough that labs don’t build techniques that make CoT more opaque if doing so only slightly increased performance, because I expect that it will be easier to justify. I think GDM’s updated safety framework is a good step in this direction as it hints at additional requirements GDM may have to fulfill if it wanted to deploy models with opaque CoT past a certain level of capabilities.)
- If I thought that research directions included in the cluster you are pointing at were making progress towards speeding up capabilities in safety-critical domains (e.g. conceptual thinking on alignment, being trusted neutral advisors on geopolitical matters, …) relative to baseline methods (i.e. the sort of RL you would do by default if you wanted to make the model better at the safety-critical task and had no awareness of anything people did in the safety literature).
I am not very aware of what is going on in this part of the AI safety field. It might be the case that I would change my mind if I was aware of certain existing pieces of work or certain arguments. In particular I might be too skeptical about progress on methods for things like debate and process-based supervision—I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
It’s also possible that I am missing an important theory of change for this sort of research.
(I’d still like it if more people worked on alignment: I am excited about projects that look more like the moonshots described in the RFP and less like the kind of research I think you are pointing at.)
- Rohin Shah 8 Feb 2025 22:35 UTC
  LW: 4 AF: 5
  1
  AF Parent
  I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
  I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn’t have to grapple as hard with the challenge of defining the approval feedback as I’d expect in a realistic deployment. But it does impose an alignment tax, so there’s no point in using MONA currently, when good enough alignment is easy to achieve with RLHF and its variants, or RL on ground truth signals. I guess in some sense the question is “how big is the alignment tax”, and I agree we don’t know the answer to that yet and may not have enough understanding by the time it is relevant, but I don’t really see why one would think “nah it’ll only work in toy domains”.
  I agree debate doesn’t work yet, though I think >50% chance we demonstrate decent results in some LLM domain (possibly a “toy” one) by the end of this year. Currently it seems to me like a key bottleneck (possibly the only one) is model capability, similarly to how model capability was a bottleneck to achieving the value of RL on ground truth until ~2024.
  It also seems like it would still be useful if the methods were used some time after the labs want to use it for production runs.
  It’s wild to me that you’re into moonshots when your objection to existing proposals is roughly “there isn’t enough time for research to make them useful”. Are you expecting the moonshots to be useful immediately?
  - Fabien Roger 9 Feb 2025 14:14 UTC
    LW: 5 AF: 4
    0
    AF Parent
    Which of the bullet points in my original message do you think is wrong? Do you think MONA and debate papers are:
    - on the path to techniques that measurably improve feedback quality on real domains with potentially a low alignment tax, and that are hard enough to find that labs won’t use them by default?
    - on the path to providing enough evidence of their good properties that even if they did not measurably help with feedback quality in real domains (and slightly increased cost), labs could be convinced to use them because they are expected to improve non-measurable feedback quality?
    - on the path to speeding up safety-critical domains?
    - (valuable for some other reason?)
    - Rohin Shah 9 Feb 2025 18:25 UTC
      LW: 2 AF: 2
      0
      AF Parent
      I have some credence in all three of those bullet points.
      For MONA it’s a relatively even mixture of the first and second points.
      (You are possibly the first person I know of who reacted to MONA with “that’s obvious” instead of “that obviously won’t perform well, why would anyone ever do it”. Admittedly you are imagining a future hypothetical where it’s obvious to everyone that long-term optimization is causing problems, but I don’t think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
      For debate it’s mostly the first point, and to some extent the third point.
      - Fabien Roger 10 Feb 2025 12:34 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Admittedly you are imagining a future hypothetical where it’s obvious to everyone that long-term optimization is causing problems, but I don’t think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality
        That’s right. If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it.
        In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won’t have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards. I would give different advice if:
        - the alignment tax of MONA were tiny
        - there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA
        I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don’t disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).
        - Rohin Shah 11 Feb 2025 23:34 UTC
        LW: 2 AF: 2
        0
        AF Parent
        If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point
        That doesn’t seem right. It can simultaneously be the case that you can’t tell that there are problems stemming from long-term optimization when you don’t use MONA, and also that if you actually use MONA, it will measurably improve quality.
        For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we’d penalize if we knew about it, but we don’t realize that’s happening). Later when things are put into production errors happen, but it’s chalked up to “well it’s hard to anticipate everything”.
        Instead you use MONA, and it doesn’t learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent’s lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
        - Fabien Roger 14 Feb 2025 10:38 UTC
        LW: 3 AF: 3
        0
        AF Parent
        To rephrase what I think you are saying are situations where work on MONA is very helpful:
        By default people get bitten by long-term optimization. They notice issues in prod because it’s hard to catch everything. They patch individual failures when they come up, but don’t notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don’t try MONA-like techniques because it’s not very salient when you are trying to fix individual failures and does not pass cost-benefit to fix individual failures.
        If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly and because people don’t by default think of the individual failures you’ve seen in prod as symptoms of their long-term optimization, but your empirical work pushes them over the line and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk—though given competitive pressures, that might not be the main factor driving decisions and so you don’t have to make an ironclad case for MONA reducing catastrophic risk).
        Is that right?
        I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
        - Rohin Shah 14 Feb 2025 10:54 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Is that right?
        Yes, that’s broadly accurate, though one clarification:
        This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
        That’s a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
        I think this will become much more likely once we actually start observing long-term optimization failures in prod.
        Agreed, we’re not advocating for using MONA now (and say so in the paper).
        Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
        Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)
      - ryan_greenblatt 9 Feb 2025 19:17 UTC
        LW: 2 AF: 2
        0
        AF Parent
        
        You are possibly the first person I know of who reacted to MONA with “that’s obvious”
        
        I also have the “that’s obvious” reaction, but possibly I’m missing some details. I also think it won’t perform well enough in practice to pencil out given other better places to allocate safety budget (if it does trade off, which is unclear).
        - Rohin Shah 10 Feb 2025 5:22 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I meant “it’s obvious you should use MONA if you are seeing problems with long-term optimization”, which I believe is Fabien’s position (otherwise it would be “hard to find”).
        Your reaction seems more like “it’s obvious MONA would prevent multi-step reward hacks”; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
  - Fabien Roger 9 Feb 2025 13:52 UTC
    LW: 2 AF: 2
    0
    AF Parent
    I don’t really see why one would think “nah it’ll only work in toy domains”.
    That is not my claim. By “I would have guessed that methods work on toy domains and math will be useless” I meant “I would have guessed that if a lab decided to do process-based feedback, it would be better off not doing a detailed literature review of methods in MONA and follow-ups on toy domains, and instead just doing the process-based supervision that makes sense in the real domain it is now looking at. The only part of the method section of MONA papers that matters might be ‘we did process-based supervision’.”
    I did not say “methods that work on toy domains will be useless” (my sentence was easy to misread).
    I almost have the opposite stance: I am closer to “it’s so obvious that process-based feedback helps that if capabilities people ever had issues stemming from long-term optimization, they would obviously use more myopic objectives. So process-based feedback so obviously prevents problems from non-myopia in real life that the experiments in the MONA paper don’t increase the probability that people working on capabilities implement myopic objectives.”
    But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings (above the “capability researcher does what looks easiest and sensible to them” baseline)?
    I agree debate doesn’t work yet
    Not a crux. My guess is that if debate did “work” to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality. So my low confidence guess is that it’s not high priority for people working on x-risk safety to speed up that work.
    But I am excited about debate work that is not just about improving feedback quality. For example I am interested in debate vs schemers or debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn’t show up in “feedback quality benchmarks (e.g. rewardbench)” but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
    - Rohin Shah 9 Feb 2025 18:23 UTC
      LW: 2 AF: 2
      0
      AF Parent
      Got it, that makes more sense. (When you said “methods work on toy domains” I interpreted “work” as a verb rather than a noun.)
      But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings
      I think by far the biggest open question is “how do you provide the nonmyopic approval so that the model actually performs well”. I don’t think anyone has even attempted to tackle this so it’s hard to tell what you could learn about it, but I’d be surprised if there weren’t generalizable lessons to be learned.
      I agree that there’s not much benefit in “methods work” if that is understood as “work on the algorithm / code that given data + rewards / approvals translates it into gradient updates”. I care a lot more about iterating on how to produce the data + rewards / approvals.
      My guess is that if debate did “work” to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality.
      I’d weakly bet against this, I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multiagent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)
      For example I am interested in [...] debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn’t show up in “feedback quality benchmarks (e.g. rewardbench)” but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
      I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
      EDIT: And tbc this is the kind of thing I mean by “improving average-case feedback quality”. I now feel like I don’t know what you mean by “feedback quality”.
      - Fabien Roger 10 Feb 2025 12:21 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
        My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP but not the debate vs subtle reward hacks that don’t show up in feedback quality evals.
        I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.