Fabien Roger comments on Research directions Open Phil wants to fund in technical AI safety

Fabien Roger 10 Feb 2025 12:34 UTC
LW: 2 AF: 2
0
AF
Admittedly you are imagining a future hypothetical where it’s obvious to everyone that long-term optimization is causing problems, but I don’t think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality
That’s right. If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it.
In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won’t have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards. I would give different advice if
- the alignment tax of MONA were tiny
- there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA
I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don’t disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).
- Rohin Shah 11 Feb 2025 23:34 UTC
  LW: 2 AF: 2
  0
  AF Parent
  If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point
  That doesn’t seem right. It can simultaneously be the case that you can’t tell that there are problems stemming from long-term optimization problems when you don’t use MONA, and also if you actually use MONA, then it will measurably improve quality.
  For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we’d penalize if we knew about it, but we don’t realize that’s happening). Later when things are put into production errors happen, but it’s chalked up to “well it’s hard to anticipate everything”.
  Instead you use MONA, and it doesn’t learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent’s lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
  - Fabien Roger 14 Feb 2025 10:38 UTC
    LW: 3 AF: 3
    0
    AF Parent
    To rephrase what I think you are saying are situations where work on MONA is very helpful:
    By default people get bitten by long-term optimization. They notice issues in prod because it’s hard to catch everything. They patch individual failures when they come up, but don’t notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don’t try MONA-like techniques because it’s not very salient when you are trying to fix individual failures and does not pass cost-benefit to fix individual failures.
    If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly and because people don’t by default think of the individual failures you’ve seen in prod as symptoms of their long-term optimization, but your empirical work pushes them over the line and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk—though given competitive pressures, that might not be the main factor driving decisions and so you don’t have to make an ironclad case for MONA reducing catastrophic risk).
    Is that right?
    I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stems from long-term optimization?
    - Rohin Shah 14 Feb 2025 10:54 UTC
      LW: 2 AF: 2
      0
      AF Parent
      Is that right?
      Yes, that’s broadly accurate, though one clarification:
      This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
      That’s a reason (and is probably sufficient by itself), but I think a more important reason is that your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
      I think this will become much more likely once we actually start observing long-term optimization failures in prod.
      Agreed, we’re not advocating for using MONA now (and say so in the paper).
      Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stems from long-term optimization?
      Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)