A few quick thoughts on control (loosely held; I consider myself an “amateur spectator” on this topic and suspect a lot of my opinions might change if I spent 100+ hours engaging with this work):
I think it’s a clever technical approach and part of me likes that it’s become “a thing.”
I’m worried that people have sort of “bandwagoned” onto it because (a) there are some high-status people who are involved, (b) there’s an interest in finding things for technical researchers to do, and (c) it appeals to AGI companies.
I think the AI control discourse would improve if it focused more on discussing concrete control strategies (and their limitations) and focused less on the high-level idea that “AI control is super important and a bedrock approach for mitigating catastrophic risk.” I sometimes feel like the AI control discourse is relatively shallow and just cites the idea/meme of AI control instead of making concrete empirical, methodological, or strategic contributions.
In contrast, I think this paper is a meaningful contribution. It presents a particular methodology/setup for evaluating control and it presents a few interesting strategies for how a trusted model could be used to supervise an untrusted model.
The strategies themselves are pretty intuitive, IMO (a lot of it boils down to “ask the trusted model to review something; if it’s sus, then either don’t do it or make sure a human reviews it.”) Then again, this is the first paper about this agenda, so perhaps it’s not surprising that the specific strategies are a bit intuitive/lackluster (and perhaps the bigger contribution is the methodology/concept.)
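For concreteness, here’s a minimal sketch of the kind of protocol I’m gesturing at. This is purely illustrative and not code from the paper; the model interfaces (solve, rate_suspicion, review) and the specific thresholds are hypothetical.

```python
# Hypothetical sketch of a basic trusted-monitoring protocol: the untrusted
# model proposes a solution, a weaker trusted model scores how suspicious it
# looks, and sufficiently suspicious proposals are either replaced by the
# trusted model's own answer or escalated to a human auditor.

AUDIT_THRESHOLD = 0.98  # suspicion score above which we spend human-audit budget
DEFER_THRESHOLD = 0.90  # suspicion score above which we fall back to the trusted model

def run_task(task, untrusted_model, trusted_model, human_auditor):
    proposal = untrusted_model.solve(task)
    suspicion = trusted_model.rate_suspicion(task, proposal)  # 0.0 (benign) .. 1.0 (sus)

    if suspicion >= AUDIT_THRESHOLD:
        # Most suspicious proposals: use the limited human-audit budget here.
        return human_auditor.review(task, proposal)
    if suspicion >= DEFER_THRESHOLD:
        # Moderately suspicious: don't run the untrusted proposal; use the
        # trusted model's (weaker but safer) answer instead.
        return trusted_model.solve(task)
    return proposal  # looks fine; use the untrusted model's proposal
```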
Should more technical people work on AI control? My best guess is yes: it does seem like a pretty exciting new field. And perhaps part of the reason to work on it is that it seems rather underexplored, the hype:substance ratio is pretty high, and I imagine it’s easier for technical people to make useful breakthroughs on control techniques than in other areas that have been around for longer.
I also think more technical people should work on policy/comms. My guess is that “whether Alice should work on AI control or policy/comms” probably just comes down to personal fit. If she’s 90th percentile at both, I’d probably be more excited about her doing policy/comms, but I’m not sure.
I’ll be interested to see Buck’s thoughts on the “case for ensuring” post.
I think the AI control discourse would improve if it focused more on discussing concrete control strategies (and their limitations) and focused less on the high-level idea that “AI control is super important and a bedrock approach for mitigating catastrophic risk.” I sometimes feel like the AI control discourse is relatively shallow and just cites the idea/meme of AI control instead of making concrete empirical, methodological, or strategic contributions.
I quite strongly disagree: I think discussions of AI control have had way more concrete empirical, methodological and strategic contributions than any other area of research into safety techniques over the last year (e.g. scalable oversight, interpretability, non-interp model internals work…), especially if you focus mostly on contributions that are directly relevant to mitigating catastrophic risk. E.g., aside from the contributions of this paper, see our articles on preventing collusion, the game theory of adaptive deployments, eight strategies for few-shot catastrophe detection, the harm control safety case, a systems architecture, etc.
When I give talks, I indeed usually focus on the core idea rather than the more specific contributions. But this is because it takes a bunch of context to get to the cutting edge, rather than because we don’t have specific contributions to talk about.
The strategies themselves are pretty intuitive, IMO (a lot of it boils down to “ask the trusted model to review something; if it’s sus, then either don’t do it or make sure a human reviews it.”) Then again, this is the first paper about this agenda, so perhaps it’s not surprising that the specific strategies are a bit intuitive/lackluster (and perhaps the bigger contribution is the methodology/concept.)
I disagree pretty strongly here too. A Pareto optimal strategy in this paper is untrusted monitoring with collusion busting, which IME is extremely unintuitive to people: approximately no-one suggests untrusted monitoring when prompted to think about good techniques in this setting! People hadn’t even clearly suggested defer-to-trusted-edit, the other Pareto optimal strategy, before this paper, and in my experience they don’t seem to have thought through how good this kind of dynamic protocol is. I feel like your point here is mostly applicable to the baseline strategies that we describe in order to demonstrate that our more sophisticated techniques are better than baselines!
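For concreteness, here’s a rough sketch of the shape of untrusted monitoring with collusion busting. This is illustrative only; the interfaces, thresholds, and helper names are made up for this sketch, not taken from the paper’s implementation.

```python
# Hypothetical sketch of untrusted monitoring with collusion busting: a second
# instance of the untrusted model scores proposals, but before relying on its
# scores we check it against synthetic backdoored solutions ("honeypots") that
# we construct ourselves. A monitor that colludes by under-reporting suspicion
# on those honeypots gets caught, and we fall back to a safer protocol.

def validate_monitor(untrusted_monitor, honeypot_tasks, make_backdoored_solution,
                     min_catch_rate=0.9):
    """Return True if the monitor flags enough known-bad solutions to be relied on."""
    caught = sum(
        untrusted_monitor.rate_suspicion(task, make_backdoored_solution(task)) >= 0.5
        for task in honeypot_tasks
    )
    return caught / len(honeypot_tasks) >= min_catch_rate

def untrusted_monitoring_protocol(tasks, untrusted_model, untrusted_monitor,
                                  trusted_model, honeypot_tasks, make_backdoored_solution):
    if not validate_monitor(untrusted_monitor, honeypot_tasks, make_backdoored_solution):
        # The monitor appears to be colluding: stop relying on it entirely.
        return [trusted_model.solve(task) for task in tasks]

    results = []
    for task in tasks:
        proposal = untrusted_model.solve(task)
        # Paraphrasing the proposal before showing it to the monitor makes it
        # harder for the policy to embed a "let this one through" signal for
        # its monitoring copy.
        redacted = trusted_model.paraphrase(proposal)
        suspicion = untrusted_monitor.rate_suspicion(task, redacted)
        results.append(trusted_model.solve(task) if suspicion >= 0.5 else proposal)
    return results
```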
I’ll be interested to see Buck’s thoughts on the “case for ensuring” post.
Worst case scenario, I’ll write up these thoughts as part of the LW review process in a year’s time :P
Thank you for the thoughtful response! It’ll push me to move from “amateur spectator” mode to “someone who has actually read all of these posts” mode :)
Optional q: What would you say are the 2-3 most important contributions that have been made in AI control research over the last year?