Rohin Shah comments on Mark Xu’s Shortform

Rohin Shah 7 Oct 2024 9:13 UTC
LW: 9 AF: 6
This seems to presume that you can divide up research topics into “alignment” vs “control” but this seems wrong to me. E.g. my categorization would be something like:
- Clearly alignment: debate theory, certain flavors of process supervision
- Clearly control: removing affordances (e.g. “don’t connect the model to the Internet”)
- Could be either one: interpretability, critique models (in control this is called “untrusted monitoring”), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, …
Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn’t make much sense for research, most of which helps with both. It does make sense as a distinction for “what plan will you implement in practice”—but labs have said very little publicly about that.
Other things that characterize work done under the name of “control” so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you’re saying that people should do those things more, but this seems way more contentious and I’d probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).
Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception is if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about “trying to do what you want” (which only makes sense when applied to models, so has similar upshots).
Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.
- ryan_greenblatt 7 Oct 2024 16:01 UTC
  LW: 6 AF: 6
  This seems to presume that you can divide up research topics into “alignment” vs “control” but this seems wrong to me.
  
  Agreed. (And I made a comment to this effect on Mark’s draft of this short form.) I don’t think it is that productive to classify research directions. The central thing we’re trying to point at is a classification of the reasons why you might think that you’re fine, where one reason is “the AI couldn’t do bad stuff without getting caught” and another reason is “the AI wouldn’t do bad stuff” (obviously simplifying somewhat).
  
  I also think that white-box control should typically be considered pretty separately from black-box control, and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation by exogenously trying to determine the probe’s reliability and other important aspects of the probe. For instance, the probe can take advantage of black-box strategies, and how safe we are depends on how many actions AIs need to cause huge problems.)
  
  The claims I’d like to make are:
  - There should be more research that specifically targets a black-box control story and operates from a black-box control perspective. (I’d guess like 20% of the overall AI safety portfolio should be this and maybe 30% of safety work at labs. This is not including work like “get AIs to be more useful at alignment research and get better at DC eval methodology”, which should also be subsidized from a control perspective but seems non-central.)
  - When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable oversight style research), they should consider sometimes explicitly thinking about the control perspective and consider adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment style story they think about while doing their research, they should be roughly equally excited about thinking about the control story. I also find that the control story is often more straightforward.
  - People thinking about safety plans for near-term powerful AI should strongly consider control.
- ryan_greenblatt 7 Oct 2024 15:48 UTC
  LW: 5 AF: 5
  
  Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn’t make much sense for research, most of which helps with both. It does make sense as a distinction for “what plan will you implement in practice”—but labs have said very little publicly about that.
  
  You’re potentially thinking about footnote 4 in our post on control:
  
  As one operationalization, suppose that you had two teams at your AI lab, one of which was much more competent, and you had to assign one to be in charge of minimizing the probability that the AIs you’re about to train and deploy were scheming, and the other to be in charge of ensuring control. To address risk due to powerful AI in the next 8 years, we think it would be better to assign the more competent team to ensuring control, because we think that the quality of control interventions is more influential on this risk.
  
  This footnote is a bit confusingly worded, but I think the situation we were trying to describe is “Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We’re guessing you should put the better team on control.”
  
  This is different from research in the run-up.
  
  I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren’t the only classes of interventions, and some interventions don’t fit nicely into these buckets. E.g., you can do work on differentially making AIs more useful for alignment work, which isn’t well classified as either, and you can work on high-level interpretability which aims to roughly understand how AIs make decisions in some cases. This high-level interp doesn’t clearly help very directly with reducing the chance that scheming arises, but could help with a bunch of stuff.)