Wouldn’t other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?
I don’t think people working on alignment research assistants are planning to just turn it on and leave the building; on average (weighted by money), they seem to be imagining doing things like “explain an experiment in natural language and have an AI help implement it rapidly.”
So I think both they and this post are describing the strategy of “building very generally useful AI, but the good guys will be using it first.” I hear you as saying you want a slightly different profile of generally-useful skills to be targeted.
I don’t think the plan is “turn it on and leave the building” either, but I still think the stated goal should not be automation.
I don’t quite agree with the framing “building very generally useful AI, but the good guys will be using it first”—the approach I am advocating is not to push general capabilities forward and then specifically apply those capabilities to safety research. That is more like the automation-centric approach I am arguing against.
Hmm, how do I put this...
I am mainly proposing more focused training of modern LLMs with feedback from safety researchers themselves, toward the goal of safety researchers getting utility out of these systems; this boosts capabilities for helping-with-safety-research specifically, in a targeted way, because that is what you are getting more+better training feedback on. (Furthermore, checking and maintaining this property would be an explicit goal of the project.)
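To make that first point a bit more concrete, here is a rough sketch of the kind of loop I have in mind, where the training signal comes from safety researchers using the system in their actual work. Everything in it (the names, the data format, the stub tuning call) is made up for illustration; it is not a description of any existing pipeline.

```python
# Minimal sketch (not anyone's actual pipeline): collect pairwise feedback from
# safety researchers on assistant outputs, then use it as targeted fine-tuning data.
# All names here (FeedbackItem, collect_feedback, finetune_on_preferences) are
# hypothetical placeholders, not a real API.

from dataclasses import dataclass
from typing import List

@dataclass
class FeedbackItem:
    prompt: str       # the researcher's request, e.g. "critique this proof sketch"
    chosen: str       # the response the researcher found useful
    rejected: str     # the response they found less useful
    annotator: str    # which safety researcher gave the judgment

def collect_feedback(session_logs: List[dict]) -> List[FeedbackItem]:
    """Turn logged researcher sessions into preference pairs.

    The point of the proposal: the feedback comes from safety researchers doing
    their actual work, so the resulting capability gains are concentrated on
    being useful *for that work*, rather than on general capability benchmarks.
    """
    items = []
    for log in session_logs:
        if log.get("preferred") and log.get("dispreferred"):
            items.append(FeedbackItem(
                prompt=log["prompt"],
                chosen=log["preferred"],
                rejected=log["dispreferred"],
                annotator=log["researcher_id"],
            ))
    return items

def finetune_on_preferences(base_model: str, data: List[FeedbackItem]) -> str:
    """Placeholder for whatever preference-tuning method is used (RLHF, DPO, ...).

    Deliberately a stub: the claim in the text is about *where the feedback
    comes from*, not about the particular tuning method.
    """
    print(f"Tuning {base_model} on {len(data)} researcher preference pairs")
    return base_model + "-safety-research-tuned"
```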
I am secondarily proposing better tools to aid in that feedback process. These can be applied to advance capabilities in any area, I agree, but I think this only somewhat exacerbates the existing “LLM moderation” problem: the general solution of “train LLMs to do good things and not bad things” does not seem to get significantly more problematic in the presence of better training tools (perhaps the general situation even gets better). If the project were successful for safety research, it could also be extended to other fields. The question of how to avoid LLMs being helpful for dangerous research would be similar to the LLM moderation question currently faced by Claude, ChatGPT, Bing, etc.: when do you want the system to provide helpful answers, and when do you want it to instead refuse to help?
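To illustrate the moderation framing: the question is where to put the refusal gate, and that question looks the same whether or not better training tools exist. A toy sketch, with made-up categories and a stand-in classifier:

```python
# Toy illustration of the "LLM moderation" question above: the same gate that
# decides whether to answer helpfully vs. refuse applies with or without better
# training tools. The categories and classifier here are made up.

REFUSAL_CATEGORIES = {"bioweapons", "malware", "weapons_engineering"}

def should_refuse(request_topic: str) -> bool:
    """Stand-in for whatever content classifier the deployed system uses."""
    return request_topic in REFUSAL_CATEGORIES

def respond(request_topic: str, draft_answer: str) -> str:
    if should_refuse(request_topic):
        return "I can't help with that."
    return draft_answer

# e.g. respond("interpretability", "...") returns the helpful draft,
# while respond("malware", "...") refuses.
```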
Thirdly, I am also mentioning approaches such as training LLMs to interact with proof assistants and intelligently decide when to translate user arguments into formal languages. This does seem like a more concerning general-capability thing, to which the remark “building very generally useful AI, but the good guys will be using it first” applies.
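As a very rough sketch of what that third item could look like in practice (all function names here are hypothetical placeholders, and no real proof-assistant API is being described):

```python
# Sketch of the third item: a loop where the model decides whether a user's
# informal argument is worth translating into a formal language and checking
# with a proof assistant. Every function here is a hypothetical stand-in.

def worth_formalizing(argument: str) -> bool:
    """The model's judgment call: is this argument precise enough that a formal
    translation would catch errors, or would formalization just add noise?"""
    # placeholder heuristic standing in for a learned decision
    return "therefore" in argument.lower() or "implies" in argument.lower()

def formalize(argument: str) -> str:
    """Placeholder for an LLM translation of the argument into, say, Lean syntax."""
    return f"theorem user_claim : True := by trivial  -- stand-in for: {argument[:40]}..."

def check_with_proof_assistant(formal_statement: str) -> bool:
    """Placeholder for invoking an external checker (Lean, Coq, Isabelle, ...)."""
    return True  # a real system would run the checker and parse its output

def respond_to_argument(argument: str) -> str:
    if not worth_formalizing(argument):
        return "Responding informally: " + argument
    formal = formalize(argument)
    if check_with_proof_assistant(formal):
        return "Your argument checks out formally:\n" + formal
    return "The formal version of your argument fails to check; here is where it breaks down."
```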