paulfchristiano comments on Thoughts on sharing information about language model capabilities

paulfchristiano 14 Aug 2023 16:51 UTC
LW: 19 AF: 7
By process-based RL, I mean: the reward for an action doesn’t depend on the consequences of executing that action. Instead it depends on some overseer’s evaluation of the action, potentially after reading justification or a debate about it or talking with other AI assistants or whatever. I think this has roughly the same risk profile as imitation learning, while potentially being more competitive.
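A minimal sketch of the distinction above, in toy Python. All names here (`outcome_based_reward`, `process_based_reward`, `toy_environment`, `toy_overseer`) and the numeric scores are illustrative assumptions, not part of any real training setup described in the comment. The only point is which inputs each reward function is allowed to see.

```python
from typing import Callable


def outcome_based_reward(action: str, execute: Callable[[str], float]) -> float:
    """Reward is computed from the consequences of executing the action."""
    return execute(action)


def process_based_reward(
    action: str,
    justification: str,
    overseer: Callable[[str, str], float],
) -> float:
    """Reward is the overseer's evaluation of the proposed action.

    The action is never executed here; the overseer only reads the action
    and whatever justification or debate accompanies it.
    """
    return overseer(action, justification)


def toy_environment(action: str) -> float:
    # Hypothetical stand-in for real-world consequences: gaming the metric
    # happens to produce a high measured outcome.
    return 1.0 if "game the metric" in action else 0.5


def toy_overseer(action: str, justification: str) -> float:
    # Hypothetical stand-in for a human or AI-assisted evaluator who rejects
    # plans they recognize as gaming the metric and prefers justified plans.
    if "game the metric" in action:
        return 0.0
    return 0.8 if justification else 0.3
```

In this toy setup, a plan that games the metric scores well under the outcome-based reward but poorly under the process-based one, because the overseer judges the plan itself rather than its measured result. Whether a real overseer would reliably catch such plans is exactly the kind of question under discussion and is not settled by the sketch.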
I’m generally excited and optimistic about coordination. If you are just saying that AI non-proliferation isn’t that much harder than nuclear non-proliferation, then I think I’m with you. But I think (i) it’s totally fair to call that “strong global coordination,” (ii) you would probably have to do a somewhat better job than we did with nuclear non-proliferation.
I think the technical question is usually going to be about how to trade off capability against risk. If you didn’t care about that at all, you could just not build scary ML systems. I’m saying that you should build smaller models with process-based RL.
It might be good to focus on legible or easy-to-enforce lines rather than just trading off capability vs risk optimally. But I don’t think that “no RL” is effective as a line—it still leaves you with a lot of reward-hacking (e.g. by planning against an ML model, or predicting what actions lead to a high reward, or expert iteration...). Trying to avoid all these things requires really tightly monitoring every use of AI, rather than just training runs. And I’m not convinced it helps significantly with deceptive alignment.
So in any event it seems like you are going to care about model size. “No big models” is also a way easier line to enforce. This is pretty much like saying “minimize the amount of black-box end-to-end optimization you do,” which feels like it gets closer to the heart of the issue.
If you are taking that approach, I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models (and will ultimately want to use outcomes in relatively safe ways). Yes it would be safer to use neither process-based RL nor big models, and just make your AI weaker. But the main purpose of technical work is to reduce how demanding the policy ask is—how much people are being asked to give up, how unstable the equilibrium is, how much powerful AI we can tolerate in order to help enforce or demonstrate necessity. Otherwise we wouldn’t be talking about these compromises at all—we’d just be pausing AI development now until safety is better understood.
I would quickly change my tune on this if e.g. we got some indication that process-based RL increased rather than decreased the risk of deceptive alignment at a fixed level of capability.
- michaelcohen 17 Aug 2023 20:59 UTC
  LW: 13 AF: 6
  “I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.”
  I agree with this in a sense, although I may be quite a bit more harsh about what counts as “executing an action”. For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as “executing the action” in the overseer-conversation environment, even if the action looks like it’s for some other environment, like a plan to launch a new product in the market. I do think myopia in this environment would suffice for existential safety, but I don’t know how much myopia we need.
  If you’re always talking about myopic/process-based RLAIF when you say RLAIF, then I think what you’re saying is defensible. I speculate that not everyone reading this recognizes that your usage of RLAIF implies RLAIF with a level of myopia that matches current instances of RLAIF, and that that is a load-bearing part of your position.
  I say “defensible” instead of fully agreeing because I weakly disagree that increasing compute is a more dangerous way to improve performance than changing the objective to a new myopic one. That is, I disagree with this:
  “I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models”
  You suggest that increasing compute is the last thing we should do if we’re looking for performance improvements, as opposed to adding a very myopic approval-seeking objective. I don’t see it. I think changing the objective from imitation learning is more likely to lead to problems than scaling up the imitation learners. But this is probably beside the point, because I don’t think problems are particularly likely in either case.