ryan_greenblatt answers Does reducing the amount of RL for a given capability level make AI safer?

ryan_greenblatt 6 May 2024 18:08 UTC
16 points
6
If you avoid using RL, then you might need a much “smarter” model for a given level of usefulness.

And even without RL, you need to be getting bits of selection from somewhere: to get useful behavior you have to at the very least specify what useful behavior would be (though the absolute minimum number of bits would be very small given a knowledgable model). (So some selection or steering is surely required, but you might hope this selection/steering is safer for some reason or perhaps more interpretable (like e.g. prompting can in principle be).)

Dramatically cutting down on RL might imply that you need a much, much smarter model overall. (For instance, the safety proposal discussed in “conditioning predictive models” seems to me like it would require a dramatically smarter model than would be required if you used RL normally (if this stuff worked at all).)

Given that a high fraction of the concern (IMO) is proportional to how smart your model is, needing a much smarter model seems very concerning.

Ok, so cutting RL can come with costs, what about the benefits to cutting RL? I think the main concern with RL is that it either teaches the model things that we didn’t actually need and which are dangerous or that it gives it dangerous habits/propensities. For instance, it might teach models to consider extremely creative strategies which humans would have never thought of and which humans don’t at all understand. It’s not clear we need this to do extremely useful things with AIs. Another concern is that some types of outcome-based RL will teach the AI to cleverly exploit our reward provisioning process which results in a bunch of problems.

But, there is a bunch of somewhat dangerous stuff that RL teaches which seems clearly needed for high usefulness. So, if we fix the level of usefulness, this stuff has to be taught to the model by something. For instance, being a competent agent that is at least somewhat aware of its own abilities is probably required. So, when thinking about cutting RL, I don’t think you should be thinking about cutting agentic capabilities as that is very likely required.

My guess is that much more of the action is not in “how much RL”, but is instead in “how much RL of the type that seems particular dangerous and which didn’t result in massive increases in usefulness”. (Which mirrors porby’s answer to some extent.)

In particular we’d like to avoid:
1. RL that will result in AIs learning to pursue clever strategies that humans don’t understand or at least wouldn’t think of. (Very inhuman strategies.) (See also porby’s answer which seems basically reasonable to me.)
2. RL on exploitable outcome-based feedback that results in the AI actually doing the exploitation a non-trivial fraction of the time.
(Weakly exploitable human feedback without the use of outcomes (e.g. the case where the human reviews the full trajectory and rates how good it seems overall) seems slightly concerning, but much less concerning overall. Weak exploitation could be things like sycophancy or knowing when to lie/deceive to get somewhat higher performance.)

Then the question is just how much of a usefulness tax it is to cut back on these types of RL, and then whether this usefulness tax is worth it given that it implies we have to have a smarter model overall to reach a fixed level of usefulness.

(Type (1) of RL from the above list is eventually required for AIs with general purpose qualitatively wildly superhuman capabilities (e.g. the ability to execute very powerful strategies that humans have a very hard time understanding) , but we can probably get done almost everything we want without such powerful models.)

My guess is that in the absence of safety concerns, society will do too much of these concerning types of RL, but might actually do too little of safer types of RL that help to elicit capabilities (because it is easier to just scale up the model further than to figure out how to maximally elicit capabilities).

(Note that my response ignores the cost of training “smarter” models and just focuses on hitting a given level of usefulness as this seems to be the requested analysis in the question.)
- Chris_Leong 7 May 2024 4:13 UTC
  2 points
  0
  Parent
  You mention that society may do too little of the safer types of RL. Can you clarify what you mean by this?
  - ryan_greenblatt 7 May 2024 17:33 UTC
    5 points
    0
    Parent
    In brief: large amounts of high quality process based RL might result in AI being more useful earlier (prior to them becoming much smarter). This might be expensive and annoying (e.g. it might require huge amounts of high quality human labor) such that by default labs do less of this relative to just scaling up models than would be optimal from a safety perspective.