[Question] Does reducing the amount of RL for a given capability level make AI safer?

Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model a) tries and b) finds to be successful. It doesn't matter whether that method involves deceiving us or increasing its own power.
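To make the concern concrete, here is a minimal policy-gradient (REINFORCE-style) sketch in PyTorch. The toy policy network, shapes, and reward values are my own illustrative assumptions, not anything specific to a real system; the point is only that the update weights each action's log-probability by the reward it earned, so any behavior that scored well gets reinforced, whatever that behavior actually was.

```python
# Minimal REINFORCE-style sketch (PyTorch). The policy network, shapes,
# and rewards are hypothetical and purely illustrative.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_step(observations, actions, rewards):
    """One policy-gradient step over a batch of (observation, action, reward).

    Any action that earned reward gets its log-probability pushed up,
    regardless of how the reward was obtained (honest task completion,
    deception, power-seeking, ...): the gradient only sees the scalar
    reward, not the mechanism behind it.
    """
    logits = policy(observations)                      # [T, num_actions]
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(chosen * rewards).mean()                  # reward-weighted log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```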

If this is the case, then when building a model at capability level X, it might make sense to train it either without RL or with as little RL as possible.

For example, we might attempt to achieve the objective using imitation learning instead. Some people would push back and argue that imitation learning is still a black box trained with gradient descent, with no way of knowing whether the internals are safe.
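For comparison, here is a rough behavioral-cloning sketch using the same hypothetical toy network as above: the model is trained by supervised learning to match demonstrator actions, with no reward signal anywhere, but the optimization is still opaque gradient descent.

```python
# Behavioral-cloning sketch (PyTorch), same hypothetical network as above.
# The model is trained only to match demonstrator actions; no reward is
# ever computed, but the training loop is still opaque gradient descent.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def clone_step(observations, demo_actions):
    """One imitation step: maximize the likelihood of the demonstrator's actions."""
    logits = policy(observations)
    loss = loss_fn(logits, demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```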

Is this likely to produce safer models, or is the risk mostly independent of RL?

Further thoughts: Does this actually matter?

Even if techniques that avoid or reduce RL would make AIs safer, this could backfire if it produces capability externalities. A system that reaches a given capability level without RL would most likely become even more capable once RL was added on top, and there will always be actors who would wish to do this.

I don't think this invalidates the question, though. Some of the various governance interventions may bear fruit, in which case we may have the option of accepting less powerful systems.
