Control doesn’t assume “no sharp left turns” and “the work you do on smaller systems will apply to the next ones you build”, though things along these lines do help with control.
(For the most obvious interventions, the work seems very likely to be applicable at multiple model scales in practice.)
This might depend on exactly what you mean by “sharp left turn”. People often use the term to point at several different, possibly correlated issues, but I think most of them can in principle be handled with control.
The two assumptions we need that could be broken by something like a “sharp left turn” are:
Some continuity in capabilities in a small local regime: We do need to assume that we can scale things up by a small amount which only makes the AI somewhat smarter, so that we can run evaluations safely. Alternatively, we need some other way of getting slightly continuous capabilities (e.g. by varying how long a given AI runs). However, the actual quantitative amount of continuity we need can be quite small, and we can adjust how we do this depending on our observations. See here for some discussion. (A toy sketch of this kind of incremental scale-up appears after this list.)
Sample efficient training for capability evaluations that is robust to intentional sandbagging: We need to be able to train our AIs to do tasks reasonably sample-efficiently, in a way that is robust to AIs wanting to sandbag on the task. If a sharp left turn implies “very low sample efficiency because capabilities now come from a non-retargetable process”, then this would cause problems. (A toy sketch of this kind of elicitation check also appears after this list.)
In principle, this is a relatively nice problem to work on, because you just want to make a number go up: you want the AI to do as well as it would if it were actually trying. We also have some flexibility here because capability evaluation is easier than elicitation (footnote 8).
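As a toy illustration of the first assumption, here is a minimal sketch of an incremental scale-up loop: increase scale by small steps, evaluate capabilities at each step, and shrink the step size whenever the measured jump is larger than what we trust our evaluations to handle. The function names (`train_at_scale`, `measure_capability`) and the numbers are made-up placeholders, not a real pipeline.

```python
# Toy sketch of the "local continuity" assumption: scale up in small increments,
# evaluate capabilities at each step, and take smaller steps whenever the measured
# capability jump exceeds what we trust our evaluations to handle.
# All functions and numbers here are made-up placeholders.

def train_at_scale(scale: float) -> str:
    # Hypothetical: train (or configure) a model at this compute / runtime scale.
    return f"model@{scale:.2f}"

def measure_capability(model: str, scale: float) -> float:
    # Hypothetical: trusted capability evaluation (dummy stand-in here).
    return scale ** 1.5

MAX_SAFE_JUMP = 0.5  # largest capability increase we trust our evals to handle

scale, step = 1.0, 0.2
prev_cap = measure_capability(train_at_scale(scale), scale)
while scale < 4.0:
    candidate = scale + step
    cap = measure_capability(train_at_scale(candidate), candidate)
    if cap - prev_cap > MAX_SAFE_JUMP:
        step /= 2  # capability jumped too much: back off and take smaller steps
        continue
    scale, prev_cap = candidate, cap
    print(f"scaled to {scale:.2f}, measured capability {cap:.2f}")
```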
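And for the second assumption, here is a minimal sketch of a sandbagging-robust capability check: fine-tune on a small set of task demonstrations and see whether performance reaches a reference level that a genuinely-trying model should hit. Again, `finetune_on`, `evaluate`, and the numbers are hypothetical placeholders; this is meant to show the shape of the problem (make a number go up with a limited sample budget), not a real elicitation pipeline.

```python
# Toy sketch of sandbagging-robust capability evaluation: fine-tune on a small
# number of demonstrations and check whether performance reaches a reference level.
# Supervised fine-tuning makes it hard for the model to keep sandbagging, because
# doing well on the task is directly reinforced. All functions are made-up placeholders.

from typing import List, Tuple

Demo = Tuple[str, str]  # (task input, reference solution)

def finetune_on(model: str, demos: List[Demo]) -> str:
    # Hypothetical: supervised fine-tuning on the demonstrations.
    return f"{model}+sft"

def evaluate(model: str) -> float:
    # Hypothetical: held-out success rate (dummy stand-in that improves with training).
    return min(0.9, 0.3 + 0.1 * model.count("sft"))

def capability_present(model: str, demos: List[Demo], target: float, batch: int = 8) -> bool:
    """Return True if a modest amount of training reaches the target success rate."""
    for start in range(0, len(demos), batch):
        model = finetune_on(model, demos[start:start + batch])
        if evaluate(model) >= target:
            return True  # the model can do the task when trained to try
    return False  # couldn't elicit the capability within this sample budget

demos = [(f"task {i}", f"solution {i}") for i in range(32)]
print(capability_present("base-model", demos, target=0.6))
```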