@bhauth @Rohin Shah I think bhauth has an important point here about the danger of large gaps between judge and debaters, and similarly between a trusted overseer and a smarter worker. Keeping those gaps small is really important for a lot of oversight plans to work out well!
Here’s some research I am doing which I think answers this point thoroughly: it is possible to smoothly, continuously, incrementally scale down the capabilities of a model by injecting carefully controlled amounts of noise into its activations. I’m calling this ‘noise injection impairment’.
This removes the need to train a whole series of models with precise capability steps between each one. You can instead train a single strong model and scale it all the way down to just a tiny step above the next-strongest model. Then you create as many intermediate capability steps as you need by reducing the noise magnitude.
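A minimal sketch of the idea, assuming a toy numpy network standing in for a trained model (all weights, sizes, and noise scales here are illustrative, not the project's actual code): adding Gaussian noise to the hidden activations gives a single continuous knob that smoothly degrades the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained model: a fixed two-layer network.
# Weights and sizes are illustrative only.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, noise_scale=0.0, noise_rng=None):
    """Forward pass with noise injected into the hidden activations.

    `noise_scale` is the knob: 0.0 leaves the model at full strength,
    and larger values impair it smoothly and continuously.
    """
    h = np.tanh(x @ W1)
    if noise_scale > 0.0:
        gen = noise_rng if noise_rng is not None else np.random.default_rng()
        h = h + gen.normal(scale=noise_scale, size=h.shape)
    return h @ W2

x = rng.normal(size=(3, 8))
clean = forward(x)

# Sweeping noise_scale yields a continuum of intermediate-capability
# "models" from one trained model; drift from the clean output grows
# smoothly with the injected noise.
drifts = [
    np.abs(forward(x, noise_scale=s, noise_rng=np.random.default_rng(1)) - clean).mean()
    for s in (0.1, 0.5, 1.0)
]
```

In a real transformer the same effect could be had by hooking intermediate layers rather than a toy hidden state; the point is only that capability becomes a continuous function of one scalar.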
Without this technique, I believe bhauth’s point would stand, and capability gaps between model versions would lead to dangerous failures of various control and monitoring schemes.
Link to details of ongoing research: https://www.apartresearch.com/project/sandbag-detection-through-model-degradation

I think the basic idea of using more steps of smaller size is worth considering. Maybe it reduces overall drift, but I suspect it doesn’t, because my view is:
Models have many basins of attraction for sub-elements. As model capability increases continuously, there are nearly-discrete points where aspects of the model jump from one basin to another, perhaps with cascading effects. I expect this to produce large gaps from small changes to models.
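The basin-jump intuition above can be illustrated with a toy double-well landscape (purely an analogy I am adding here, with `a` standing in for a continuously varied capability knob): the global minimum moves discontinuously even though the parameter changes smoothly.

```python
import numpy as np

# Toy analogue of a basin switch: a tilted double-well potential
# V(x) = x^4/4 - x^2/2 - a*x. The tilt `a` is the continuous knob.
xs = np.linspace(-2.0, 2.0, 4001)

def global_min(a):
    """Location of the global minimum of the tilted double well."""
    V = xs**4 / 4 - xs**2 / 2 - a * xs
    return xs[np.argmin(V)]

# An arbitrarily small continuous change in `a` across zero makes the
# global minimum jump discretely from one basin to the other.
left = global_min(-0.01)   # sits in the left basin, near x = -1
right = global_min(+0.01)  # sits in the right basin, near x = +1
```

If sub-elements of a model behave anything like this, many tiny steps in the knob would still include a few steps with large discrete effects.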