My view on practical safety cases over the next few years is that many of the safety cases that can actually be made rely on being able to argue, from the AI's training data and the generalization properties of that data, that the AI is unlikely to be a schemer or to sabotage evals.
The one piece of good news I can offer on safety cases is that, at least for some time, we can assume the biggest influence on what an AI values, and importantly on what goals it is likely to have, will be its data, since this is how things have worked for many AI methods beyond LLMs.
What do you think of the arguments in this post that it’s possible to make safety cases that don’t rely on the model being unlikely to be a schemer?
I’m somewhat optimistic that this can happen, conditional on considerable effort being invested.
As always, this agenda needs more work, and over time we will learn more about which control techniques work in practice and which don’t.