CoT provides little in the way of safety guarantees at the relevant scales
Even if faithfulness drops at some model scale for a given task, that doesn't mean we'll be using models at that scale (e.g., for cost reasons, or because models at that scale might not exist yet). The results on the addition task show that there are some task difficulties for which even the largest models we tested don't start to show lower faithfulness, and people will keep pushing the difficulty of the tasks they use models on as models get better. So it seems likely to me that, no matter the model scale, people will be using models on some tasks where they'll get faithful reasoning (e.g., tasks near the edge of that model's abilities).
It seems likely that almost everyone will just continue using the model with the best performance
If you're an aligned actor using a model in a high-stakes setting, it's valuable to be able to trade off some performance for safety. For example, you might care more about safety properties than raw capabilities if you're an alignment researcher at an AGI lab trying to use AIs to make progress on the alignment problem.