Good questions. Any technical safety research that leads to a better understanding of state-of-the-art models carries the risk that, by understanding models better, we might also learn how to improve them. However, I think the safety benefit of understanding models outweighs the risk of small capability increases, particularly since any capability increase is likely heavily skewed towards model-specific interventions (e.g. “this specific model trained on this specific dataset exhibits bias x in domain y, and could be improved by retraining with more varied data from domain y”, rather than “the performance of all models of this kind could be improved with some intervention z”). I’m thinking about this a lot at the moment and would welcome further input.