I expect a lot more open releases this year and am committed to test their capabilities and safety guardrails rigorously.
Glad you’re planning on continual testing, that seems particularly important here, where the default is every once in awhile some new report comes out with a single data point about how good some model is and people slightly freak out. Having the context of testing numerous models over time seems crucial for actually understanding the situation and being able to predict upcoming trends. Hopefully you have and will continue to find ways to reduce the effort needed to run marginal experiments, e.g., having a few clearly defined tasks you repeatedly use, reusing finetuning datasets, etc.
Glad you’re planning on continual testing, that seems particularly important here, where the default is every once in awhile some new report comes out with a single data point about how good some model is and people slightly freak out. Having the context of testing numerous models over time seems crucial for actually understanding the situation and being able to predict upcoming trends. Hopefully you have and will continue to find ways to reduce the effort needed to run marginal experiments, e.g., having a few clearly defined tasks you repeatedly use, reusing finetuning datasets, etc.