I think your point that “The problem here is that fine-tuning easily strips any safety changes and easily adds all kinds of dangerous things (as long as capability is there).” is spot on and maps to my intuitions about the weaknesses of fine-tuning and one of strongest points in favor of the significant risks to open-sourcing foundation models.
I appreciate your suggestions for other methods of auditing that could possibly work such as a model being run within a protected framework and open-sourcing encrypted weights. I think these allow for something like risk mitigations for partial open-sourcing but would be less feasible for fully open sourced models where weights represented by plain tensors would be more likely to be available
Your comment is helpful and gave me some additional ideas to consider. Thanks!
One thing I would add is that the idea I had in mind for auditing was more of a broader process than a specific tool. The paper I mention to support this idea of a healthy ecosystem for auditing foundation models is “Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing.” Here the authors point to an auditing process that would guide a decision of whether or not to release a specific model and the types of decision points, stakeholders, and review process that might aid in making this decision. At the most abstract level the process includes scoping, mapping, artifact collection, testing, reflection, and post-audit decisions of whether or not to release the model.
Thank you for this comment!
I think your point that “The problem here is that fine-tuning easily strips any safety changes and easily adds all kinds of dangerous things (as long as capability is there).” is spot on and maps to my intuitions about the weaknesses of fine-tuning and one of strongest points in favor of the significant risks to open-sourcing foundation models.
I appreciate your suggestions for other methods of auditing that could possibly work such as a model being run within a protected framework and open-sourcing encrypted weights. I think these allow for something like risk mitigations for partial open-sourcing but would be less feasible for fully open sourced models where weights represented by plain tensors would be more likely to be available
Your comment is helpful and gave me some additional ideas to consider. Thanks!
One thing I would add is that the idea I had in mind for auditing was more of a broader process than a specific tool. The paper I mention to support this idea of a healthy ecosystem for auditing foundation models is “Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing.” Here the authors point to an auditing process that would guide a decision of whether or not to release a specific model and the types of decision points, stakeholders, and review process that might aid in making this decision. At the most abstract level the process includes scoping, mapping, artifact collection, testing, reflection, and post-audit decisions of whether or not to release the model.