Thanks for the post.

“A rigorous and healthy ecosystem for auditing foundation models could alleviate substantial risks of open sourcing.”
The problem here is that fine-tuning easily strips any safety changes and easily adds all kinds of dangerous things (as long as capability is there).
The only form of auditing that might work is if a model can only be run from within a protected framework, which is doing quite a bit of auditing on the fly, before allowing an inference to go through...
I can see how this can be compatible with open-sourcing encrypted weights (which, by the way, might prevent the ability to fine-tune at all as well)...
It’s more difficult to imagine how this might work for weights represented by plain tensors (assuming that the model architecture is understood, and that it’s not too difficult to write an unprotected version of the fine-tuning and inference engines).
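To make the “protected framework” idea a bit more concrete, here is a minimal sketch of the shape I have in mind; every name in it (AuditPolicy, GuardedModel, run_model) is invented for illustration and does not refer to any real library or product:

```python
# Minimal sketch of a "protected framework": the model can only be reached
# through a gateway that audits every request before inference is allowed.
# All names here (AuditPolicy, GuardedModel, run_model) are invented for
# illustration; this is not a real library or product.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditPolicy:
    """Stand-in for the on-the-fly checks run before an inference goes through."""
    blocked_terms: List[str] = field(default_factory=lambda: ["disallowed-topic"])

    def allows(self, prompt: str) -> bool:
        # Real auditing would be much richer (classifiers, logging, escalation);
        # a keyword check only marks where that logic would sit.
        return not any(term in prompt.lower() for term in self.blocked_terms)


@dataclass
class GuardedModel:
    run_model: Callable[[str], str]  # the actual inference call, hidden behind the gateway
    policy: AuditPolicy

    def generate(self, prompt: str) -> str:
        if not self.policy.allows(prompt):
            return "[request refused by audit policy]"
        return self.run_model(prompt)


if __name__ == "__main__":
    # A dummy "model" stands in for real weights so the sketch runs end to end.
    guarded = GuardedModel(run_model=lambda p: f"model output for: {p}",
                           policy=AuditPolicy())
    print(guarded.generate("Summarize the auditing literature."))
    print(guarded.generate("Tell me about disallowed-topic."))
```

Of course, a gateway like this only helps if the weights cannot be lifted out and run outside it, which is exactly the plain-tensor problem above.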
Thank you for this comment! I think your point that “fine-tuning easily strips any safety changes and easily adds all kinds of dangerous things (as long as capability is there)” is spot on. It maps to my intuitions about the weaknesses of fine-tuning and is one of the strongest arguments that open-sourcing foundation models carries significant risks.
I appreciate your suggestions for other auditing methods that could possibly work, such as running a model only within a protected framework and open-sourcing encrypted weights. I think these offer something like risk mitigation for partial open-sourcing, but they would be less feasible for fully open-sourced models, where weights represented by plain tensors would be more likely to be available.
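To think through the encrypted-weights idea, I sketched what releasing encrypted tensors might look like. This is purely illustrative, assumes the third-party `cryptography` and `numpy` packages, and is not a description of any real release mechanism:

```python
# Toy illustration of "open-sourcing encrypted weights": the published artifact
# is an encrypted blob, so plain-tensor inference or fine-tuning is only possible
# for a runtime that holds the key. Purely illustrative; assumes the third-party
# `cryptography` and `numpy` packages, not any real release mechanism.
import io
from typing import Dict

import numpy as np
from cryptography.fernet import Fernet


def encrypt_weights(weights: Dict[str, np.ndarray], key: bytes) -> bytes:
    buf = io.BytesIO()
    np.savez(buf, **weights)              # serialize all tensors into one archive
    return Fernet(key).encrypt(buf.getvalue())


def decrypt_weights(blob: bytes, key: bytes) -> Dict[str, np.ndarray]:
    raw = Fernet(key).decrypt(blob)       # raises InvalidToken without the right key
    with np.load(io.BytesIO(raw)) as npz:
        return {name: npz[name] for name in npz.files}


if __name__ == "__main__":
    key = Fernet.generate_key()           # held by the protected runtime, never published
    weights = {"layer0.w": np.random.randn(4, 4), "layer0.b": np.zeros(4)}
    blob = encrypt_weights(weights, key)  # this blob is what would be released
    restored = decrypt_weights(blob, key)
    print(restored["layer0.w"].shape)     # (4, 4)
```

The obvious catch is that the key has to live somewhere, and anyone who can read the decrypted tensors out of the runtime's memory is back in the plain-tensor situation, which is why I see this as a partial mitigation rather than a solution for fully open-sourced models.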
Your comment is helpful and gave me some additional ideas to consider. Thanks!
One thing I would add is that the idea I had in mind for auditing was a broader process rather than a specific tool. The paper I mention to support this idea of a healthy ecosystem for auditing foundation models is “Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing.” There the authors describe an auditing process that would guide the decision of whether or not to release a specific model, along with the decision points, stakeholders, and review steps that might aid in making that decision. At the most abstract level, the process includes scoping, mapping, artifact collection, testing, reflection, and post-audit decisions about whether or not to release the model.
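Just to make that process concrete for myself, here is how those stages might be laid out as a simple release gate; the stage names and the gating rule below are my own shorthand, not code or terminology from the paper:

```python
# Toy checklist version of the end-to-end internal audit process described above
# (scoping, mapping, artifact collection, testing, reflection, and a post-audit
# release decision). Stage names and the gating rule are my own shorthand for
# the stages in the paper, not code or terminology from it.
from enum import Enum, auto
from typing import Set


class AuditStage(Enum):
    SCOPING = auto()
    MAPPING = auto()
    ARTIFACT_COLLECTION = auto()
    TESTING = auto()
    REFLECTION = auto()


def release_decision(completed: Set[AuditStage], concerns_resolved: bool) -> str:
    """Post-audit decision: only recommend release if every stage happened and
    the reflection stage left no unresolved concerns."""
    if completed != set(AuditStage):
        missing = [stage.name for stage in AuditStage if stage not in completed]
        return f"do not release yet; incomplete stages: {missing}"
    return "recommend release" if concerns_resolved else "do not release; unresolved concerns"


if __name__ == "__main__":
    partial = {AuditStage.SCOPING, AuditStage.MAPPING, AuditStage.ARTIFACT_COLLECTION}
    print(release_decision(partial, concerns_resolved=True))
    print(release_decision(set(AuditStage), concerns_resolved=True))
```

The point is less the code than the shape of the process: release is a post-audit decision that depends on every earlier stage actually having happened.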