Are there alternatives to limiting the public distribution/release of model weights?
Would it be possible to use other technologies and/or techniques (e.g. federated learning, smart contracts, cryptography) so that interpretability researchers could still access the weights, while designing for the presence of bad actors?
I understand that the benefits of limiting release outweigh the costs, but I still think leaving the capacity to perform safety research to centralized private orgs could result in less robust capabilities and/or alignment methods. In addition, I don't think releases of model weights are likely to stop, given the economic pressures (I need to research this more).
I’m curious to hear about creative, subversive alternatives/theories that do not require limiting the distribution.
Ok, so in the Overview we cite Yang et al. While their work is similar, they take a somewhat different stance and support open releases *if*:
“1. Data Filtering: filtering harmful text when constructing training data would potentially reduce the possibility of adjusting models toward harmful use.
2. Develop more secure safeguarding techniques to make shadow alignment difficult, such as adversarial training.
3. Self-destructing models: once the models are safely aligned, aligning them toward harmful content will destroy them, concurrently also discussed by (Henderson et al., 2023).” (Yang et al.)
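To make point (1) concrete for myself, here is a minimal sketch of what filtering harmful text out of a training corpus could look like. The keyword check is just a placeholder standing in for a real safety classifier or moderation model, and the names (`is_harmful`, `filter_corpus`) are mine, not from the paper:

```python
# Minimal sketch of mitigation (1) from the Yang et al. quote: drop harmful text from
# the training corpus before it is ever used for (fine-)tuning.

from typing import Iterable, List

# Illustrative placeholder list; a real pipeline would use a trained safety classifier.
HARMFUL_KEYWORDS = {"weapon assembly instructions", "step-by-step synthesis of"}

def is_harmful(document: str) -> bool:
    """Placeholder harm check: simple keyword matching."""
    text = document.lower()
    return any(keyword in text for keyword in HARMFUL_KEYWORDS)

def filter_corpus(documents: Iterable[str]) -> List[str]:
    """Keep only documents not flagged as harmful."""
    return [doc for doc in documents if not is_harmful(doc)]

corpus = [
    "A recipe blog post about sourdough bread.",
    "A forum thread containing weapon assembly instructions.",
]
clean_corpus = filter_corpus(corpus)  # only the benign document survives
```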
I also looked into Henderson et al., but I am not sure it is exactly what we would be looking for. They propose models that can't be adapted for other tasks and have a proof of concept for a small BERT-style transformer, but I can't evaluate whether this would work with our models.
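To check my own understanding of that direction, here is a toy, first-order sketch of the general idea: train a model to stay good at a desired task while actively staying bad at a harmful proxy task. This is definitely not Henderson et al.'s actual algorithm (as I understand it, they meta-learn through simulated fine-tuning steps, which I omit); the tiny model, random data, and loss weighting below are all placeholders:

```python
# Toy sketch of the "self-destructing / non-fine-tunable model" direction.
# NOT the method from Henderson et al. (2023); this first-order proxy only minimizes
# desired-task loss while pushing harmful-proxy-task loss up.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # stand-in for a transformer
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def batch(seed: int):
    """Placeholder data loader: random tensors standing in for real task batches."""
    g = torch.Generator().manual_seed(seed)
    return torch.randn(8, 16, generator=g), torch.randint(0, 2, (8,), generator=g)

for step in range(200):
    x_desired, y_desired = batch(seed=0)   # "desired" task data (placeholder)
    x_harmful, y_harmful = batch(seed=1)   # "harmful" proxy task data (placeholder)

    opt.zero_grad()
    desired_loss = loss_fn(model(x_desired), y_desired)
    harmful_loss = loss_fn(model(x_harmful), y_harmful)

    # Minimize desired-task loss, maximize harmful-task loss (clamped so the ascent
    # term cannot dominate). The real approach would additionally penalize how quickly
    # an adversary could recover harmful-task performance by fine-tuning.
    objective = desired_loss - 0.1 * torch.clamp(harmful_loss, max=5.0)
    objective.backward()
    opt.step()
```

Even in this toy form the tension is visible: the harmful-loss ascent term has to be weighted and clamped so it doesn't wreck the desired task, and it says nothing about whether the property survives fine-tuning at the scale of our models.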