In this post by @Roger Dearnaley, A “Bitter Lesson” Approach to Aligning AGI and ASI, Roger proposes training an AI on a synthetic dataset in which all intelligences are motivated by the collective well-being of humanity. The aim is to bias the model to start as close to the basin of attraction for alignment as possible. In-Run Data Shapley could be used to construct such a dataset and guide the training process so that the training data best exemplifies the desired aligned behaviour.
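(As a rough illustration of what this could look like mechanically: the first-order approximation to In-Run Data Shapley accumulates, at each optimisation step, the learning-rate-scaled dot product between each training example's gradient and the gradient of a held-out loss. Below is a minimal sketch of that idea, assuming PyTorch, a mean-reduction loss, and an illustrative `val_batch` of examples exemplifying the desired aligned behaviour; none of these names come from the post, and real implementations use far more efficient per-example gradient machinery.)

```python
# Sketch: first-order In-Run Data Shapley scoring during a single training run.
# All names (train_loader yielding ids, val_batch, etc.) are illustrative.
import torch

def flat_grad(loss, params):
    """Gradient of `loss` w.r.t. `params`, flattened into one vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def train_and_score(model, train_loader, val_batch, loss_fn, lr, num_epochs=1):
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr)
    scores = {}  # example id -> accumulated first-order contribution

    val_x, val_y = val_batch
    for _ in range(num_epochs):
        for example_ids, xs, ys in train_loader:
            # Gradient of the "aligned behaviour" validation loss at the current parameters.
            val_grad = flat_grad(loss_fn(model(val_x), val_y), params)

            # First-order contribution of each example this step:
            # (lr / batch_size) * <val_grad, example_grad>, i.e. roughly how much
            # this example's share of the update reduced the validation loss.
            batch_size = len(example_ids)
            for i, ex_id in enumerate(example_ids):
                ex_grad = flat_grad(loss_fn(model(xs[i:i + 1]), ys[i:i + 1]), params)
                contrib = (lr / batch_size) * torch.dot(val_grad, ex_grad).item()
                scores[int(ex_id)] = scores.get(int(ex_id), 0.0) + contrib

            # Ordinary optimisation step on the full batch.
            optimizer.zero_grad()
            loss_fn(model(xs), ys).backward()
            optimizer.step()

    # Higher score = the example pushed the model further toward the aligned-behaviour data.
    return scores
```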
I love this idea! Thanks for suggesting it. (It is, of course, not a Bitter Lesson approach, but it may well still be a great idea.)
Another area where being able to do this efficiently at scale is going to be really important is once models start showing dangerous levels of capability on WMD-relevant chemical/biological/radiological/nuclear (CBRN) and self-replication skills. The best way to deal with this is to make sure these skills aren't in the model at all, so that it can't be fine-tuned back to them (a prerequisite for a model at this level before one could even discuss open-sourcing it, rather than that being flagrantly crazy and arguably already illegal), and that means omitting the key knowledge from the training set entirely. That inevitably isn't going to succeed on the first pass, but applying this technique to the first pass gives us a way to find (hopefully) everything we need to remove from the training set, so we can do a second training run with specific, focused, narrow gaps in its capabilities.
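As a rough sketch of how that second pass could be set up, assuming per-example contribution scores accumulated against a CBRN-capability proxy eval (in the style of the scoring loop above): flag the examples that most helped the dangerous capability and drop them from the corpus before retraining. The function names and thresholding scheme here are purely illustrative.

```python
# Sketch: carving flagged examples out of the corpus before a second training run.
# `scores` maps example id -> accumulated contribution toward the CBRN proxy eval's
# loss going down, as produced by something like the scoring loop above.

def select_examples_to_remove(scores, threshold=0.0, top_k=None):
    """Return ids of training examples that most helped the dangerous capability."""
    helpful = [(ex_id, s) for ex_id, s in scores.items() if s > threshold]
    helpful.sort(key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        helpful = helpful[:top_k]
    return {ex_id for ex_id, _ in helpful}

def filtered_corpus(corpus, to_remove):
    """Yield (id, example) pairs minus the flagged ones, for the second training run."""
    for ex_id, example in corpus:
        if ex_id not in to_remove:
            yield ex_id, example
```

The second run would then be re-scored against the same proxy evals, since (as noted above) the first filtering pass is unlikely to catch everything.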
And yes, I’m interested in work in this area (my AI day-job allowing).