Yeah, I was thinking about using SAD. The main issue is that for non-AGI-lab-sized models, you’ll have a tough time eliciting SA. However, we could potentially focus on precursor capabilities and such.
If you are concerned about capabilities like SA, the natural objection is: "it seems highly unlikely that you can figure out which data points impact SA the most, because it will likely be a mix of many things, with each data point playing a small role in accumulating toward SA." My guess is that you can break SA down into enough precursor capabilities that this approach can still be highly predictive, even if it isn't 100%.
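To make the precursor idea concrete, here's a toy sketch (all capability names and numbers are made up, and a real version would need far more checkpoints and evals): score each checkpoint on a few hypothetical precursor capabilities, then fit a simple linear predictor for the overall SA score and check how much variance it explains.

```python
import numpy as np

# Hypothetical eval results: rows = model checkpoints, columns = made-up
# precursor capability scores (e.g. self-knowledge, instruction-source
# tracking, test-vs-deploy discrimination).
precursor_scores = np.array([
    [0.10, 0.05, 0.20],
    [0.30, 0.25, 0.35],
    [0.55, 0.50, 0.60],
    [0.80, 0.75, 0.85],
])
# Overall SA score for the same checkpoints (e.g. from a SAD-style eval).
sa_scores = np.array([0.08, 0.28, 0.54, 0.79])

# Fit a linear predictor: SA ~ precursors @ w + b (bias via appended ones).
X = np.hstack([precursor_scores, np.ones((len(sa_scores), 1))])
w, *_ = np.linalg.lstsq(X, sa_scores, rcond=None)

# R^2 tells us how predictive the precursor breakdown is for overall SA.
predicted = X @ w
ss_res = np.sum((sa_scores - predicted) ** 2)
ss_tot = np.sum((sa_scores - sa_scores.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

If the fit holds up across checkpoints, you can track (and attribute data to) the cheaper precursor evals instead of trying to elicit full SA directly.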
I think forcing them to retrieve in-context sounds good, but labs may not want this (not sure): eventually they'll want to train these things into the model, as with many CoT behaviors.
Agreed on having a validation set for reducing the alignment tax.