Note that you can lock a model such that it remains locked even with few-shot prompting by just training on such examples (training a model to ignore the few-shot examples is easy).
Also, few-shot prompting sometimes suck at eliciting capabilities even in the wild. For instance, few-shot prompting sucks at eliciting sorting abilities from GPT-3, and few-shot prompting with random labels “”works”″, which means models can’t learn a counter-intuitive task on a dataset that hints at another task.
What I am currently running are experiments where you lock a model, and then measure if fine-tuning for many epochs on very few high-quality samples (e.g. 8) unlocks it. For this setup, it’s unclear how you’d lock if you don’t know in advance what the few high-quality samples will be. Preliminary results suggest that it’s very easy to unlock pwd-locked models with few samples.
Thanks those ‘in the wild’ examples, they’re informative for me on the effectiveness of prompting for elicitation in the cases I’m thinking about. However, I’m not clear on whether we should expect such results to continue to hold given the results here that larger models can learn to override semantic priors when provided random/inverted few-shot labels.
Agreed that for your research question (is worst-case SFT password-ing fixable) the sort of FT experiment you’re conducting is the relevant thing to check.
Haven’t run such experiments.
Note that you can lock a model such that it remains locked even with few-shot prompting by just training on such examples (training a model to ignore the few-shot examples is easy).
Also, few-shot prompting sometimes suck at eliciting capabilities even in the wild. For instance, few-shot prompting sucks at eliciting sorting abilities from GPT-3, and few-shot prompting with random labels “”works”″, which means models can’t learn a counter-intuitive task on a dataset that hints at another task.
What I am currently running are experiments where you lock a model, and then measure if fine-tuning for many epochs on very few high-quality samples (e.g. 8) unlocks it. For this setup, it’s unclear how you’d lock if you don’t know in advance what the few high-quality samples will be. Preliminary results suggest that it’s very easy to unlock pwd-locked models with few samples.
Thanks those ‘in the wild’ examples, they’re informative for me on the effectiveness of prompting for elicitation in the cases I’m thinking about. However, I’m not clear on whether we should expect such results to continue to hold given the results here that larger models can learn to override semantic priors when provided random/inverted few-shot labels.
Agreed that for your research question (is worst-case SFT password-ing fixable) the sort of FT experiment you’re conducting is the relevant thing to check.