Jacob Pfau comments on Password-locked models: a stress case for capabilities evaluation

Jacob Pfau 31 Jan 2024 5:54 UTC
1 point
0
Do you have results on the effectiveness of few-shot prompting (using correct, hard queries in prompt) for password-tuned models? Particularly interested in scaling w.r.t. number of shots. This would get at the extent to which few-shot capabilities elicitation can fail in pre-train-only models which is a question I’m interested in.
- Fabien Roger 31 Jan 2024 17:19 UTC
  5 points
  0
  Parent
  Haven’t run such experiments.
  Note that you can lock a model such that it remains locked even with few-shot prompting by just training on such examples (training a model to ignore the few-shot examples is easy).
  Also, few-shot prompting sometimes suck at eliciting capabilities even in the wild. For instance, few-shot prompting sucks at eliciting sorting abilities from GPT-3, and few-shot prompting with random labels “”works”″, which means models can’t learn a counter-intuitive task on a dataset that hints at another task.
  What I am currently running are experiments where you lock a model, and then measure if fine-tuning for many epochs on very few high-quality samples (e.g. 8) unlocks it. For this setup, it’s unclear how you’d lock if you don’t know in advance what the few high-quality samples will be. Preliminary results suggest that it’s very easy to unlock pwd-locked models with few samples.
  - Jacob Pfau 1 Feb 2024 19:48 UTC
    1 point
    0
    Parent
    Thanks those ‘in the wild’ examples, they’re informative for me on the effectiveness of prompting for elicitation in the cases I’m thinking about. However, I’m not clear on whether we should expect such results to continue to hold given the results here that larger models can learn to override semantic priors when provided random/inverted few-shot labels.
    
    Agreed that for your research question (is worst-case SFT password-ing fixable) the sort of FT experiment you’re conducting is the relevant thing to check.