Thanks for those ‘in the wild’ examples; they’re informative about the effectiveness of prompting for elicitation in the cases I’m thinking about. However, I’m not sure we should expect such results to continue to hold, given the results here that larger models can learn to override semantic priors when given random or inverted few-shot labels.
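To make concrete what I mean by inverted few-shot labels, here’s a minimal sketch in the style of those flipped-label experiments, assuming a binary sentiment task (the task and examples are illustrative, not taken from the paper):

```python
# Build a few-shot prompt where the demonstration labels are flipped,
# so a model can only score well by overriding its semantic prior.

EXAMPLES = [
    ("I loved every minute of it.", "positive"),
    ("Total waste of time.", "negative"),
    ("An instant classic.", "positive"),
    ("The plot made no sense.", "negative"),
]

FLIP = {"positive": "negative", "negative": "positive"}

def build_prompt(examples, query, invert=False):
    """Assemble a few-shot prompt; with invert=True, shown labels are flipped."""
    lines = []
    for text, label in examples:
        shown = FLIP[label] if invert else label
        lines.append(f"Review: {text}\nSentiment: {shown}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

print(build_prompt(EXAMPLES, "Gripping from start to finish.", invert=True))
```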
Agreed that for your research question (whether worst-case SFT password-locking is fixable), the sort of fine-tuning experiment you’re conducting is the relevant thing to check.