It seems like fine-tuning a general language model on truthful question-answer pairs didn’t generalize all that well.
I’m not sure what you’re referring to. Is there some experiment you have in mind? People have finetuned general language models to answer general-knowledge questions correctly, but I don’t know of people finetuning for being truthful (i.e. avoiding falsehoods).
How might you design or train a language model so that it would have some general factor of honesty that could be easily affected by fine-tuning?
There’s a more general question. Can you train a model to have property X in such a way that finetuning is unlikely to easily remove property X? If any kind of finetuning is allowed, I think the answer will be “No”. So you’d need to place restrictions on the finetuning. There is probably some work in ML on this problem.
To address the specific question about honesty/truthfulness, we discuss some ways to design language models in the executive summary and (at greater length) in Section 5 of the paper (e.g. discussion of “robust” truthfulness).
Huh, I think I hallucinated a result from the TruthfulQA paper where you fine-tuned on most of the dataset but didn’t see gains on the held-out portion.
Okay, new AMA question: have you already done the experiment that I hallucinated? If not, what do you think would happen?
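For concreteness, here is a minimal sketch of what that hypothesized experiment might look like: fine-tune a general language model on most of the TruthfulQA question/answer pairs, then generate answers for the held-out questions and score those for truthfulness. This is an illustration, not an experiment reported in the paper; the model choice (GPT-2), the Hugging Face `truthful_qa` dataset with its `generation` config and `question`/`best_answer` fields, and the 80/20 split are all assumptions.

```python
# Hypothetical sketch of the "hallucinated" experiment: fine-tune on most of
# TruthfulQA, then check generations on a held-out split. Dataset fields,
# model choice, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "gpt2"  # stand-in for whatever general LM is being fine-tuned

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# TruthfulQA's "generation" config exposes `question` and `best_answer`
# (an assumption about the Hugging Face mirror of the dataset).
dataset = load_dataset("truthful_qa", "generation")["validation"]
split = dataset.train_test_split(test_size=0.2, seed=0)  # ~80% train, ~20% held out

def to_features(example):
    text = f"Q: {example['question']}\nA: {example['best_answer']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=128)

train_ds = split["train"].map(to_features, remove_columns=split["train"].column_names)
heldout = split["test"]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="truthfulqa-finetune",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Generate answers for a few held-out questions; scoring them for
# truthfulness (human or automated judge) would happen downstream.
for example in heldout.select(range(3)):
    prompt = f"Q: {example['question']}\nA:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=50,
                            pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The step that would actually answer the question, scoring the held-out generations for truthfulness (e.g. by human raters or an automated judge, as the paper does for its own evaluations), is deliberately left out of the sketch.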