Conditioning, Prompts, and Fine-Tuning
(Thanks to Evan Hubinger and Nicholas Schiefer for comments on these ideas.)
These are some notes on the relation between conditioning language models, prompting, and fine-tuning. The key takeaways are:
1. Prompting and fine-tuning can both be used to condition language models.
2. Prompting is quite restricted in the kinds of conditionals it can achieve.
3. Fine-tuning can implement arbitrary conditionals in principle, though not in practice.
4. In practice fine-tuning can still implement more kinds of conditionals than prompting.
5. We don’t understand how fine-tuning conditionals generalize, which seems dangerous.
Conditioning
We can think of a language model as specifying a probability distribution π(x), where x is a sequence of tokens of fixed length N (the length of the context window). We generate text by sampling sequences from π.
Sometimes we don’t want to just sample from a language model. Instead, we want to condition the model on some facts about the sequence x. We can write the conditioned distribution as
π(x|c(x))
where c(x) encodes some constraints on x. For instance c(x) might require that the first token is “Apple”, or that the 7th and 12th tokens are the same, etc.
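As a concrete illustration, here is a minimal sketch of exact conditioning on a toy model small enough to enumerate. The three-token vocabulary, the window length N = 3, and the weights in `toy_pi` are all made-up assumptions for illustration; nothing here scales to a real language model.

```python
import itertools

VOCAB = ["Apple", "pie", "is"]  # hypothetical toy vocabulary
N = 3                           # toy context-window length


def toy_pi():
    """Return a made-up distribution pi(x) over all length-N sequences."""
    # Unnormalized weights: sequences starting with "Apple" are twice as likely.
    weights = {
        x: (2.0 if x[0] == "Apple" else 1.0)
        for x in itertools.product(VOCAB, repeat=N)
    }
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def condition(pi, c):
    """Return pi(x | c(x)): drop sequences that violate c, then renormalize."""
    kept = {x: p for x, p in pi.items() if c(x)}
    z = sum(kept.values())
    if z == 0:
        raise ValueError("condition has zero probability under pi")
    return {x: p / z for x, p in kept.items()}


pi = toy_pi()
first_is_apple = condition(pi, lambda x: x[0] == "Apple")
first_two_equal = condition(pi, lambda x: x[0] == x[1])
```

Both conditionals are just "filter and renormalize"; what differs between conditionals in practice is how expensive it is to find sequences that pass the filter.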
Some conditions are easy, some are hard
It’s easy to sample from a language model conditioned on the first two tokens being the same, but not all conditionals are so straightforward. Suppose we condition on the sequence x beginning with the factorization of a large composite number. There exist valid sequences unambiguously satisfying the conditional, but sampling them is hard if we don’t know the factorization ahead of time. So there are limits to the kinds of conditionals we can apply in practice.
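One generic way to sample from a conditional is rejection sampling: draw from π until the condition holds. The sketch below uses a hypothetical uniform toy model over digit tokens (an assumption, not a real language model) to show why rare conditions are expensive. The expected number of draws is 1/p, where p is the probability of the condition under π. The "hard" condition here is only a small stand-in for the factorization example, which is far rarer.

```python
import random

VOCAB_SIZE = 10  # tokens are the integers 0..9 (toy assumption)
N = 6            # toy context-window length


def sample_sequence(rng):
    """Draw one sequence from a hypothetical uniform toy model pi."""
    return tuple(rng.randrange(VOCAB_SIZE) for _ in range(N))


def rejection_sample(c, rng, max_tries=1_000_000):
    """Sample from pi(x | c(x)) by drawing from pi until c(x) holds."""
    for tries in range(1, max_tries + 1):
        x = sample_sequence(rng)
        if c(x):
            return x, tries
    raise RuntimeError("gave up: condition too rare under pi")


def easy(x):
    """First two tokens are the same: probability 1/10 under uniform pi."""
    return x[0] == x[1]


def hard(x):
    """First four tokens match a fixed pattern: probability 1/10**4."""
    return x[:4] == (3, 1, 4, 1)


rng = random.Random(0)
x_easy, tries_easy = rejection_sample(easy, rng)
x_hard, tries_hard = rejection_sample(hard, rng)
```

For the easy condition there is also a direct shortcut (sample the first token, then copy it). No comparable shortcut is available for the factorization condition unless we already know the factors.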
Prompting
A prompt is a very restricted kind of conditional where the condition is that certain tokens in x are known in advance. For instance, we might specify that the first four words are “Mary had a little”, or that the last three words are “happily ever after.”
Prompts are nice in a few ways:
It’s easy to sample from a language model given an arbitrary prompt.
We sort of understand what prompts do. A prompt asks the model to predict the output of a text-generation process given that it knows the values of the fixed tokens.
The downside with prompting is that there are lots of conditionals we can’t turn into prompts. For instance:
Sample text from the model that humans will rate as having positive sentiment.
Sample text from the model that never involves violence.
Sample text from the model that contains a valid chess game.
None of these can be expressed in terms of fixed tokens in the context window.
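To make the distinction concrete, a prompt can be written as a map from positions to required tokens, which is a sketch of the "fixed tokens" form described above. The helper names and the `no_adjacent_repeat` condition are hypothetical; the latter is a deliberately simple stand-in for the richer examples in the list, chosen because it depends on a relation between tokens rather than on any fixed token values.

```python
def prompt_condition(fixed):
    """Build c(x) for a prompt; `fixed` maps position -> required token.

    Negative positions count from the end of the sequence.
    """
    def c(x):
        return all(x[i] == token for i, token in fixed.items())
    return c


# Prompts from the text: a fixed prefix and a fixed suffix.
prefix = prompt_condition({0: "Mary", 1: "had", 2: "a", 3: "little"})
suffix = prompt_condition({-3: "happily", -2: "ever", -1: "after."})


def no_adjacent_repeat(x):
    """A condition that is NOT a prompt: no token appears twice in a row.

    No single choice of fixed positions and values picks out exactly
    the sequences satisfying it.
    """
    return all(a != b for a, b in zip(x, x[1:]))


story = ("Mary", "had", "a", "little", "lamb", "happily", "ever", "after.")
```

Here `prefix(story)`, `suffix(story)`, and `no_adjacent_repeat(story)` all hold, but only the first two can be imposed by writing tokens into the context window.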
Fine-Tuning
Instead of prompting, we can fine-tune a model, either with an explicit reward function or with Reinforcement Learning from Human Feedback (RLHF). We start with a pre-trained model, then fine-tune it to maximize either an explicit or a learned reward.
Subject to actually converging to the optimum distribution, fine-tuning with a KL penalty is a form of variational Bayesian inference. The result is a variational approximation of the Bayesian update on human feedback using the pre-trained model as a prior. That is, we obtain a new model which produces the probability distribution
π′(x)∝π(x)L(x)
where the likelihood is L(x) = exp(r(x)/β), β is the KL penalty weight, and r(x) is the reward for sequence x. A more formal discussion was given by Korbak, Perez & Buckley.
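The claim can be checked numerically on a toy problem. The sketch below assumes the fine-tuning objective is E_q[r(x)] − β·KL(q‖π), the standard KL-penalized form, and uses a made-up four-outcome distribution in place of a language model. It compares the closed-form distribution π′(x) ∝ π(x)·exp(r(x)/β) against randomly drawn alternatives.

```python
import math
import random


def kl(q, p):
    """KL(q || p) for two distributions given as lists over the same outcomes."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def objective(q, prior, r, beta):
    """KL-penalized objective: E_q[r] - beta * KL(q || prior)."""
    expected_reward = sum(qi * ri for qi, ri in zip(q, r))
    return expected_reward - beta * kl(q, prior)


def tilted(prior, r, beta):
    """Closed form: pi'(x) proportional to prior(x) * exp(r(x) / beta)."""
    weights = [p * math.exp(ri / beta) for p, ri in zip(prior, r)]
    z = sum(weights)
    return [w / z for w in weights]


prior = [0.5, 0.3, 0.15, 0.05]  # made-up stand-in for pi(x)
r = [0.0, 1.0, 2.0, 3.0]        # made-up rewards
beta = 0.7

best = tilted(prior, r, beta)
best_value = objective(best, prior, r, beta)

# Compare against 1000 random distributions; none should beat the closed form.
rng = random.Random(0)
smallest_gap = math.inf
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in prior]
    q = [v / sum(raw) for v in raw]
    smallest_gap = min(smallest_gap, best_value - objective(q, prior, r, beta))
```

The reason this works is that the objective can be rewritten as β·log Z − β·KL(q‖π′), where Z = Σ π(x)·exp(r(x)/β), so it is maximized exactly when q = π′.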
Fine-tuning can approximate prompts
Fine-tuning can approximate any conditional a prompt can achieve. To see this, note that every prompt consists of setting tokens at some positions i ∈ S to values y_i, where the indices in S form a subset of the context window. A prompt in this form is approximated by fine-tuning on the reward function
r(x) ≡ λ ∑_{i∈S} δ_{x_i, y_i}
where δ_{x_i, y_i} = 1 if x_i = y_i and is zero otherwise. In the limit of large λ, fine-tuning on this reward function amounts to providing enormous evidence in favor of the desired token values, which is equivalent to conditioning with a prompt that directly fixes those tokens.
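A small numerical sketch of this limit, under toy assumptions: a two-token vocabulary, window length 3, a made-up non-uniform π, and fine-tuning that converges exactly to π(x)·exp(r(x)/β). The total variation distance between the fine-tuned distribution and the prompt-conditioned one shrinks as λ grows.

```python
import itertools
import math

VOCAB = ["a", "b"]
N = 3
SEQS = list(itertools.product(VOCAB, repeat=N))

# Made-up non-uniform pi: sequences with more "a" tokens are more likely.
_raw = {x: 1.0 + x.count("a") for x in SEQS}
_z = sum(_raw.values())
PI = {x: w / _z for x, w in _raw.items()}

FIXED = {0: "b", 2: "a"}  # the prompt: positions in S and their values y_i


def reward(x, lam):
    """r(x) = lam * (number of prompt positions where x_i equals y_i)."""
    return lam * sum(1 for i, y in FIXED.items() if x[i] == y)


def fine_tuned(lam, beta=1.0):
    """Idealized converged fine-tuning: pi(x) * exp(r(x) / beta), normalized."""
    w = {x: p * math.exp(reward(x, lam) / beta) for x, p in PI.items()}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}


def prompted():
    """pi conditioned on every fixed position matching the prompt."""
    kept = {x: p for x, p in PI.items()
            if all(x[i] == y for i, y in FIXED.items())}
    z = sum(kept.values())
    return {x: kept.get(x, 0.0) / z for x in PI}


def total_variation(p, q):
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


target = prompted()
gaps = [total_variation(fine_tuned(lam), target) for lam in (0.0, 2.0, 5.0, 20.0)]
```

For any finite λ the fine-tuned distribution still puts a little mass on sequences that violate the prompt, which is why the match is only approximate.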
Fine-tuning can approximate any conditional
With appropriate choices of the reward r(x) we can achieve any shift in the probability distribution that doesn’t expand the support of π(x), and so in principle fine-tuning can approximate any conditional.
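One way to see this: for any target distribution q whose support lies inside that of π, the reward r(x) = β·log(q(x)/π(x)) makes π(x)·exp(r(x)/β) equal to q(x). The sketch below checks this on made-up three-outcome distributions, again assuming fine-tuning converges exactly; the function names are hypothetical.

```python
import math


def reward_for_target(prior, target, beta):
    """Return r(x) = beta * log(target(x) / prior(x)), or -inf where target is 0."""
    rewards = []
    for p, t in zip(prior, target):
        if t > 0 and p == 0:
            # No finite reward can create mass where the prior has none.
            raise ValueError("target expands the support of the prior")
        rewards.append(beta * math.log(t / p) if t > 0 else -math.inf)
    return rewards


def tilted(prior, r, beta):
    """Idealized fine-tuning result: prior(x) * exp(r(x) / beta), normalized."""
    weights = [p * math.exp(ri / beta) for p, ri in zip(prior, r)]
    z = sum(weights)
    return [w / z for w in weights]


prior = [0.5, 0.3, 0.2]     # made-up stand-in for pi(x)
target = [0.0, 0.25, 0.75]  # desired distribution, support inside the prior's
beta = 0.5

r = reward_for_target(prior, target, beta)
recovered = tilted(prior, r, beta)
```

A hard conditional c(x) is the special case where the target is π restricted to sequences satisfying c and renormalized, so the reward is constant on satisfying sequences and −∞ (in practice, very negative) elsewhere.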
Some conditions are easy, some are hard
In practice some conditionals are hard to achieve because they require an unrealistically large number of samples for fine-tuning to converge to the full Bayesian update. For instance, it is hard to fine-tune on the reward corresponding to “the sequence x begins with a factorization of a large composite number” because it takes many tries to find an x satisfying the conditional.
Still, there are many kinds of conditionals that fine-tuning can access in practice. For instance, RLHF can condition on positive human sentiment rating, or on not containing malicious plans.
More generally, fine-tuning seems to be good for conditioning on properties that are:
Easy to identify/evaluate.
Not too rare under the initial distribution π(x) (or some pre-conditioned version of this, e.g. via prompts).
Generalization Concerns
Because fine-tuning with a KL penalty implements Bayesian updates, every reward function describes a conditional of the form “condition on the following sequences being more/less likely according to their reward”. Unfortunately we may not understand at a deeper level what this conditional means.
In particular, it is not obvious how this conditional generalizes. Consider RLHF with a sentiment reward. There are multiple ways a model could interpret the implied conditional:
1. Positive-sentiment text is more likely, so humans are kinder in the world than the pre-training distribution suggested.
2. Positive-sentiment text is more likely, so there are legal restrictions on the kinds of text that are recorded.
These two interpretations generalize very differently. For instance (1) could increase the probability of text describing humans helping each other while (2) could decrease that probability by implying a world with little social trust.
This sort of generalization ambiguity seems really dangerous, because we could end up with very different behavior from what we intended in specifying the reward function or providing feedback.
Summary
My key takeaways are:
1. Prompting and fine-tuning can both be used to condition language models.
2. Prompting is quite restricted in the kinds of conditionals it can achieve.
3. Fine-tuning can implement arbitrary conditionals in principle, though not in practice.
4. In practice fine-tuning can still implement more kinds of conditionals than prompting.
5. We don’t understand how fine-tuning conditionals generalize, which seems dangerous.
(1-4) suggest that we will need some sort of fine-tuning/RLHF to achieve the kinds of complex conditionals that are useful in practice/for alignment schemes. If so, (5) says we should try to figure out more about how fine-tuning conditionals generalize, because that’s where a lot of the danger lies.