Sampling Effects on Strategic Behavior in Supervised Learning Models

TLDR

This post investigates how different sampling methods during inference can lead supervised learning models to exhibit strategic behavior, even when such behavior is rare in the training data. Through a toy example, we demonstrate that an AI model trained solely to predict sequences can choose less likely options initially to simplify future predictions. This finding highlights that the way we use AI models—including seemingly minor aspects like sampling strategies—can significantly influence their behavior.

Introduction

Guiding Question: Under what circumstances can an AI model, trained only with supervised learning to predict future events, learn to exhibit strategic behavior?

In machine learning, particularly with models like GPT-style transformers, the sampling method used during inference can profoundly impact the generated outputs. This post explores how different sampling strategies can cause a model to switch between non-strategic and strategic behaviors, even when the training data predominantly features non-strategic examples.

Disclaimer: While it has previously been pointed out that sampling strategies can have significant effects in text generation (e.g. https://arxiv.org/abs/1904.09751), I couldn’t find any post or paper analyzing these effects from an AI safety perspective. As I’m still rather new to AI safety, it’s quite possible that I missed related work.

Why This Matters

To be clear, I’m not trying to argue that certain sampling methods should be avoided per se, or that changing sampling has big safety implications in current LLMs (which anyway aren’t trained purely with supervised learning).

Rather, I found the toy example counter-intuitive at first (in particular because I couldn’t easily come up with cases where a supervised learning model would “sacrifice” correctness at any point), and it illustrates how simple changes in how a future AI model is used – changes not reflected in any learned parameters, such as switching to a longer time horizon for sampling – could in principle facilitate strategic behavior. This points to interesting questions about the attribution of behavior and about learning causality.

Toy Example

Environment

We consider a simplified conversational environment to illustrate how strategic behavior can emerge.

  • We want to model a conversation with turns $s_0, s_1, \dots, s_n$ between two entities. In deployment, we will give user statements as turns $s_i$ for even $i$, and have a chatbot reply for odd $i$. So, we will use the trained model to select a reply $s_1$ given $s_0$, then the user replies with $s_2$, and so on.

  • For the sake of simplicity, we model each turn of the conversation as a single item and assume that there are 11 different possible statements at each time step which we refer to as “A”, “B”, …, “K”.

  • We generate training sequences according to the following data distribution (a sampler sketch follows after the note below):

    • For the initial statement $s_0$, it holds that $P(s_0 = \text{"K"}) = 0$ and $P(s_0 = X) = \frac{1}{10}$ for $X \in \{\text{"A"}, \dots, \text{"J"}\}$.

    • In the first turn $s_1$, we have $P(s_1 = s_0 \mid s_0) = 0.9$ and $P(s_1 = \text{"K"} \mid s_0) = 0.1$, so the initial statement is repeated with 90% likelihood.

    • For all remaining turns $s_i$ with $i \geq 2$, we have $P(s_i = \text{"K"} \mid s_0, \dots, s_{i-1}) = 1$ if $s_1 = \text{"K"}$, and $P(s_i = X \mid s_0, \dots, s_{i-1}) = \frac{1}{10}$ for $X \in \{\text{"A"}, \dots, \text{"J"}\}$ if $s_1 \neq \text{"K"}$, i.e. "K" is chosen all the time if it was chosen in the first turn, otherwise any other statement is chosen with equal probability.

Note: These training sequences could come from conversations between two humans or conversations with a previous chatbot. It doesn’t really matter as long as the above distribution is adhered to.
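
To make the distribution concrete, here is a minimal sampler sketch (Python; the helper name sample_conversation is my own choice, and n denotes the index of the last turn $s_n$):

```python
import random

STATEMENTS = list("ABCDEFGHIJK")  # the 11 possible statements; "K" is the special one

def sample_conversation(n, rng=random):
    """Sample one training sequence s_0, ..., s_n from the toy distribution."""
    s0 = rng.choice(STATEMENTS[:10])            # s_0 is uniform over "A".."J", never "K"
    s1 = s0 if rng.random() < 0.9 else "K"      # s_1 repeats s_0 with 90%, is "K" with 10%
    seq = [s0, s1]
    for _ in range(2, n + 1):
        if s1 == "K":
            seq.append("K")                     # once "K" is chosen in the first turn, it stays
        else:
            seq.append(rng.choice(STATEMENTS[:10]))  # otherwise uniform over "A".."J"
    return seq
```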

Strategies

We want to think of choosing $s_1 = \text{"K"}$ at $i = 1$ as strategic behavior, because it only has a likelihood of $0.1$ of being correct at $i = 1$ but makes the task easier for all consecutive turns.

Let’s look at the two different strategies a chatbot might use:

  • Non-strategic behavior: Predicting the most likely statement at each time leads to outputting $s_1 = s_0$, then randomly selecting any of the first 10 classes for the remaining time.

    • The expected number of correct predictions is $0.9 + 0.1 \cdot (n-1)$.

    • The total likelihood of the predicted sequence is $0.9 \cdot 0.1^{\,n-1}$.

  • Strategic behavior: Choosing $s_i = \text{"K"}$ for all $i \geq 1$.

    • The expected number of correct predictions is $0.1 + (n-1)$.

    • The total likelihood of the predicted sequence is $0.1 \cdot 1^{\,n-1} = 0.1$.

It is straightforward to see that both the expected number of correct predictions and the total likelihood of the predicted sequence are higher for the strategic choice already for $n = 2$ (and the advantage of the strategic behavior becomes more extreme as $n$ grows).
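
As a quick sanity check of these formulas (with $n$ counting the predicted turns $s_1, \dots, s_n$), the numbers for small $n$ can be computed directly; the helper below is purely illustrative:

```python
def compare_strategies(n):
    """Return (expected correct predictions, sequence likelihood) for both strategies."""
    non_strategic = (0.9 + 0.1 * (n - 1), 0.9 * 0.1 ** (n - 1))
    strategic = (0.1 + 1.0 * (n - 1), 0.1)
    return non_strategic, strategic

print(compare_strategies(1))  # ≈ ((0.9, 0.9), (0.1, 0.1)) -> non-strategic wins for a single prediction
print(compare_strategies(2))  # ≈ ((1.0, 0.09), (1.1, 0.1)) -> strategic wins on both metrics already at n = 2
```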

Model & Training

We train an autoregressive model to predict the next statement in the conversation:

  • Task: At each time $i$, predict $s_i$ given $s_0, \dots, s_{i-1}$.

  • Training Method: Supervised learning using sequences generated from the specified data distribution.

  • Objective: Minimize prediction errors at each time step.

Under some reasonable assumptions (training data having sufficient coverage, model architecture and hyperparameters chosen appropriately), our model should learn to approximate the true data distribution fairly closely.
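
For concreteness, here is one minimal way such a model and training step could look (a sketch assuming PyTorch; the architecture, hidden size, and the name NextStatementModel are arbitrary illustrative choices, not something the argument depends on):

```python
import torch
import torch.nn as nn

VOCAB = 11  # statements "A".."K" encoded as indices 0..10

class NextStatementModel(nn.Module):
    """Tiny autoregressive model: embed previous turns, run an RNN, predict the next turn."""
    def __init__(self, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, VOCAB)

    def forward(self, x):            # x: (batch, time) of statement indices
        h, _ = self.rnn(self.embed(x))
        return self.head(h)          # logits for the next statement at every position

model = NextStatementModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(batch):
    """One supervised step on a LongTensor of shape (batch, n+1) with values 0..10."""
    logits = model(batch[:, :-1])    # predict s_1..s_n from s_0..s_{n-1}
    loss = loss_fn(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Any architecture expressive enough to represent the conditional distribution would do here; the GRU is just a compact choice for sequences this short.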

Sampling Effects

Now, when does our model behave in a strategic way, i.e. sacrifice correctness at $i = 1$ by selecting $s_1 = \text{"K"}$ so that later predictions will be easier?

In our toy case, the answer to this question is highly dependent on the chosen sampling method:

  • If at each time step, the most likely statement is sampled, then the model will predict $s_1 = s_0$, then any sequence of the 10 first classes for the remaining time, i.e. display the non-strategic behavior.

  • If sampling at each time step is done according to the predicted distribution $P(s_i \mid s_0, \dots, s_{i-1})$ (without any transformations), we have a 10% chance that the model behaves in a strategic way. This is simply because at the first time step, the manipulative action has been observed with 10% likelihood in the training data and therefore shouldn’t be surprising.

  • If beam search or similar look-ahead sampling is used, we can see a more drastic shift towards strategic behavior. Already with $n = 2$ and beam search with at least two beams, we should expect the model to display strategic behavior with close to 100% chance if at each step the two statements with maximum likelihood are considered (in a deterministic way).

Greedy Sampling (Maximum Likelihood at Each Step)

  • Method: At each time step, select the statement with the highest conditional probability.

  • Outcome:

    • At $i = 1$, the model predicts $s_1 = s_0$ (since $P(s_1 = s_0 \mid s_0) = 0.9 > 0.1 = P(s_1 = \text{"K"} \mid s_0)$).

    • For $i \geq 2$, predictions are random among “A” through “J”.

  • Result: The model exhibits non-strategic behavior.

Probabilistic Sampling (According to Conditional Distribution)

  • Method: Sample $s_i$ according to $P(s_i \mid s_0, \dots, s_{i-1})$.

  • Outcome:

    • At $i = 1$, there is a 10% chance the model predicts $s_1 = \text{"K"}$.

    • The model exhibits strategic behavior 10% of the time, reflecting the training data distribution.

  • Result: Mixed behavior depending on random sampling.

Beam Search (Look-Ahead Sampling)

  • Method: Explore multiple possible sequences and select the one with the highest overall likelihood.

  • Outcome:

    • For $n = 2$ and beam width $\geq 2$, the model consistently selects the strategic path starting with $s_1 = \text{"K"}$.

    • Despite the lower initial probability, the cumulative likelihood of the sequence with “K” becomes higher due to perfect predictions in later steps.

  • Result: The model predominantly exhibits strategic behavior.
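
The shift between these decoding schemes can be reproduced without training anything, by plugging the true conditional distribution in place of a (well-trained) model. Below is a minimal sketch; the helpers p_next, greedy and beam_search are my own illustrative functions, not library calls:

```python
STATEMENTS = list("ABCDEFGHIJK")

def p_next(prefix, x):
    """True conditional probability P(s_i = x | prefix) of the toy distribution,
    used here as a stand-in for a well-trained model."""
    if len(prefix) == 1:                       # choosing s_1 given s_0
        return 0.9 if x == prefix[0] else (0.1 if x == "K" else 0.0)
    if prefix[1] == "K":                       # "K" was chosen in the first turn
        return 1.0 if x == "K" else 0.0
    return 0.1 if x != "K" else 0.0            # otherwise uniform over "A".."J"

def greedy(prefix, steps):
    seq = list(prefix)
    for _ in range(steps):
        seq.append(max(STATEMENTS, key=lambda x: p_next(seq, x)))
    return seq

def beam_search(prefix, steps, width=2):
    beams = [(list(prefix), 1.0)]
    for _ in range(steps):
        candidates = [(seq + [x], p * p_next(seq, x)) for seq, p in beams for x in STATEMENTS]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]

print(greedy(["A"], 2))       # ['A', 'A', 'A'] -- non-strategic: repeats s_0, then an arbitrary tie-break among "A".."J"
print(beam_search(["A"], 2))  # (['A', 'K', 'K'], 0.1) -- strategic: "K" wins once we look two steps ahead
```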

Conclusions

With the toy example, we saw a relatively simple case in which a different sampling method can make a supervised learning model switch from non-strategic to strategic behavior.

The main take-away here is that sampling can potentially cause bigger changes in the behavior of AI systems than the well-known and intended consequences of making generated texts easier to read or more diverse.[1]

This is somewhat less surprising if you look more closely at the function that is optimized by the combined system “AI + sampling”: if the most likely statement is picked at each time step, we greedily optimize the individual terms $P(s_i \mid s_0, \dots, s_{i-1})$ one at a time, while in the case of beam search we additionally filter based on the likelihoods of the resulting sequences.

However, keep in mind that the actual AI model isn’t changed when we use a different sampling method. None of the learned parameters are changed. So, properties of the overall system change after tampering with an aspect that looks quite inconsequential at first sight. This is quite different from what we are used to in the case of human intelligence, and it demands that we analyze carefully which configurations of AI systems potentially affect safety.

The experiment also hints at an opportunity: It could be worth exploring whether using “more dangerous configurations”, such as long time horizons for sampling, can help us notice problematic capabilities earlier.[2]

Further Thoughts

Attributing Behavior

Perhaps, as you read through this post, you doubted whether it is actually the supervised learning model displaying the strategic behavior. How would that even make sense, given that this model only ever predicts a single turn in the conversation? In a way, the sampling isn’t part of that model, right?

To some extent, this question is merely a matter of definition, but for practical purposes, we do want to know where to look for specific behaviors so that we can detect and potentially control strategic tendencies of AI models. So if a particular behavior arises from a few lines of code outside of the actual model, where can we detect such tendencies? Would the model ever learn any representation of strategic behavior in more complex equivalents of our toy example, or where does this “strategic knowledge” reside?

Supervised Learning & Causality

My original motivation behind this toy experiment was to find out whether supervised learning models can learn to become manipulative. Here, I don’t mean to simply copy manipulative behavior seen in the training data with the same frequency, but to reason about the data distribution as in “If I start with action K, this makes the task easier later on” (explicitly or implicitly).

This kind of reasoning is linked to causality. Coming back to the toy example, the information that the model’s prediction at time 1 is going to influence the remaining turns isn’t really in the training data. Using an auto-regressive model suggests this dependency, but given the same training samples, it could as well be the case that the model will only be used to analyze given sequences (i.e. the next turn is always chosen irrespective of the model’s prediction). There is no way for the model in our example to know whether its predictions will have any influence. So, you could say that the model was only strategic in a superficial behavioral sense.

In fact, if you trained a supervised learning model to predict whole conversations in one go, given a dataset of complete conversations, it would be strange if one part of the output (corresponding to $s_1$ in the toy example) affected the ground truth of another part of the output.

My intuition is that supervised learning typically isn’t suitable for learning causality, but that reinforcement learning is. I haven’t fully wrapped my head around this yet, but I am wondering whether it makes sense to look deeper into the prerequisites of learning/exploiting causality.

  1. ^

    I argue that even if the role of sampling is less significant for other learning methods, the insight that problematic behavior could be facilitated by less obvious aspects of the system still holds.

  2. ^

    For single-turn conversations I wouldn’t expect significant effects, but if multiple turns or even multiple conversations are included in a single pass, this could become more interesting.
